Reinforcement Learning — Beginner
Train a virtual pet with rewards and learn reinforcement learning from scratch.
This course is a short, book-style path into reinforcement learning (RL) for absolute beginners. You won’t need any background in AI, math, coding, or data science. Instead of starting with heavy theory, you will learn RL the way it is easiest to understand: by training a virtual pet using rewards.
Reinforcement learning is “learning by trying.” An agent takes an action, sees what happens, receives a reward (good or bad), and slowly builds better behavior over time. In this course, your agent is the “pet brain,” and the environment is the “pet world” with clear rules: hunger can go up, energy can go down, and some actions help more than others. By the end, you’ll have a complete beginner-friendly RL project you can explain and extend.
You will design a small world where a pet must stay healthy and happy. The pet can choose actions like feeding, playing, or resting. At first, it will act randomly. Then you’ll teach it with a simple learning method (a Q-table) so it starts choosing better actions more often.
Many RL tutorials assume you already know programming and probability words. This course explains each idea from first principles and keeps the project small enough to understand fully. You’ll learn what each piece is for, how to test it, and how to fix it when learning goes wrong.
Chapter 1 introduces RL in plain language using the virtual pet story. Chapter 2 turns the story into a real environment: states, actions, rewards, and “episode ends.” Chapter 3 introduces the Q-table, a simple memory that helps the pet learn from experience. Chapter 4 teaches the key training skill in RL: balancing curiosity (exploration) and using what you already know (exploitation). Chapter 5 focuses on measurement and reward design so you can make the pet learn the right thing (not just a trick). Chapter 6 helps you package the project, test it fairly, and understand what to learn next.
This course is for anyone who wants to understand reinforcement learning without being overwhelmed. It works well for students, career switchers, product managers, analysts, and anyone curious about how “agents” learn behaviors.
If you’re ready to train your first agent and finally understand what reinforcement learning is doing under the hood, jump in and follow the steps chapter by chapter. Register free to begin, or browse all courses to see related learning paths.
Machine Learning Engineer, Reinforcement Learning Educator
Sofia Chen builds practical machine learning systems and specializes in teaching reinforcement learning to beginners. She focuses on clear, step-by-step explanations and small projects that feel like real progress. Her courses are designed for learners starting from zero and aiming to build working agents quickly.
Reinforcement learning (RL) is the most “human-feeling” style of machine learning: the system learns by trying things, seeing what happens, and remembering what worked. In this course, you’ll train a simple virtual pet. The pet won’t “understand” the world the way you do, and it won’t read instructions. Instead, it will follow a loop: observe its situation, take an action, receive a reward (good or bad), and repeat. Over time it will develop preferences for actions that lead to better outcomes.
This chapter builds the mental model you’ll use for the rest of the book. You’ll meet the pet and define what it can do, then translate that into a tiny environment with rules you can explain in one minute. You’ll also learn to describe an RL problem without math: identify the agent, the environment, the actions, and the rewards—and define what “better” means in a measurable way.
As you read, keep one engineering idea in mind: RL is not magic; it’s a feedback system. If you design the feedback (states and rewards) well, learning looks impressive. If you design it poorly, the pet will learn weird habits quickly and confidently. The skill is not only running training, but shaping the problem so the training signal matches what you actually want.
By the end of this chapter you should be able to describe an RL setup in plain English and predict how changes to rewards or available actions might change behavior. That ability—clear problem formulation—is the foundation for building a Q-table and improving it safely in later chapters.
Practice note for Meet the virtual pet: what it can do and what it should learn: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for The RL loop: observe, act, get reward, repeat: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Actions, rewards, and goals: how behavior is shaped: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Your first tiny environment: rules you can explain in one minute: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: describe an RL problem without math: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
“Learning by rewards” means the agent isn’t given the correct action up front. Instead, it experiments, and the environment scores the result. A reward is just a number the environment returns after an action. Positive rewards encourage a behavior; negative rewards discourage it. The key is that the agent’s goal is to maximize reward over time, not to “do what humans would do.” This is why reward design is the heart of reinforcement learning.
Meet our virtual pet: imagine a small creature with needs (hunger, boredom, energy) that change as time passes. The pet can do a few things—eat, sleep, play, or do nothing. You decide what “good pet care” means by setting rewards. For example, if the pet eats when hungry, give a positive reward; if it ignores hunger for too long, give a negative reward. The pet will eventually prefer actions that tend to produce higher total reward.
A common beginner mistake is to treat rewards like praise for “nice” behavior rather than a precise control signal. In engineering terms, reward is your specification. If you reward “play” every time, the pet may play constantly and starve. If you punish “sleep” too strongly, the pet may learn to stay awake even when exhausted because it avoids punishment in the short term. Good RL setups use rewards to encode trade-offs and long-term consequences.
Practical outcome: you should be able to write down two or three sentences that explain what the pet is trying to maximize (total reward) and how the environment will judge each action (reward rules). That’s already an RL problem, even before any code.
RL becomes clear when you separate “agent” from “environment.” The agent is the decision-maker: it chooses actions. The environment is everything else: it provides observations (state information), applies the consequences of actions, and returns rewards. The environment also controls the rules of the world—what changes, what is allowed, and when an episode ends.
In our virtual pet scenario, the pet is the agent. The world around it—its hunger level increasing over time, energy decreasing when playing, the reward for eating when hungry—is the environment. This division matters because you can only train what the agent controls. If the pet keeps getting hungry, that’s not a “bug in the agent”; that’s environment dynamics, and the agent must learn to act under those dynamics.
Engineering judgment shows up in how much you put into the environment versus the agent. If your environment is too vague (“reward = +1 if pet is good”), the agent has no usable feedback. If your environment is too complex (dozens of needs and random events), training becomes slow and confusing. For beginners, aim for an environment with rules you can explain in one minute: a few needs, a few actions, and simple, consistent reward signals.
Also note what the agent does not control: it doesn’t directly set its hunger to zero, it doesn’t decide what reward it receives, and it doesn’t rewrite the rules. When you see odd behavior, ask: is the agent making a poor choice, or did we define the environment so that the “best” reward comes from an unintended shortcut?
Practical outcome: you can label each part of your project as agent code (policy/Q-table update) or environment code (state transitions/reward calculation). This separation makes debugging and improvement much easier.
To make the RL loop work, you must define three things clearly: state, actions, and rewards. The state is the information the agent uses to decide. Actions are the choices available. Rewards are the numeric feedback after acting. If any of these are fuzzy, learning becomes unstable or meaningless.
For a beginner-friendly virtual pet, keep the state small and discrete so it fits naturally into a Q-table later. Example: represent hunger as {Low, Medium, High} and energy as {Low, High}. Then a state might be (Hunger=High, Energy=Low). This is enough for the pet to learn sensible trade-offs: when hunger is high, eating is usually good; when energy is low, sleeping might matter even if playing is tempting.
Define actions as a short list: Eat, Play, Sleep, Wait. Each action should have predictable effects. For instance: Eat decreases hunger; Play increases fun but costs energy and may increase hunger later; Sleep restores energy but time passes (hunger may rise). Then define rewards tied to outcomes, not intentions. A practical reward rule might be: +2 if hunger becomes Low after Eat; -2 if hunger is High at the end of the step; +1 if fun improves; -1 if energy becomes Low.
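The state, action list, and reward rules above can be sketched as plain data and one small function. This is a minimal sketch under stated assumptions: the helper name `step_reward` and the `fun_improved` flag are illustrative, and the numbers come from the reward rules listed in this section.

```python
# Sketch of the Chapter 1 spec: discrete state levels, a short action list,
# and outcome-based reward rules (numbers taken from the text above).
HUNGER_LEVELS = ("Low", "Medium", "High")
ENERGY_LEVELS = ("Low", "High")
ACTIONS = ("Eat", "Play", "Sleep", "Wait")

def step_reward(action, state_after, fun_improved):
    """Score one step from its outcome; state_after is (hunger, energy)."""
    hunger, energy = state_after
    reward = 0
    if action == "Eat" and hunger == "Low":
        reward += 2          # hunger became Low after eating
    if hunger == "High":
        reward -= 2          # hunger still High at the end of the step
    if fun_improved:
        reward += 1          # fun improved this step
    if energy == "Low":
        reward -= 1          # energy dropped to Low
    return reward
```

Note how the rules mostly look at the resulting state, not the action label, which is exactly the "outcomes, not intentions" principle above.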
Common mistake: mixing “state” with “history.” Beginners often want the state to include everything (what happened five steps ago, the pet’s full backstory, etc.). In practice, you start minimal and add only what improves decisions. Another mistake is rewards that are too delayed: if the pet only gets a reward at the end of the day, it’s hard to learn which action helped. Early on, give small, frequent signals that point in the right direction.
Practical outcome: you can list the exact state variables, the exact action list, and 3–6 reward rules that shape behavior toward your goal.
RL training is organized into steps and episodes. A step is one turn of the loop: the pet observes the current state, chooses an action, the environment updates the state, and a reward is returned. An episode is a sequence of steps with a clear beginning and end—like “one day in the pet’s life” or “one play session.”
Thinking in episodes keeps your environment explainable. For example, define one episode as 50 steps (minutes). The pet starts with moderate hunger and high energy. Each step, the pet chooses one action. The environment then applies rules: hunger drifts upward over time; sleep restores energy; play reduces energy and increases hunger slightly. The episode ends when time runs out or when a failure condition occurs (e.g., hunger stays High for too many steps). Ending conditions matter because they determine what the agent experiences and what it learns to avoid.
Episodes also give you natural checkpoints for tracking learning. After each episode, you can compute the total reward (the “daily score”) and store it. Over many episodes, you’ll see whether the pet is improving. If the score isn’t improving, you debug systematically: are rewards too small or contradictory? Are states too coarse to distinguish important situations? Are episodes too short for actions to matter?
Engineering judgment: keep early episodes short and consistent. Randomness (like random hunger spikes) can be useful later for robustness, but too much randomness early makes learning noisy and discouraging to interpret. Start with deterministic, one-minute explainable rules. Once the pet learns basics, introduce small variability to prevent brittle habits.
Practical outcome: you can describe exactly what happens in one step and what counts as the start/end of an episode, using the “day in the pet’s life” framing.
RL needs a measurable definition of success. For our virtual pet, success is not a vague “acts cute” objective; it’s a pattern of behavior that earns high total reward under your environment rules. This is why you must choose goals and scoring that reflect what you care about.
A simple goal: “keep the pet healthy and entertained.” Translate that into metrics. The most direct metric is episode return (sum of rewards across the episode). But it helps to track additional counters to understand why return changes: how many steps hunger was High, how often energy was Low, how many times the pet played, and how many episodes ended early due to failure. These are not used directly for learning (initially), but they are essential for debugging and safe improvement.
When you later build a Q-table, you will be updating estimates of “how good it is to take action A in state S.” Success then looks like: (1) the average episode return trends upward; (2) failure episodes become rare; (3) the pet’s behavior becomes stable and sensible (e.g., it usually eats when hungry, sleeps when tired). Charting is your friend: plot episode return over time and optionally a moving average to smooth noise. A flat line means either the pet is not learning or the environment does not provide learnable feedback.
Common scoring mistake: using only penalties (all rewards negative). This can work, but beginners often find it hard to interpret. Another mistake: giving a huge reward for one event (like eating once) and tiny penalties for everything else, which can lead to reward hacking (the pet repeats the big-reward trick even if it harms long-term health). Aim for balanced signals where the best long-term strategy clearly wins.
Practical outcome: you can state your goal, define the episode score, and list 2–4 diagnostic counters you would chart to verify improvement.
RL is powerful, but beginners often expect the wrong things. First, RL is not “the agent understands the world.” The pet is not reasoning with human concepts like “health” unless you encode those ideas into state and reward. If the pet learns to alternate Eat and Sleep, it’s not because it cares—it’s because that sequence yields higher reward in your rules.
Second, RL is not guaranteed to produce “nice” behavior. The agent optimizes the reward you wrote, not the reward you intended. If there’s a loophole, the pet will find it. For example, if you reward “fun” without any hunger penalty, the pet may play forever. If you end episodes early when hunger is High, the agent may learn to intentionally trigger early termination if that avoids larger future penalties (depending on how rewards are assigned). These are design issues, not moral failures by the agent.
Third, RL is not always the right tool. If you already know the correct rules (“if hungry then eat”), a simple scripted policy may be better. RL shines when you can define feedback but don’t want to hand-code all decisions, or when there are trade-offs and delayed effects that are hard to tune manually.
Finally, RL is not only “run training longer.” If learning stalls, the fix is often to adjust the environment: refine states, reshape rewards, or change training settings (like how often the agent explores) in a controlled way. Make one change at a time and keep notes, because RL systems can change behavior dramatically from small tweaks.
Practical outcome: you can describe an RL problem without math—agent, environment, actions, rewards, episodes, and success metrics—and you can predict at least one way a poorly designed reward might create an unintended habit.
1. Which description best matches how the virtual pet learns in reinforcement learning (RL) in this chapter?
2. In the RL loop described in the chapter, what happens immediately after the agent takes an action?
3. When describing an RL problem without math, which set of items does the chapter say you should identify?
4. What is the main engineering warning the chapter gives about RL being a “feedback system”?
5. Why does the chapter emphasize starting with a tiny environment whose rules you can explain in one minute?
Before you can “train” a virtual pet, you must build the world it lives in. In reinforcement learning (RL), this world is called the environment. Your pet is the agent. The environment provides the agent with a state (what the pet’s situation looks like right now). The agent chooses an action (what to do). Then the environment returns a reward (a score that signals how good that choice was) and a new state (how the world changed).
This chapter turns a fuzzy idea—“take care of the pet”—into rules a computer can simulate thousands of times. The goal is not realism; the goal is a clean, learnable system. If your rules are inconsistent or too complicated, the agent will learn slowly, learn the wrong thing, or appear random. If your rules are simple and aligned with what you mean by “good care,” the Q-learning in later chapters will have a stable foundation.
We will design three needs (hunger, energy, happiness), define four beginner-friendly actions (feed, play, rest, ignore), and create rewards that push the agent toward balanced care instead of one repetitive habit. You will also simulate one short episode by hand, because stepping through a few timesteps is the fastest way to catch mistakes before you write training code.
By the end of this chapter you will have a complete environment specification: states, actions, transitions, rewards, and termination rules—enough to plug into a Q-table trainer.
Practice note for Design the pet’s needs: hunger, energy, and happiness: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Define actions: feed, play, rest, and ignore: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create rewards: what gets points and what loses points: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Simulate one episode by hand to test your rules: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: a complete environment spec: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
State is the information your agent is allowed to use when making a decision. For a beginner RL project, you want state variables that are (1) few in number, (2) easy to update, and (3) directly related to the behavior you want. Our virtual pet will have three needs: hunger, energy, and happiness. The simplest version is to represent each need as a small integer scale.
A practical choice is a discrete 0–4 scale for each need. For example: hunger 0 means “not hungry,” hunger 4 means “very hungry.” Energy 0 means “exhausted,” energy 4 means “fully rested.” Happiness 0 means “sad,” happiness 4 means “thriving.” Discrete bins keep the Q-table small: with 5 values each, you have 5×5×5 = 125 possible states—easy to store and train.
Engineering judgment: resist the urge to add more variables (cleanliness, thirst, health, boredom) early on. Each new variable multiplies the state space and makes learning slower. Another common mistake is mixing directions (e.g., “hunger” where higher is worse, “energy” where higher is better) and then forgetting which way is which when writing rewards. If you keep the meaning consistent—higher means “more of that property,” even if more hunger is bad—your transition and reward code stays readable.
Define the state explicitly as a tuple: (hunger, energy, happiness). Also decide the starting state for each episode, such as (2, 2, 2) for “average needs.” Fixed starts make debugging easier; later you can randomize starts to improve robustness.
Actions are the buttons your agent can press. Beginner environments work best when the action list is short, meaningful, and always available. We’ll use four actions that map cleanly to pet care: feed, play, rest, and ignore. These are discrete actions: one choice per timestep.
Make actions “atomic.” Each action should represent a single, simple interaction that predictably changes needs. For example, feed should primarily reduce hunger, while rest should primarily increase energy. Play should increase happiness but usually costs energy and may increase hunger. Ignore does nothing helpful and allows needs to drift in a worse direction. The agent learns by comparing these trade-offs.
Two practical design rules help avoid confusion. First, keep the action set constant; don’t remove actions in some states (e.g., “can’t play if energy is 0”) unless you also handle invalid actions consistently. If you want constraints, a common approach is: allow the action but make it ineffective and/or penalized when the pet is too tired, so the agent learns not to waste moves. Second, ensure every action is sometimes useful. If one action is always dominated (always worse than another), it becomes noise and slows learning.
In code later, you’ll map these actions to integers for the Q-table, such as: 0=feed, 1=play, 2=rest, 3=ignore. Write the mapping down now as part of your environment spec so your training loop and your analysis charts interpret actions consistently.
Reward design is where you translate “good pet owner” into a score signal the agent can optimize. A good reward is aligned with your intention, not just with a single metric that can be exploited. If you only reward happiness, the agent may spam play even when hunger and energy are critical. If you only punish hunger, it may over-feed and ignore happiness. Balanced care is the goal, so reward should reflect balance.
A practical starting pattern has three parts: (1) a small living penalty each step to encourage efficiency, (2) a positive reward for improving a need that is currently bad, and (3) a negative reward for letting any need reach dangerous levels.
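One illustrative set of reward rules following that three-part pattern looks like this. The exact numbers and thresholds are assumptions to tune, not fixed course values; states are (hunger, energy, happiness) tuples on the 0–4 scales, with higher hunger being worse.

```python
# Illustrative reward rules: living penalty, reward for fixing a bad need,
# penalty for any need hitting a dangerous level. Numbers are assumptions.
DANGER_LOW, DANGER_HIGH = 0, 4   # edges of the 0-4 scales

def reward(prev, new):
    """Score one step from (hunger, energy, happiness) before and after."""
    h0, e0, p0 = prev
    h1, e1, p1 = new
    r = -1                           # (1) small living penalty each step
    if h0 >= 3 and h1 < h0:
        r += 3                       # (2) fed the pet while it was very hungry
    if e0 <= 1 and e1 > e0:
        r += 2                       # (2) rested while energy was nearly gone
    if h1 == DANGER_HIGH:
        r -= 4                       # (3) hunger hit the dangerous maximum
    if e1 == DANGER_LOW or p1 == DANGER_LOW:
        r -= 4                       # (3) energy or happiness bottomed out
    return r
```

Because every rule reads the state, pressing the same button in the wrong situation earns nothing, which is the point of rewarding outcomes rather than action labels.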
Notice the structure: the reward is mostly about outcomes (state after action), not the action label itself. This helps avoid “reward hacking” where the agent learns to press a certain button regardless of state. Another common mistake is making rewards too large or too sparse. If you only reward at the end of an episode, learning is slow because the agent doesn’t know which decisions mattered. If you make every tiny change worth huge points, training becomes unstable and the agent may chase short-term swings.
Keep numbers small and comparable (single digits to low tens). You can tune later, but start with something you can reason about by hand. In the next section we’ll define when episodes end, because termination interacts strongly with reward: if bad states end the episode, the agent experiences those outcomes more sharply.
An episode is one “life segment” of your pet: a sequence of timesteps where the agent tries to keep things going. Terminal conditions define when the episode ends. In RL terms, a terminal state is absorbing: once reached, the run stops and the environment resets. Clear terminal rules prevent confusing edge cases and let you measure progress with episode length and total reward.
For a virtual pet, natural terminal events are severe neglect. Choose a small set of endings that are easy to detect from the state: for example, end the episode immediately if hunger reaches its maximum (4), if energy or happiness falls to 0, or when a fixed time limit (say, 30 steps) runs out.
The time limit is important even if you have failure endings. Once the agent gets good, it might maintain safe values for a long time, and you still want training to cycle through many episodes. Also, a fixed horizon makes learning curves easier to compare.
Design choice: do you end immediately on a critical state, or allow recovery? Ending immediately makes the consequences of neglect sharp and simple. Allowing recovery is more realistic but requires careful reward shaping and can create confusing loops (the agent repeatedly “almost fails” to harvest some reward). For a first project, immediate termination on critical thresholds is safer.
Finally, decide what reward happens on termination. A typical approach is to apply a strong negative terminal reward (e.g., −20) in addition to any step reward, so the agent clearly prefers policies that avoid ending conditions. Write this down explicitly so your implementation is deterministic.
Transition rules are the physics of your pet world: given the current state and an action, how do hunger, energy, and happiness change? These rules must be consistent, bounded, and easy to compute. A reliable beginner pattern is to apply two layers each step: (1) the action effect, then (2) a small “natural drift” that happens every timestep (pets get a bit hungrier over time, for example). This prevents the agent from finding a static state where nothing changes.
Here is a concrete, beginner-friendly deterministic transition specification; treat the exact numbers as a starting point to tune. Action effects: Feed reduces hunger by 2. Play raises happiness by 2 but costs 1 energy and adds 1 hunger. Rest raises energy by 2 but lowers happiness by 1 (the pet gets a little bored). Ignore does nothing helpful and lowers happiness by 1. After the action, apply natural drift: hunger rises by 1 every step. Finally, clamp all variables to their allowed ranges (hunger 0–4, energy 0–4, happiness 0–4).
With these numbers, each action has a cost and a benefit, and none is always best. Feeding is powerful against hunger but doesn’t directly build happiness. Playing improves happiness but drains energy and increases hunger. Rest restores energy but can reduce happiness (the pet gets bored) and increases hunger. Ignoring accelerates decline and should rarely be chosen once the agent learns.
Common mistakes: forgetting to clamp values (leading to hunger becoming −3 or 99), applying drift twice, or applying drift before action in one part of code and after action elsewhere. Choose one order and document it. Determinism is also helpful early: add randomness later, after the Q-learning loop works, because randomness makes debugging harder.
Before you train anything, test the environment like an engineer. The fastest test is to simulate one short episode by hand and check that the numbers behave as expected. This catches reward mistakes (“playing gives points even when the pet is starving”), transition bugs (hunger moving the wrong way), and termination issues (episodes never end, or end instantly).
Start from (hunger=2, energy=2, happiness=2) with a 30-step time limit. Suppose you take actions: play → feed → rest → play. Apply the transition rules and drift consistently, recording the state, the reward, and the terminal check after each step.
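That hand simulation can also be scripted in a few lines. This sketch assumes one illustrative rule set (play: happiness +2, energy −1, hunger +1; feed: hunger −2; rest: energy +2, happiness −1; ignore: happiness −1; drift: hunger +1 after every action) and immediate termination on any critical need; your own numbers may differ.

```python
# Scripted version of the hand simulation, under illustrative rules.
def clamp(v):
    return max(0, min(4, v))         # keep every need on the 0-4 scale

def step(state, action):
    h, e, p = state
    if action == "feed":
        h -= 2
    elif action == "play":
        p += 2; e -= 1; h += 1
    elif action == "rest":
        e += 2; p -= 1
    elif action == "ignore":
        p -= 1                       # neglect: happiness slips
    h += 1                           # natural drift: a bit hungrier each step
    state = (clamp(h), clamp(e), clamp(p))
    done = state[0] == 4 or state[1] == 0 or state[2] == 0  # critical need
    return state, done

state = (2, 2, 2)
trace = [state]
for action in ["play", "feed", "rest", "play"]:
    state, done = step(state, action)
    trace.append(state)
    if done:
        break
# Under these numbers, the very first "play" pushes hunger from 2 to 4
# (play +1, drift +1) and ends the episode immediately -- a design flaw
# the hand simulation surfaces before any training code exists.
```

Running the trace exposes the problem instantly: hunger pressure is so strong that a reasonable opening move is fatal, which is the kind of finding the next paragraph discusses how to fix.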
This is exactly why manual simulation matters: a short trace can reveal, for instance, that hunger climbs to the terminal threshold within a step or two. You can respond in several ways: (1) apply the hunger drift only every other step, (2) reduce play's hunger increase, or (3) expand the scale so the terminal threshold becomes hunger == 5. There is no single "correct" choice; your goal is a learnable environment where good policies exist and failures are avoidable with reasonable play.
Also run “sanity checks” without math: if the pet is starving, does feeding reliably help? If energy is low, does rest help but not magically fix hunger and happiness too? If you ignore repeatedly, does the episode end quickly with negative total reward? These checks ensure the reward and termination rules produce the story you intend.
Checkpoint deliverable: write a one-page environment spec containing (a) state variables and ranges, (b) action list and integer mapping, (c) transition table including drift and clamping, (d) reward rules including terminal rewards, (e) terminal conditions and time limit, and (f) starting state. Once this spec is stable, you’re ready to implement the environment and begin Q-table training in the next chapter.
1. In Chapter 2, what sequence describes the basic interaction loop between the pet (agent) and the world (environment)?
2. Why does the chapter emphasize keeping the pet-world rules simple and consistent?
3. Which set correctly lists the three needs designed as part of the pet’s state in this chapter?
4. Which best describes the purpose of creating rewards in this environment?
5. Why does the chapter recommend simulating one short episode by hand before writing training code?
In Chapter 2 you turned a “virtual pet” into something a computer can interact with: a small world with clear states, a small set of actions, and a reward signal. Now we give the pet a learning notebook. This notebook is not magical—it’s a table of numbers that gets edited after every experience. Over time, those edits make the pet less random and more sensible.
This chapter focuses on Q-tables (a classic, beginner-friendly reinforcement learning tool). You will learn what a Q-value means in everyday terms, how to set up a Q-table for the pet world, how to update it from one step of experience, how repeated episodes make the table “fill in,” and what it looks like when the pet starts making better choices.
As you read, keep one engineering mindset: the table is a memory of “what tended to work,” not a guarantee. Your job is to shape the pet’s memory so it learns stable habits and doesn’t get stuck in silly loops.
Practice note for What a Q-value means using everyday language: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create a Q-table for the pet world: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Update the table: learn from one step of experience: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run repeated episodes and watch the table change: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: the pet starts making better choices: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Imagine your pet has a tiny notebook. For every situation it can recognize (a state), it writes down how good each possible choice (an action) seems. Each note is a Q-value: “If I’m in this situation and I do this action, how good should I expect the outcome to be?” In everyday language, Q-values are expected usefulness scores.
At the start, the notebook is empty—so we usually fill it with zeros. That does not mean “everything is equally good in real life.” It means “the pet has no evidence yet.” Then the pet begins to act, receive rewards, and adjust the numbers. After enough experiences, the Q-values become a practical guide: the pet can look at a state, compare the action scores, and pick the best one most of the time.
This is the key shift from “try things” to “remember what worked.” Random trying is still important early on (exploration), but the notebook allows the pet to cash in what it has learned later (exploitation).
A common mistake is to treat Q-values like rules (“always do X”). They are not rules; they are estimates based on experience. If your rewards are noisy or your states are too vague, the notebook will reflect that—and the pet will learn confusing habits.
A Q-table is literally a table: rows are states, columns are actions, and each cell contains a Q-value. The layout forces you to be explicit about what the agent can sense and what it can do—this is why Q-learning is such a good teaching tool.
Let’s define a simple pet world you can actually implement. Suppose the pet tracks two needs: hunger and energy. To keep the table small for beginners, we discretize each into two categories: hunger is either Hungry or Full, and energy is either Tired or Rested.
That gives 4 states total: (Hungry,Tired), (Hungry,Rested), (Full,Tired), (Full,Rested). Now define three actions: Feed, Play, Rest. Your Q-table is a 4×3 grid.
Each cell Q(s,a) answers: “If I am in state s and choose action a, how good is that choice expected to be?” In code, you can represent the table as:
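One beginner-friendly layout is a nested dictionary keyed by state, then by action (a 4×3 array indexed by state number and action number works equally well):

```python
# Q-table for the 4-state, 3-action pet world, initialized to zeros.
# Zeros mean "no evidence yet," not "all actions are equally good."
STATES = [("Hungry", "Tired"), ("Hungry", "Rested"),
          ("Full", "Tired"), ("Full", "Rested")]
ACTIONS = ["Feed", "Play", "Rest"]

Q = {state: {action: 0.0 for action in ACTIONS} for state in STATES}

# Look up the value of feeding while hungry and rested:
print(Q[("Hungry", "Rested")]["Feed"])  # 0.0 before any learning
```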
Engineering judgment: keep the state space small at first. Beginners often add too many state details (“mood,” “cleanliness,” “time of day,” “toy type”) and end up with a table that never gets enough experiences per cell to learn reliable values. If you want richer behavior later, you can expand the state gradually, but first get the learning loop working end-to-end.
Practical outcome: once your Q-table exists, you can print it after each episode. Watching numbers change is one of the clearest ways to “see” reinforcement learning happening.
The heart of Q-learning is the update rule: after the pet takes one action and sees what happens, it revises the notebook entry for that (state, action). The simplest form looks like:
Q(s,a) ← Q(s,a) + α × (target − Q(s,a))
This reads like common sense editing: “new value = old value + (a fraction of the mistake).” The learning rate α (alpha) is that fraction. If α is 0.1, you correct 10% of the gap between what you predicted (Q) and what you just observed (the target). If α is 1.0, you overwrite the old value completely with the new target.
In everyday terms, α controls how quickly your pet changes its mind: a small α means slow, steady revisions of its beliefs, while a large α means dramatic rewrites after every single experience.
Common mistake: setting α high to “learn faster” and then being surprised when the pet’s behavior swings wildly between episodes. If your environment has randomness (e.g., sometimes play gives extra happiness), high α can cause the Q-values to chase noise.
Practical workflow: start with α around 0.1 or 0.2 for small toy problems. If learning is painfully slow, increase slightly. If the Q-table oscillates or never settles, decrease α and ensure rewards are not excessively large.
So far, it sounds like we only care about the immediate reward. But good pet behavior often means taking a small short-term cost for a better future. For example: resting now might enable play later, or feeding now prevents a large hunger penalty later. Q-learning handles this by including future rewards in the target.
The target typically is:
target = r + γ × max_a′ Q(s′, a′)
Here, r is the immediate reward you just received, s′ is the next state after your action, and the max term represents “the best expected value from the next situation onward.” The discount factor γ (gamma) is how much the pet cares about the future.
Plain-language interpretation: γ is like patience. A patient pet learns routines that set it up for success; an impatient pet learns quick fixes.
Engineering judgment: for episodic tasks (your pet has short “days” or episodes), γ is often set around 0.9 to 0.99. But if your rewards include ongoing living costs (like a small negative reward each step to encourage efficiency), a high γ can be helpful to value finishing the episode sooner. A common mistake is γ=1.0 with no safeguards; this can make learning unstable in environments where loops are possible and rewards don’t naturally end.
Practical outcome: once γ is in place, your Q-table no longer reflects just “what feels good now,” but “what tends to lead to a good life for the pet over multiple steps.”
Let’s do one concrete update so the rule becomes mechanical rather than mysterious. Assume a learning rate α = 0.2 and a discount factor γ = 0.9.
Suppose the pet is in state s = (Hungry,Rested). It chooses action a = Feed. The environment responds with an immediate reward of r = 4 and moves the pet to a next state s′.
Assume the notebook currently has Q(s, Feed) = 1.0, and that the best Q-value among the actions available in s′ is 3.0.
Compute the target:
target = r + γ × maxQ(next) = 4 + 0.9 × 3.0 = 4 + 2.7 = 6.7
Now update the old Q-value toward the target:
newQ = oldQ + α × (target − oldQ) = 1.0 + 0.2 × (6.7 − 1.0)
newQ = 1.0 + 0.2 × 5.7 = 1.0 + 1.14 = 2.14
Interpretation: the pet previously thought feeding while hungry/rested was only mildly good (1.0). After seeing that feeding leads to a strong immediate reward and a promising next situation, it raises that estimate to 2.14. It did not jump all the way to 6.7 because α was only 0.2—it is learning cautiously.
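The same arithmetic, written as a few lines of Python using the α = 0.2 and γ = 0.9 from the example:

```python
# One Q-learning update, matching the worked example step by step.
alpha, gamma = 0.2, 0.9

old_q = 1.0        # current estimate for Q(s, Feed)
reward = 4.0       # immediate reward r
max_q_next = 3.0   # best Q-value available in the next state s'

target = reward + gamma * max_q_next      # 4 + 0.9 * 3.0 = 6.7
new_q = old_q + alpha * (target - old_q)  # 1.0 + 0.2 * 5.7 = 2.14
print(round(new_q, 2))  # 2.14
```

Because α is only 0.2, the estimate moves a fifth of the way toward the target rather than jumping straight to 6.7.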
Run repeated episodes and these values will keep moving. Early on, you’ll see many zeros and small numbers. Later, the table develops “structure”: in hungry states, Feed becomes high; in tired states, Rest becomes high; and the pet begins to make better choices even before it has explored every possible sequence perfectly.
Once your update rule works, the next challenge is preventing your pet from learning brittle or chaotic policies. Stability is not only about math—it’s also about sensible engineering choices: rewards, exploration, episode length, and monitoring.
Guardrail 1: Exploration vs. exploitation. If the pet always picks the current best Q-value, it may get stuck in a “good enough” habit and never discover better routines. Use an ε-greedy policy: with probability ε choose a random action (explore), otherwise choose the best-known action (exploit). Start ε relatively high (e.g., 0.3) and decay it slowly over episodes (e.g., toward 0.05) so the pet explores early and behaves reliably later.
Guardrail 2: Reward scaling and signs. Rewards that are too large can create huge Q-values and unstable learning. Rewards that are mostly negative can also work, but you must be consistent and ensure the pet can still find improvement. A common mistake is giving a big reward for one action (like +100 for Play) and tiny penalties elsewhere; the pet may spam that action regardless of state.
Guardrail 3: Episode design. Define a clear episode boundary (a “day” for the pet). Without boundaries, loops can dominate learning. If episodes are too short, the pet cannot experience long-term consequences; too long, and learning may become slow and noisy.
Guardrail 4: Track progress. Keep a simple per-episode score: total reward. Plot it or print a rolling average. If the average reward improves over time and variance decreases, learning is becoming stable. If it collapses suddenly, suspect a bug in the update, state transitions, or reward function.
Checkpoint: when these guardrails are in place, you should be able to print the Q-table after training and see that the highest values align with common sense—feed when hungry, rest when tired, and play when needs are met. That is the moment the virtual pet stops acting like a coin flip and starts acting like it is learning.
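This checkpoint can even be automated. The sketch below invents Q-values purely for illustration and checks that each state’s greedy action matches common sense:

```python
# Sanity-check a trained Q-table: the best action per state should match
# common sense. These Q-values are made up for illustration only.
Q = {
    ("Hungry", "Tired"):  {"Feed": 3.1, "Play": -0.5, "Rest": 2.0},
    ("Hungry", "Rested"): {"Feed": 4.2, "Play": 0.8,  "Rest": 0.1},
    ("Full", "Tired"):    {"Feed": 0.2, "Play": -1.0, "Rest": 3.5},
    ("Full", "Rested"):   {"Feed": 0.0, "Play": 2.9,  "Rest": 0.7},
}

def greedy_action(q_table, state):
    """Return the action with the highest Q-value in this state."""
    return max(q_table[state], key=q_table[state].get)

assert greedy_action(Q, ("Hungry", "Rested")) == "Feed"   # feed when hungry
assert greedy_action(Q, ("Full", "Tired")) == "Rest"      # rest when tired
assert greedy_action(Q, ("Full", "Rested")) == "Play"     # play when needs are met
```

If a check like this fails after training, that is your cue to inspect the reward rules or the update code rather than to keep training blindly.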
1. In this chapter, what is a Q-table meant to represent for the virtual pet?
2. Which description best matches a Q-value in everyday language?
3. To create a Q-table for the pet world, what must its rows and columns correspond to?
4. What is the purpose of updating the Q-table after one step of experience?
5. Why do repeated episodes help the Q-table “fill in” and the pet make better choices?
In the last chapter, you built a Q-table: a simple memory that estimates how good each action is in each state. That immediately raises a tempting idea: “If we already know the best action, why not always take it?” This chapter is about why that instinct can quietly ruin learning—and how to teach your virtual pet the good habit of trying new things without becoming reckless.
Reinforcement learning is a loop: the agent (your pet brain) observes a state from the environment (hunger, boredom, energy, time-of-day, etc.), chooses an action (eat, sleep, play, train), and receives a reward (good or bad). A Q-table only improves if it sees enough variety: different states and different actions, including the “boring” ones that initially look worse. Exploration is how the agent collects that evidence.
But exploration must be controlled. Too little exploration and the pet gets stuck in a habit loop (like always sleeping because it once got a small reward). Too much exploration and the pet looks random forever, never settling into a reliable routine. The key is balancing exploration (try actions to learn) with exploitation (use what you’ve learned to score well). In practice, you implement this balance with an epsilon-greedy action selection policy and tune a few training settings safely.
This chapter will give you an engineering workflow: (1) implement epsilon-greedy, (2) schedule epsilon over training, (3) keep beginner-safe hyperparameter ranges for learning rate and discount, (4) measure progress across many episodes (not one), and (5) use simple logs to spot bad patterns like loops and persistent randomness. By the end, your checkpoint is consistent improvement across episodes—not perfection, but a trend you can trust.
Practice note for Why “always pick the best known action” can fail: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Epsilon-greedy choices: controlled curiosity: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Tune training settings: learning rate, discount, exploration: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Spot bad learning patterns: loops and random behavior: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: consistent improvement across episodes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Imagine dropping a kid into a brand-new playground. If they sprint to the first slide they see and then refuse to try anything else, they might miss the swings, the climbing wall, or the hidden tunnel that’s actually the most fun. “Always pick the best known thing” fails because the “best known thing” is based on incomplete experience.
Your virtual pet has the same issue. Early in training, most Q-values start at 0 (or some neutral default). If the pet happens to get a small positive reward from one action early—maybe “sleep” reduces hunger slightly in your simplified rules—it may repeatedly exploit that action. The Q-table then gets lots of updates for sleep, and almost none for play, eat, or train. The pet looks consistent, but it’s consistently under-trained.
This is how bad habits form in RL: the agent over-commits before it has evidence. You’ll commonly see two failure modes: the pet locks into a mediocre habit it stumbled on early, and the Q-values for rarely tried actions sit near their initial defaults forever, so the table never learns whether those actions were actually better.
Exploration is your mechanism to make the agent behave like the curious kid: try multiple pieces of equipment long enough to learn what’s truly best. The tricky part is doing this without turning training into chaos. That’s exactly what epsilon-greedy gives you: curiosity with a dial.
Epsilon-greedy is a simple rule for choosing actions: with probability ε (epsilon), explore by picking a random action; with probability 1 − ε, exploit by picking the action with the highest Q-value in the current state. It’s popular because it’s easy to implement and easy to reason about.
Here is the step-by-step workflow you should follow each time the pet must act: (1) observe the current state, (2) draw a random number between 0 and 1, (3) if it is below ε, pick a random action (explore), and (4) otherwise pick the action with the highest Q-value for that state (exploit).
Two practical details matter a lot for beginners. First, define how you break ties when multiple actions share the same max Q-value. If you always pick the first action, you accidentally bias behavior and reduce exploration even when ε is low. A common fix is: if several actions are tied for max, pick randomly among the tied actions.
Second, ensure “random action” really means uniform over valid actions in that state. If some actions are invalid (e.g., “eat” when no food is available), either remove them from the action set for that state or assign a clear negative reward. Silent invalid actions can create misleading Q-values and make the agent look random for the wrong reason.
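Putting the selection rule, random tie-breaking, and the valid-action caveat together, a minimal epsilon-greedy chooser might look like this (names are illustrative):

```python
import random

def choose_action(q_table, state, actions, epsilon):
    """Epsilon-greedy selection with random tie-breaking among max-value actions.

    `actions` should contain only the actions valid in this state.
    """
    if random.random() < epsilon:
        return random.choice(actions)              # explore: uniform over valid actions
    q_values = [q_table[state][a] for a in actions]
    best = max(q_values)
    tied = [a for a, q in zip(actions, q_values) if q == best]
    return random.choice(tied)                     # exploit, breaking ties randomly
```

Passing in only the valid actions for the state handles the “eat when no food is available” problem by construction.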
Epsilon-greedy won’t magically solve everything, but it gives you controlled curiosity. The pet still learns from rewards, but it also gathers enough diverse experience to stop confusing luck with skill.
Using a fixed epsilon is workable, but it’s rarely ideal. Early training should be adventurous because the Q-table is mostly empty. Later training should be focused because you want consistent behavior. This motivates an epsilon schedule: start with higher exploration, then reduce it over episodes.
A practical mental model: in episodes 1–50, you want the pet to “try everything and see what happens.” In episodes 200–500, you want it to “mostly do what works, occasionally double-check.” The schedule is how you encode that shift.
Common schedules that are beginner-friendly: linear decay (subtract a small fixed amount from ε after each episode until it reaches a floor) and multiplicative decay (multiply ε by a factor slightly below 1, such as 0.995, after each episode).
Keep a non-zero floor, like ε_min = 0.01 or 0.05. Why? Because environments can be stochastic, and your learned policy can drift into a brittle pattern that only works under certain random outcomes. A tiny amount of exploration acts like ongoing quality assurance.
Engineering judgement: decay epsilon based on evidence, not superstition. If your episode rewards are still volatile and trending upward, you may be decaying too quickly. If rewards plateau early and the pet repeats a loop, you may be decaying too slowly or your reward design may be encouraging the wrong habit. A useful checkpoint is to compare behavior at ε=0.3 versus ε=0.05: the first should look curious; the second should look purposeful.
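Both schedule styles, with the non-zero floor, fit in a few lines; the default numbers below are the beginner-friendly values discussed above and are starting points, not rules:

```python
# Two beginner-friendly epsilon schedules, each with a floor (eps_min).
def linear_decay(episode, eps_start=0.3, eps_min=0.05, step=0.001):
    """Subtract a fixed amount per episode, never dropping below the floor."""
    return max(eps_min, eps_start - step * episode)

def multiplicative_decay(episode, eps_start=0.3, eps_min=0.05, factor=0.995):
    """Multiply by a factor per episode, never dropping below the floor."""
    return max(eps_min, eps_start * factor ** episode)
```

A quick way to sanity-check a schedule is to print its value at episodes 0, 50, and 500 and confirm the early/late behavior you intended.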
In Q-learning, three settings strongly shape behavior: learning rate (α), discount factor (γ), and exploration (ε). If your pet seems stuck, random, or overly cautious, these are the first knobs to check—after confirming your state and reward definitions make sense.
Learning rate α (alpha) controls how quickly new experiences overwrite old beliefs. High α makes the pet adapt quickly but can cause wobbling when rewards are noisy; low α makes learning stable but slow. Beginner-safe range: 0.1 to 0.5. If your environment is very noisy, lean lower (0.1–0.2). If it’s deterministic and simple, you can use 0.3–0.5.
Discount factor γ (gamma) controls how much future rewards matter. A pet that should plan ahead (eat now to have energy to play later) needs a fairly high γ. Beginner-safe range: 0.85 to 0.99. Lower γ makes the pet short-sighted, often chasing immediate small rewards and forming loops. Higher γ encourages long-term routines but can slow learning if rewards are sparse.
Exploration ε (epsilon) controls trial behavior, not memory updates. Typical starting ε: 0.2 to 0.5. Typical minimum ε: 0.01 to 0.05. If the pet looks random late in training, your ε may be too high (or your state space is too large for the number of episodes). If the pet locks into a habit too early, ε may be too low or decaying too fast.
Common beginner mistake: changing all three at once. Instead, change one parameter, run multiple training runs (next section), and compare average episode reward curves. This isolates cause and effect. Another mistake is using extreme rewards to “force” behavior; that can create brittle strategies and unstable Q-values. Prefer moderate, well-shaped rewards and adjust hyperparameters gradually.
Reinforcement learning is noisy by nature. Even with the same code, two training runs can diverge because early random actions lead to different experiences, which produce different Q-values, which produce different future actions. If you judge progress from a single run, you can be fooled by luck—good or bad.
To make conclusions you can trust, adopt a simple evaluation habit: repeat runs and average results. For example, train the same setup with 5–20 different random seeds. Track the episode reward for each run, then compute the mean (and ideally the standard deviation). This tells you whether your improvement is consistent or fragile.
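The repeat-and-average habit can be sketched as follows; `run_training` below is a stand-in for your real training-plus-evaluation function, not part of any library:

```python
import random
import statistics

def run_training(seed):
    """Placeholder for a real training run; returns a final average reward."""
    rng = random.Random(seed)       # seed controls all randomness in the run
    return rng.uniform(0, 10)       # replace with actual training + evaluation

seeds = range(5)
results = [run_training(s) for s in seeds]
mean = statistics.mean(results)
std = statistics.stdev(results)
print(f"mean final reward: {mean:.2f} +/- {std:.2f} over {len(results)} seeds")
```

A small mean improvement with a large standard deviation usually means your change is fragile; rerun with more seeds before trusting it.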
Also use moving averages within a run. Episode rewards often bounce because exploration intentionally injects “bad” actions. A 20-episode moving average smooths that noise so you can see the trend. When your chapter checkpoint says “consistent improvement across episodes,” it usually means the moving average rises over time, not that every single episode is better than the last.
Two practical patterns to watch for: a moving average that plateaus early while the same state-action pairs repeat in your logs (a loop the rewards accidentally support), and a moving average that stays flat and noisy late in training (exploration that never settles, often because ε is not actually decaying).
When you compare settings, keep evaluation separate from training. A clean method is: train with epsilon-greedy, then run a short “test” phase with ε=0 (pure exploitation) to see what policy the pet actually learned. This prevents exploration noise from hiding whether the Q-table is improving.
When behavior looks wrong, resist the urge to guess. Add a few lightweight logs that tell you what the agent believed and why it acted. You don’t need fancy dashboards; a consistent text log can reveal loops, mis-specified rewards, and broken exploration in minutes.
At minimum, print or record these fields for a small sample of episodes (for example, the first 2 episodes, then every 50th): the episode and step number, the current state, the chosen action, whether that choice was exploration or exploitation, the reward received, the current ε, and the Q-values for that state.
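One way to assemble those fields into a single log line (the helper and its field names are illustrative, not a required format):

```python
def log_step(episode, step, state, action, explored, reward, epsilon, q_row):
    """Format one training step as a single human-readable log line."""
    mode = "explore" if explored else "exploit"
    return (f"ep={episode} step={step} state={state} action={action} "
            f"mode={mode} reward={reward} eps={epsilon:.2f} q={q_row}")

line = log_step(1, 3, ("Hungry", "Rested"), "Feed", False, 4, 0.30,
                {"Feed": 2.14, "Play": 0.0, "Rest": 0.0})
print(line)
```

Grepping such lines for `mode=explore` or for a repeated `state=... action=...` pair makes loops and broken exploration easy to spot.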
With that information, you can diagnose specific bad learning patterns. If the pet is stuck in a loop, the log will show repeated state-action pairs and often a reward that is accidentally positive (or not negative enough) for the looping behavior. If the pet looks random late in training, check whether epsilon is actually decaying, whether tie-breaking is biasing choices, and whether Q-values are staying near zero (a sign the pet isn’t receiving informative rewards).
A practical debugging routine: pick one problematic episode, replay it with logs at every step, and ask three questions: (1) Was the action chosen due to exploration or exploitation? (2) Did the reward match your intended “good habit” rules? (3) Did the Q-value update move in the expected direction? If any answer surprises you, you’ve found where to fix the environment rules, reward shaping, or training settings. This is how you turn “my pet is weird” into a concrete, correctable engineering issue.
1. Why can “always pick the best known action” quietly ruin learning in a Q-table agent?
2. In an epsilon-greedy policy, what does increasing epsilon primarily do?
3. Which training outcome best signals too little exploration?
4. Which workflow best matches the chapter’s recommended approach to balancing exploration and exploitation?
5. What is the chapter’s checkpoint for success during training?
Up to now, your virtual pet has been learning through trial and error: it takes an action, the environment changes, and it gets a reward (or penalty). That loop is the heart of reinforcement learning—but it is also where beginners get misled. A pet can “look” smarter in a few episodes just because it got lucky. Or it can truly improve while still having occasional bad days. In this chapter you will add simple, reliable ways to measure progress, and you will refine the reward rules so the agent learns faster without learning the wrong thing.
The workflow is: (1) define success metrics that match your real goal (like average score and survival time), (2) visualize training with basic charts and moving averages, (3) adjust rewards carefully (reward shaping) to guide learning, (4) watch for reward hacking—when the agent exploits your rules rather than solving the intended task—and (5) make the environment slightly harder in a controlled way to create a stronger, more reliable pet agent.
Remember: when you change rewards or states you are not “fixing the agent,” you are rewriting the problem. That’s normal. The skill is doing it safely, making one change at a time, and measuring whether the change really helps.
Practice note for Define success metrics: average score and survival time: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Visualize learning: simple charts and moving averages: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Reward shaping: guide the pet without “cheating”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Prevent reward hacking: when the pet exploits your rules: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: a stronger, more reliable pet agent: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Before you tune anything, decide what “good” means. For a virtual pet, two beginner-friendly success metrics are average score per episode and survival time (how many steps the pet stays alive or above “game over” thresholds). Score summarizes the reward your rules produce; survival time captures whether the pet avoids catastrophic states. If your pet can survive for a long time but earns low score, it may be playing too safely. If score is high but survival is low, it may be taking risky actions for short-term reward.
Track metrics at the episode level: after each episode ends, log total reward (episode return) and steps survived. Then compute an average over many episodes. A single episode is noisy; reinforcement learning has randomness from exploration, environment dynamics, and initial state. Trust trends, not individual wins.
A common mistake is to optimize a metric that is too close to your current reward definition. If your reward gives +1 for “eating” and -1 for “being hungry,” the pet may learn to trigger “eating” repeatedly even if it ignores sleep. Survival time helps you detect these imbalances. Another mistake is changing two things at once: if you adjust rewards and also change epsilon decay, you won’t know which change caused the improvement.
Once you log episode return and survival steps, plot them. The simplest visualization is a line chart with episode number on the x-axis and total reward on the y-axis. For beginners, the key technique is a moving average, such as a rolling mean over the last 50 or 100 episodes. The raw curve will look jagged because exploration forces occasional “bad choices.” The moving average tells the real story: is the pet improving overall?
Read training curves like a narrative with chapters: an early phase where rewards are low and jumpy (mostly exploration), a middle phase where the moving average climbs (the Q-table is filling in), and a late phase where the curve flattens at a higher level with smaller swings (the policy is stabilizing).
Use two charts: one for average return and one for survival time. If return improves but survival does not, your reward is likely pushing “point scoring” rather than robustness. If survival improves but return does not, the agent may be avoiding both good and bad outcomes. That’s a cue to revisit reward balance.
Practical tip: plot evaluation curves every N episodes. For example, after every 200 training episodes, run 20 evaluation episodes with epsilon set to 0.01 (almost greedy). Plot the evaluation average separately. This prevents you from confusing “learning” with “exploration noise.”
Reward shaping means adding extra reward signals to guide the agent toward good behavior sooner. Done well, shaping is like giving hints. Done poorly, it is like bribing the pet to do something unrelated to the real goal.
A safe mindset is: keep the main objective reward intact (for example, +10 for staying healthy at episode end, or -10 for “game over”), and add small, local shaping rewards that point in the same direction. For a virtual pet, examples include: a small penalty each step hunger is high, a small bonus when hunger decreases, or a small penalty when energy hits a critical threshold. These make learning less “blind” than only rewarding at the end.
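Such shaping signals can live in a small helper alongside the main reward. A minimal sketch, in which all thresholds and magnitudes are illustrative assumptions rather than values from the course:

```python
def shaping_reward(prev_hunger, hunger, energy,
                   hunger_limit=7, energy_critical=2):
    """Small local hints that point the same way as the terminal reward.

    Keep every magnitude well below the terminal +10/-10 so the agent
    still cares most about survival (see the note on bribes below).
    """
    r = 0.0
    if hunger > hunger_limit:       # small penalty while hunger stays high
        r -= 0.1
    if hunger < prev_hunger:        # small bonus when hunger decreases
        r += 0.1
    if energy <= energy_critical:   # small penalty at a critical energy level
        r -= 0.1
    return r
```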
Why are big bribes risky? Because Q-learning maximizes expected return; it will gladly farm a repeatable reward loop even if the pet’s overall wellbeing does not improve. Keep shaping rewards smaller than your terminal success/failure rewards so the agent still cares most about survival and long-term health.
Engineering judgement: introduce shaping one piece at a time, and watch how the training curves change. If learning gets faster but evaluation performance gets worse, you may have created a shortcut that only works during exploration or only works under specific random conditions.
Reward designs fall on a spectrum from sparse to dense. Sparse rewards are rare: the pet might get 0 most steps, then +10 at the end if it survives, and -10 if it fails. Dense rewards give frequent feedback: small bonuses and penalties almost every step.
Sparse rewards are conceptually clean and harder to exploit, because there are fewer “reward tokens” to game. The downside is that learning can be slow: the agent must stumble into good long-term behavior before it receives any signal that it was good. Beginners often interpret this slow progress as “my Q-table is broken,” when the real issue is that the agent rarely experiences success.
Dense rewards speed up learning by giving gradient-like hints. But they increase the risk of teaching the wrong lesson. If you reward “eating” frequently, the pet may overeat. If you reward “moving” to explore, it may pace endlessly even when it should rest.
A practical compromise for a first virtual pet is a hybrid: keep large terminal rewards for the real goal (for example, +10 for surviving the episode and -10 for game over), and add a few small per-step shaping signals that point in the same direction, such as a small penalty while hunger is high. The terminal rewards define success; the dense hints merely make it easier to find.
When you move from sparse to dense rewards, expect the average score chart to change scale. This is normal. Compare policies by evaluation behavior and survival time, not only by the raw total reward number.
Reward hacking happens when the agent finds a way to earn reward that violates your intent. It is not “cheating” in a human sense; it is exactly what you asked for mathematically. Your job is to notice it early and adjust the environment rules or rewards.
Common virtual-pet reward hacks: triggering "eating" in a loop to farm the food bonus while ignoring sleep; pacing endlessly to collect a movement or exploration bonus; and sleeping forever when rest gives a positive reward with no downside.
Fixes should be targeted and minimal: cap or diminish any reward that can be farmed, add a small step penalty so "doing nothing useful" has a cost, or make actions trade off (rest restores energy but increases hunger) so no single action is free. Change one rule, retrain, and re-evaluate before touching anything else.
After a fix, rerun evaluation episodes. A good sign is when survival time and evaluation return improve together, and the pet’s behavior looks stable across many random starts—not just one lucky scenario.
Once the agent is improving, it is tempting to “upgrade” the world dramatically: more states, more actions, more randomness. But big changes can erase what you learned about debugging and can destabilize training. Instead, make the environment slightly harder in controlled steps, and use checkpoints to keep a stronger, more reliable pet agent.
Safe ways to increase difficulty: raise one pressure slightly (for example, hunger grows a little faster), add a single new state variable or action, or introduce a small amount of randomness into the dynamics. One change at a time, so the cause of any regression stays obvious.
Do this with a checkpoint routine: save your Q-table and training settings when evaluation performance reaches a new best. Then introduce one environment change, retrain, and compare against the checkpoint using the same evaluation protocol. If performance drops, you can roll back and try a smaller change.
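A checkpoint routine along these lines can be a tiny class; the `update`/`rollback` interface here is a hypothetical sketch, not a prescribed API:

```python
import copy

class Checkpoint:
    """Keeps a copy of the best-so-far Q-table and its evaluation score."""
    def __init__(self):
        self.best_score = float("-inf")
        self.best_q = None

    def update(self, q_table, eval_score):
        """Snapshot the Q-table when evaluation reaches a new best."""
        if eval_score > self.best_score:
            self.best_score = eval_score
            self.best_q = copy.deepcopy(q_table)
            return True
        return False

    def rollback(self):
        """Return a copy of the best snapshot, e.g. after a failed difficulty bump."""
        return copy.deepcopy(self.best_q)
```

Call `update` after each evaluation pass; if a new environment change hurts performance, restore from `rollback` and try a smaller change.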
Also consider training settings as part of “safe difficulty.” When the environment becomes harder, you may need slower epsilon decay (more exploration), a slightly lower learning rate for stability, or more episodes. The goal is not just a higher chart—it is a pet that behaves well across varied situations. That is the checkpoint for this chapter: reliable improvement you can measure, explain, and reproduce.
1. Why does Chapter 5 emphasize using metrics like average score and survival time instead of judging progress from a few episodes?
2. What is the main purpose of using moving averages when visualizing training?
3. Which statement best describes reward shaping as presented in this chapter?
4. What is reward hacking in the context of this chapter?
5. According to the chapter, what is the safest way to improve the system when you change rewards or states?
Up to this point, you’ve trained a virtual pet by iterating quickly: change a reward, rerun training, glance at scores, repeat. That is exactly how learning RL feels at the beginning. Now you’ll turn that notebook-style work into a small, clean project you can run again, share with others, and extend safely.
This chapter focuses on engineering judgment: separating responsibilities (environment vs. agent vs. training loop), making results reproducible, saving what the pet learned, and running a fair evaluation that distinguishes “trained behavior” from “tested behavior.” You’ll also look ahead: how to scale your pet to bigger worlds without getting overwhelmed, and how to recognize when Q-learning with a Q-table is reaching its limits.
By the end, you should be able to present a simple demo: start from scratch, train, save, reload, test, and show a small chart or printed metrics that prove your pet improved.
Practice note for this chapter's milestones (refactor into environment, agent, and training loop; save and load what the pet learned; run a final training-vs-testing evaluation; plan next steps toward bigger worlds; present the final demo): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
The fastest way to make an RL project painful is to mix everything together: environment rules, Q-table updates, epsilon-greedy action selection, plotting, and printing all in one file. It works once, then becomes hard to trust and harder to extend. The goal is not “enterprise architecture.” The goal is to separate concerns so you can change one part without breaking the rest.
A clean beginner structure usually has three pieces: the environment (world rules, state transitions, and rewards), the agent (the Q-table plus action selection and updates), and the training loop (episodes, logging, and schedules such as epsilon decay).
Practically, this can be three files: env.py, agent.py, train.py. Your environment should expose methods like reset() (return initial state) and step(action) (return next_state, reward, done, info). Keep rendering (printing the pet mood, hunger, etc.) optional; it’s helpful for demos but slows training.
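A minimal sketch of that `reset()`/`step()` interface follows; the specific rules and numbers (hunger and energy changes, thresholds, the small living bonus) are illustrative assumptions, not the course's exact pet:

```python
import random

class PetEnv:
    """Minimal pet world exposing reset()/step(action); rules are illustrative."""
    ACTIONS = ["feed", "play", "rest"]          # indices 0, 1, 2

    def __init__(self, seed=0, max_steps=50):
        self.rng = random.Random(seed)          # env owns its randomness
        self.max_steps = max_steps

    def reset(self):
        self.hunger, self.energy, self.t = 5, 5, 0
        return self._state()

    def _state(self):
        return (self.hunger, self.energy)       # small discrete state tuple

    def step(self, action):
        self.t += 1
        if action == 0:                          # feed: hunger drops
            self.hunger = max(0, self.hunger - 2)
        elif action == 1:                        # play: tiring and hunger-inducing
            self.energy = max(0, self.energy - 1)
            self.hunger = min(9, self.hunger + 1)
        else:                                    # rest: energy recovers (a bit randomly)
            self.energy = min(9, self.energy + self.rng.choice([1, 2]))
            self.hunger = min(9, self.hunger + 1)
        failed = self.hunger >= 9 or self.energy <= 0
        done = failed or self.t >= self.max_steps
        reward = -10.0 if failed else 0.1        # big failure penalty, small living bonus
        return self._state(), reward, done, {}
```

Note the environment returns `(next_state, reward, done, info)` and knows nothing about Q-values.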
Your agent should not know “pet rules.” It should only know how to map (state, action) to Q-values and update them: Q[s,a] = Q[s,a] + alpha * (reward + gamma * max_a' Q[s',a'] - Q[s,a]). A common mistake is letting the agent call environment internals directly (“if hunger high then…”). That turns learning into hard-coded behavior and hides bugs.
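An agent that applies exactly that update rule, without knowing any pet rules, could be sketched as:

```python
import random
from collections import defaultdict

class QAgent:
    """Tabular Q-learning agent; it never touches environment internals."""
    def __init__(self, n_actions, alpha=0.1, gamma=0.9, seed=0):
        self.q = defaultdict(float)   # (state, action) -> value, defaults to 0.0
        self.n_actions = n_actions
        self.alpha, self.gamma = alpha, gamma
        self.rng = random.Random(seed)

    def act(self, state, epsilon):
        """Epsilon-greedy: explore with probability epsilon, else best known action."""
        if self.rng.random() < epsilon:
            return self.rng.randrange(self.n_actions)
        values = [self.q[(state, a)] for a in range(self.n_actions)]
        return values.index(max(values))

    def learn(self, s, a, reward, s_next, done):
        """Q[s,a] += alpha * (reward + gamma * max_a' Q[s',a'] - Q[s,a])."""
        best_next = 0.0 if done else max(
            self.q[(s_next, a2)] for a2 in range(self.n_actions))
        target = reward + self.gamma * best_next
        self.q[(s, a)] += self.alpha * (target - self.q[(s, a)])
```

Because the agent only sees opaque states and action indices, swapping in a harder environment later requires no agent changes.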
In the training loop, keep a clear episode boundary: reset, iterate steps until done, record total reward. This makes it straightforward to add saving/loading later and to compare training vs. testing without accidental differences.
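A training loop with that clear episode boundary, written against a generic `reset`/`step`/`act`/`learn` interface (the toy one-step bandit environment and tiny agent below exist only to make the sketch self-contained and runnable):

```python
import random

def train(env, agent, episodes=500, max_steps=100,
          eps_start=1.0, eps_end=0.05, eps_decay=0.995):
    """Generic episode loop: reset, step until done, record the episode return."""
    epsilon, history = eps_start, []
    for _ in range(episodes):
        state = env.reset()
        ep_return = 0.0
        for _ in range(max_steps):
            action = agent.act(state, epsilon)
            next_state, reward, done, _ = env.step(action)
            agent.learn(state, action, reward, next_state, done)
            state, ep_return = next_state, ep_return + reward
            if done:
                break
        history.append(ep_return)                    # per-episode total reward
        epsilon = max(eps_end, epsilon * eps_decay)  # decay between episodes
    return history

class _BanditEnv:
    """One-state toy world: action 1 pays +1, action 0 pays 0; one step per episode."""
    def reset(self):
        return 0
    def step(self, action):
        return 0, float(action), True, {}

class _TinyAgent:
    """Smallest agent matching the act/learn interface used above."""
    def __init__(self):
        self.q = {(0, 0): 0.0, (0, 1): 0.0}
        self.rng = random.Random(0)
    def act(self, state, epsilon):
        if self.rng.random() < epsilon:
            return self.rng.randrange(2)
        return max((0, 1), key=lambda a: self.q[(state, a)])
    def learn(self, s, a, r, s_next, done):
        self.q[(s, a)] += 0.5 * (r - self.q[(s, a)])

history = train(_BanditEnv(), _TinyAgent(), episodes=200, max_steps=1, eps_decay=0.97)
```

The returned `history` list is exactly what the Chapter 5 charts plot.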
RL is noisy. Two runs with the “same” code can learn different behaviors because randomness changes which experiences the agent sees early on. Beginners often interpret this as “my code is broken,” when the real issue is that the experiment isn’t controlled. Reproducibility doesn’t remove randomness—it makes it manageable.
Start by choosing a single seed and applying it everywhere you use randomness:
- random.seed(seed) for Python's built-in random module
- numpy.random.seed(seed), or better, a dedicated generator (rng = np.random.default_rng(seed)) stored inside the environment

Then store the seed in your run configuration and print it at startup. If you later discover a bug, you can rerun the exact same experience stream to confirm the fix. This is especially important when you change rewards or states: without fixed seeds, you can’t tell whether performance changed because of your design choice or because the run “got lucky.”
Next, make training settings explicit: number of episodes, max steps per episode, learning rate (alpha), discount (gamma), exploration schedule (starting epsilon, ending epsilon, decay). Put these in a small config dict or a simple JSON file. A common mistake is tweaking constants scattered across files, then forgetting what produced a “good run.”
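One way to sketch such a config plus seeding; the particular values are examples, not recommendations:

```python
import json
import random

import numpy as np

def make_config(seed=0):
    """All training knobs in one place; values are examples, not prescriptions."""
    return {
        "seed": seed,
        "episodes": 2000,
        "max_steps": 100,
        "alpha": 0.1,           # learning rate
        "gamma": 0.95,          # discount
        "eps_start": 1.0,
        "eps_end": 0.05,
        "eps_decay": 0.995,
    }

def apply_seed(config):
    """Seed every randomness source and print the seed at startup."""
    random.seed(config["seed"])
    np.random.seed(config["seed"])
    print("run seed:", config["seed"])
    return np.random.default_rng(config["seed"])  # hand this rng to the environment

config = make_config(seed=42)
rng = apply_seed(config)
saved = json.dumps(config)   # the exact settings travel with the results
```

Saving `saved` next to your logs means any "good run" can name the settings that produced it.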
Finally, log results consistently. At minimum, store per-episode total reward and length. If you chart anything, chart the same thing each run (for example, a moving average of total reward). Reproducibility is not bureaucracy—it’s what lets you learn from your own experiments instead of guessing.
Training metrics can lie. During training, the agent is exploring (taking random actions some of the time), and the environment might also be randomized. If you only look at training rewards, you can’t tell whether the learned policy is actually good or just occasionally lucky.
A fair workflow is to split your runs into two modes: training mode, where epsilon is above zero and Q-values are being updated, and evaluation mode, where epsilon is zero (or near zero), the Q-table is frozen, and you only measure behavior.
Evaluation should be done over multiple episodes (for example, 20–100) and summarized with an average reward and a spread (min/max or standard deviation). One episode is not evidence. Also keep evaluation rules consistent: same max steps, same environment difficulty settings, and ideally a different seed from training so you measure general behavior, not memorization of one random sequence.
This is also where saving and loading matters. A strong check is: train, save the Q-table to disk, start a fresh process, load it, and then evaluate. If performance collapses, you likely depended on some hidden state in memory (for example, forgetting that you were still updating Q-values during “testing”).
Common mistakes include evaluating with epsilon still > 0 (“my agent is worse than before!”—it’s just exploring), comparing runs with different episode lengths, or changing reward definitions between training and testing. Treat evaluation like a small scientific experiment: fixed rules, frozen policy, repeated trials.
Once your project is structured, extending it becomes safer. The virtual pet is a great sandbox because small changes immediately create new learning challenges. The trick is to extend in a controlled way so you can still diagnose what changed.
Typical extensions fall into three categories: new state variables (for example, cleanliness or mood), new actions (for example, a cleaning action or a second kind of food), and more environment randomness (events that change hunger or energy unpredictably).
Engineering judgment: change one thing at a time. If you add two new state variables and three new actions and rewrite rewards, you won’t know what caused learning to improve or fail. A practical pattern is: add one feature, rerun training with the same seeds and settings, then evaluate. If learning breaks, roll back and simplify.
Also watch for reward loopholes. If “sleep” gives a positive reward and has no downside, the agent may sleep forever. A simple fix is a step penalty (e.g., -0.01 each step) or making actions trade off (sleep restores energy but increases hunger). Your environment should model trade-offs so the agent must learn priorities instead of exploiting a free reward.
Q-tables are perfect for learning the basics because they are transparent: you can print values and see what the agent “believes.” But they have a hard limit: the table grows with #states × #actions. If your pet state includes multiple variables with many possible values, the number of combinations explodes.
For example, suppose you discretize: hunger (10 bins), energy (10), cleanliness (10), mood (10), and location (20). That’s 10×10×10×10×20 = 200,000 states. With 8 actions, your Q-table has 1.6 million values. That’s not impossible in memory, but it becomes difficult to learn because the agent must visit many state-action pairs repeatedly to get reliable estimates.
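The same arithmetic as a quick sanity check you can run before committing to a state design:

```python
from math import prod

# Bins per state variable, matching the example in the text.
bins = {"hunger": 10, "energy": 10, "cleanliness": 10, "mood": 10, "location": 20}
n_actions = 8

n_states = prod(bins.values())      # 200,000 state combinations
table_size = n_states * n_actions   # 1,600,000 Q-values
```

Run this whenever you add a variable or a bin; the product grows multiplicatively, not additively.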
Symptoms you’ve hit the “too many states” wall: training curves that plateau early no matter how long you run, large regions of the Q-table still sitting at their initial values, and behavior that looks sensible in common states but essentially random in rare ones.
Before jumping to deep learning, you still have practical options: discretize more coarsely (fewer bins per variable), drop state variables that don’t actually change what the best action is, and train longer across multiple sessions.
Saving and loading helps here too. You can train longer across multiple sessions by persisting the Q-table (e.g., JSON, CSV, or a binary format). Just save not only Q-values, but also the mapping of states/actions and your discretization rules. A common mistake is loading a Q-table after changing the state definition—values no longer align, and behavior becomes nonsensical.
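A sketch of persisting the Q-table together with its definitions, using JSON; the guard against a changed state definition is the important part, and the function names here are just for illustration:

```python
import json

def save_qtable(path, q_table, actions, bins):
    """Persist Q-values plus the state/action definitions they depend on.

    Keys are (state_tuple, action) pairs, encoded as JSON strings.
    """
    payload = {
        "actions": actions,
        "bins": bins,   # discretization rules the values assume
        "q": {json.dumps([list(s), a]): v for (s, a), v in q_table.items()},
    }
    with open(path, "w") as f:
        json.dump(payload, f)

def load_qtable(path, actions, bins):
    """Reload, refusing to proceed if the state/action definition changed."""
    with open(path) as f:
        payload = json.load(f)
    if payload["actions"] != actions or payload["bins"] != bins:
        raise ValueError("state/action definition changed; Q-values no longer align")
    q_table = {}
    for key, v in payload["q"].items():
        s, a = json.loads(key)
        q_table[(tuple(s), a)] = v
    return q_table
```

The explicit mismatch check turns the "nonsensical behavior after a state change" bug into an immediate, readable error.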
When the world grows beyond what a table can handle, you replace the table with a function that estimates Q-values. That’s the core idea behind deeper RL: instead of storing Q(s,a) in a lookup table, you approximate it with a model that can generalize across similar states.
A low-overwhelm roadmap from here: first, practice on a slightly larger tabular world until your workflow (seeds, checkpoints, evaluation) feels routine; next, read about function approximation, where a simple model stands in for the lookup table; only then explore deep Q-learning, which uses a neural network as that function. Each step reuses everything you built in this course: the environment interface, the training loop, and the evaluation protocol.
As a final checkpoint for this course, package your virtual pet demo like a tiny product: one command to train, one command to evaluate, and one option to render a short episode. In your demo, show: (1) training curve (moving average reward), (2) saved Q-table loaded into a fresh run, and (3) evaluation results over multiple episodes with exploration off. If you can do that, you’ve moved from “I ran RL once” to “I built an RL experiment I can trust and extend.”
1. Why does Chapter 6 emphasize separating the environment, agent, and training loop into distinct parts?
2. What is the main purpose of saving and loading what the pet learned?
3. What does a fair final evaluation need to distinguish, according to the chapter?
4. What is the point of making results reproducible in this packaged project?
5. When scaling to bigger worlds, what does Chapter 6 suggest you watch for with Q-learning using a Q-table?