
Reinforcement Learning for Beginners: Train a Virtual Pet

Train a virtual pet with rewards and learn reinforcement learning from scratch.

Level: Beginner · Tags: reinforcement-learning · beginner-ai · virtual-pet · q-learning

Train a virtual pet and learn reinforcement learning from zero

This course is a short, book-style path into reinforcement learning (RL) for absolute beginners. You won’t need any background in AI, math, coding, or data science. Instead of starting with heavy theory, you will learn RL the way it is easiest to understand: by training a virtual pet using rewards.

Reinforcement learning is “learning by trying.” An agent takes an action, sees what happens, receives a reward (good or bad), and slowly builds better behavior over time. In this course, your agent is the “pet brain,” and the environment is the “pet world” with clear rules: hunger can go up, energy can go down, and some actions help more than others. By the end, you’ll have a complete beginner-friendly RL project you can explain and extend.

What you will build

You will design a small world where a pet must stay healthy and happy. The pet can choose actions like feeding, playing, or resting. At first, it will act randomly. Then you’ll teach it with a simple learning method (a Q-table) so it starts choosing better actions more often.

  • A clear pet “state” (what the pet is like right now)
  • A set of actions (what the pet can do)
  • A reward system (how the pet gets points)
  • A training loop (practice over many episodes)
  • Simple evaluation (did it actually improve?)

Why this course is different

Many RL tutorials assume you already know programming and probability words. This course explains each idea from first principles and keeps the project small enough to understand fully. You’ll learn what each piece is for, how to test it, and how to fix it when learning goes wrong.

How the chapters progress

Chapter 1 introduces RL in plain language using the virtual pet story. Chapter 2 turns the story into a real environment: states, actions, rewards, and “episode ends.” Chapter 3 introduces the Q-table, a simple memory that helps the pet learn from experience. Chapter 4 teaches the key training skill in RL: balancing curiosity (exploration) and using what you already know (exploitation). Chapter 5 focuses on measurement and reward design so you can make the pet learn the right thing (not just a trick). Chapter 6 helps you package the project, test it fairly, and understand what to learn next.

Who this is for

This course is for anyone who wants to understand reinforcement learning without being overwhelmed. It works well for students, career switchers, product managers, analysts, and anyone curious about how “agents” learn behaviors.

Get started

If you’re ready to train your first agent and finally understand what reinforcement learning is doing under the hood, jump in and follow the steps chapter by chapter. Register free to begin, or browse all courses to see related learning paths.

What You Will Learn

  • Explain reinforcement learning using the ideas of agent, environment, actions, and rewards
  • Turn a simple “virtual pet” problem into clear rules a computer can learn from
  • Build and update a Q-table to help an agent choose better actions over time
  • Use exploration vs. exploitation to avoid getting stuck in bad habits
  • Track learning with simple scores and charts to see if the pet is improving
  • Improve an RL setup by changing rewards, states, and training settings safely
  • Recognize common RL failure cases (loops, reward hacking) and fix them

Requirements

  • No prior AI or coding experience required
  • A computer with internet access
  • Willingness to follow step-by-step instructions and practice with small exercises

Chapter 1: Reinforcement Learning in Plain English

  • Meet the virtual pet: what it can do and what it should learn
  • The RL loop: observe, act, get reward, repeat
  • Actions, rewards, and goals: how behavior is shaped
  • Your first tiny environment: rules you can explain in one minute
  • Checkpoint: describe an RL problem without math

Chapter 2: Build the Pet World (States, Actions, Rewards)

  • Design the pet’s needs: hunger, energy, and happiness
  • Define actions: feed, play, rest, and ignore
  • Create rewards: what gets points and what loses points
  • Simulate one episode by hand to test your rules
  • Checkpoint: a complete environment spec

Chapter 3: The Learning Notebook: Q-Tables from Scratch

  • What a Q-value means using everyday language
  • Create a Q-table for the pet world
  • Update the table: learn from one step of experience
  • Run repeated episodes and watch the table change
  • Checkpoint: the pet starts making better choices

Chapter 4: Exploration vs. Exploitation (Teaching Good Habits)

  • Why “always pick the best known action” can fail
  • Epsilon-greedy choices: controlled curiosity
  • Tune training settings: learning rate, discount, exploration
  • Spot bad learning patterns: loops and random behavior
  • Checkpoint: consistent improvement across episodes

Chapter 5: Measure Progress and Improve the Rewards

  • Define success metrics: average score and survival time
  • Visualize learning: simple charts and moving averages
  • Reward shaping: guide the pet without “cheating”
  • Prevent reward hacking: when the pet exploits your rules
  • Checkpoint: a stronger, more reliable pet agent

Chapter 6: Package Your First RL Project (and What’s Next)

  • Refactor into a clean project: environment, agent, training loop
  • Save and load what the pet learned
  • Run a final evaluation: training vs. testing
  • Next steps: bigger worlds and deeper methods (no overwhelm)
  • Final checkpoint: present your virtual pet RL demo

Sofia Chen

Machine Learning Engineer, Reinforcement Learning Educator

Sofia Chen builds practical machine learning systems and specializes in teaching reinforcement learning to beginners. She focuses on clear, step-by-step explanations and small projects that feel like real progress. Her courses are designed for learners starting from zero and aiming to build working agents quickly.

Chapter 1: Reinforcement Learning in Plain English

Reinforcement learning (RL) is the most “human-feeling” style of machine learning: the system learns by trying things, seeing what happens, and remembering what worked. In this course, you’ll train a simple virtual pet. The pet won’t “understand” the world the way you do, and it won’t read instructions. Instead, it will follow a loop: observe its situation, take an action, receive a reward (good or bad), and repeat. Over time it will develop preferences for actions that lead to better outcomes.

This chapter builds the mental model you’ll use for the rest of the book. You’ll meet the pet and define what it can do, then translate that into a tiny environment with rules you can explain in one minute. You’ll also learn to describe an RL problem without math: identify the agent, the environment, the actions, and the rewards—and define what “better” means in a measurable way.

As you read, keep one engineering idea in mind: RL is not magic; it’s a feedback system. If you design the feedback (states and rewards) well, learning looks impressive. If you design it poorly, the pet will learn weird habits quickly and confidently. The skill is not only running training, but shaping the problem so the training signal matches what you actually want.

  • We will use a “virtual pet” as our running example.
  • We will focus on simple, explainable rules before adding complexity.
  • We will treat progress as something you can measure, chart, and debug.

By the end of this chapter you should be able to describe an RL setup in plain English and predict how changes to rewards or available actions might change behavior. That ability—clear problem formulation—is the foundation for building a Q-table and improving it safely in later chapters.

Practice note for each topic in this chapter (meeting the virtual pet; the RL loop; actions, rewards, and goals; your first tiny environment; the checkpoint): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: What “learning by rewards” means

“Learning by rewards” means the agent isn’t given the correct action up front. Instead, it experiments, and the environment scores the result. A reward is just a number the environment returns after an action. Positive rewards encourage a behavior; negative rewards discourage it. The key is that the agent’s goal is to maximize reward over time, not to “do what humans would do.” This is why reward design is the heart of reinforcement learning.

Meet our virtual pet: imagine a small creature with needs (hunger, boredom, energy) that change as time passes. The pet can do a few things—eat, sleep, play, or do nothing. You decide what “good pet care” means by setting rewards. For example, if the pet eats when hungry, give a positive reward; if it ignores hunger for too long, give a negative reward. The pet will eventually prefer actions that tend to produce higher total reward.

A common beginner mistake is to treat rewards like praise for “nice” behavior rather than a precise control signal. In engineering terms, reward is your specification. If you reward “play” every time, the pet may play constantly and starve. If you punish “sleep” too strongly, the pet may learn to stay awake even when exhausted because it avoids punishment in the short term. Good RL setups use rewards to encode trade-offs and long-term consequences.

Practical outcome: you should be able to write down two or three sentences that explain what the pet is trying to maximize (total reward) and how the environment will judge each action (reward rules). That’s already an RL problem, even before any code.
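Those few sentences are already enough to sketch as code. A minimal, hypothetical version (the point values, thresholds, and function name are illustrative, not the course's official rules):

```python
# Hypothetical reward rules for the pet. The agent's goal is to maximize
# the sum of these numbers over time; the exact values are illustrative.
def reward(hunger_before, action, hunger_after):
    if action == "eat" and hunger_before == "high":
        return 2    # eating when hungry: good care
    if hunger_after == "high":
        return -2   # ignoring hunger for too long: bad care
    return 0        # neutral otherwise

print(reward("high", "eat", "low"))   # prints 2: rewarded
print(reward("low", "play", "high"))  # prints -2: penalized
```

Notice that nothing here tells the pet what to do; the rules only score outcomes, and learning comes from trying actions and comparing scores.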

Section 1.2: Agent vs. environment (who controls what)

RL becomes clear when you separate “agent” from “environment.” The agent is the decision-maker: it chooses actions. The environment is everything else: it provides observations (state information), applies the consequences of actions, and returns rewards. The environment also controls the rules of the world—what changes, what is allowed, and when an episode ends.

In our virtual pet scenario, the pet is the agent. The world around it—its hunger level increasing over time, energy decreasing when playing, the reward for eating when hungry—is the environment. This division matters because you can only train what the agent controls. If the pet keeps getting hungry, that’s not a “bug in the agent”; that’s environment dynamics, and the agent must learn to act under those dynamics.

Engineering judgment shows up in how much you put into the environment versus the agent. If your environment is too vague (“reward = +1 if pet is good”), the agent has no usable feedback. If your environment is too complex (dozens of needs and random events), training becomes slow and confusing. For beginners, aim for an environment with rules you can explain in one minute: a few needs, a few actions, and simple, consistent reward signals.

Also note what the agent does not control: it doesn’t directly set its hunger to zero, it doesn’t decide what reward it receives, and it doesn’t rewrite the rules. When you see odd behavior, ask: is the agent making a poor choice, or did we define the environment so that the “best” reward comes from an unintended shortcut?

Practical outcome: you can label each part of your project as agent code (policy/Q-table update) or environment code (state transitions/reward calculation). This separation makes debugging and improvement much easier.
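That separation can be shown in a few lines. A hypothetical split (the class and function names are assumptions, not course code): the environment owns state and reward, the agent only chooses.

```python
class PetEnv:
    """Environment: owns the rules, state transitions, and rewards."""
    def __init__(self):
        self.hunger = 2                              # 0 (full) .. 4 (starving)

    def step(self, action):
        if action == "eat":
            self.hunger = max(0, self.hunger - 2)
        else:
            self.hunger = min(4, self.hunger + 1)    # hunger drifts upward
        reward = -2 if self.hunger == 4 else 0       # environment sets reward
        return self.hunger, reward

def choose_action(state):
    """Agent: only chooses; it cannot set hunger or rewards directly."""
    return "eat" if state >= 3 else "play"

env = PetEnv()
state = env.hunger
for _ in range(5):
    action = choose_action(state)
    state, r = env.step(action)
```

With this split, a debugging question like "why does hunger keep rising?" points you at `PetEnv.step` (environment dynamics), while "why does the pet keep playing?" points you at `choose_action` (the policy).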

Section 1.3: States, actions, and rewards (with pet examples)

To make the RL loop work, you must define three things clearly: state, actions, and rewards. The state is the information the agent uses to decide. Actions are the choices available. Rewards are the numeric feedback after acting. If any of these are fuzzy, learning becomes unstable or meaningless.

For a beginner-friendly virtual pet, keep the state small and discrete so it fits naturally into a Q-table later. Example: represent hunger as {Low, Medium, High} and energy as {Low, High}. Then a state might be (Hunger=High, Energy=Low). This is enough for the pet to learn sensible trade-offs: when hunger is high, eating is usually good; when energy is low, sleeping might matter even if playing is tempting.

Define actions as a short list: Eat, Play, Sleep, Wait. Each action should have predictable effects. For instance: Eat decreases hunger; Play increases fun but costs energy and may increase hunger later; Sleep restores energy but time passes (hunger may rise). Then define rewards tied to outcomes, not intentions. A practical reward rule might be: +2 if hunger becomes Low after Eat; -2 if hunger is High at the end of the step; +1 if fun improves; -1 if energy becomes Low.
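A sketch of that state/action/reward layout, with the reward rules simplified to depend only on the outcome state (an assumption made here for brevity):

```python
from itertools import product

HUNGER = ["Low", "Medium", "High"]
ENERGY = ["Low", "High"]
ACTIONS = ["Eat", "Play", "Sleep", "Wait"]

# Every state is a (hunger, energy) pair: 3 x 2 = 6 states.
STATES = list(product(HUNGER, ENERGY))

def reward(hunger_after, energy_after, fun_improved):
    """Outcome-based reward rules adapted from the text."""
    r = 0
    if hunger_after == "Low":
        r += 2        # hunger ended Low (text: +2 after a successful Eat)
    if hunger_after == "High":
        r -= 2        # hunger is High at the end of the step
    if fun_improved:
        r += 1        # fun improved
    if energy_after == "Low":
        r -= 1        # energy ended Low
    return r

print(len(STATES))                               # 6
print(reward("High", "Low", fun_improved=True))  # -2
```

Even this toy version shows a trade-off: playing can earn +1 for fun while costing -1 for low energy and -2 for high hunger, so "play forever" loses to balanced care.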

  • Good states are informative but not huge; they help the agent choose differently in different situations.
  • Good actions are meaningful; each should change the situation in a way the agent can learn.
  • Good rewards align with your goal; avoid rewarding something easy to exploit.

Common mistake: mixing “state” with “history.” Beginners often want the state to include everything (what happened five steps ago, the pet’s full backstory, etc.). In practice, you start minimal and add only what improves decisions. Another mistake is rewards that are too delayed: if the pet only gets a reward at the end of the day, it’s hard to learn which action helped. Early on, give small, frequent signals that point in the right direction.

Practical outcome: you can list the exact state variables, the exact action list, and 3–6 reward rules that shape behavior toward your goal.

Section 1.4: Episodes and steps (a day in the pet’s life)

RL training is organized into steps and episodes. A step is one turn of the loop: the pet observes the current state, chooses an action, the environment updates the state, and a reward is returned. An episode is a sequence of steps with a clear beginning and end—like “one day in the pet’s life” or “one play session.”

Thinking in episodes keeps your environment explainable. For example, define one episode as 50 steps, each representing one minute. The pet starts with moderate hunger and high energy. Each step, the pet chooses one action. The environment then applies rules: hunger drifts upward over time; sleep restores energy; play reduces energy and increases hunger slightly. The episode ends when time runs out or when a failure condition occurs (e.g., hunger stays High for too many steps). Ending conditions matter because they determine what the agent experiences and what it learns to avoid.
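The "one day in the pet's life" framing can be sketched as a loop. The drift and restore amounts below are illustrative assumptions, and the policy passed in is hand-written just to exercise the loop:

```python
# Sketch of one episode: up to 50 steps, hunger drifts up every step.
# Drift/restore amounts and reward values are illustrative assumptions.
def run_episode(policy, max_steps=50):
    hunger, energy = 2, 4                # moderate hunger, high energy (0..4)
    total_reward = 0
    for t in range(max_steps):
        action = policy(hunger, energy)
        if action == "eat":
            hunger = max(0, hunger - 2)
        elif action == "sleep":
            energy = min(4, energy + 2)
        elif action == "play":
            energy = max(0, energy - 1)
            hunger = min(4, hunger + 1)
        hunger = min(4, hunger + 1)      # hunger drifts upward every step
        reward = -5 if hunger == 4 else 1
        total_reward += reward
        if hunger == 4 and t > 5:        # failure condition ends the day early
            break
    return total_reward

# A scripted policy to test the loop: eat when hungry, otherwise sleep.
score = run_episode(lambda h, e: "eat" if h >= 3 else "sleep")
```

Running one episode by hand like this (or on paper) is the fastest way to check that your rules, drift, and ending conditions behave the way you intended.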

Episodes also give you natural checkpoints for tracking learning. After each episode, you can compute the total reward (the “daily score”) and store it. Over many episodes, you’ll see whether the pet is improving. If the score isn’t improving, you debug systematically: are rewards too small or contradictory? Are states too coarse to distinguish important situations? Are episodes too short for actions to matter?

Engineering judgment: keep early episodes short and consistent. Randomness (like random hunger spikes) can be useful later for robustness, but too much randomness early makes learning noisy and discouraging to interpret. Start with deterministic, one-minute explainable rules. Once the pet learns basics, introduce small variability to prevent brittle habits.

Practical outcome: you can describe exactly what happens in one step and what counts as the start/end of an episode, using the “day in the pet’s life” framing.

Section 1.5: What success looks like (scores and goals)

RL needs a measurable definition of success. For our virtual pet, success is not a vague “acts cute” objective; it’s a pattern of behavior that earns high total reward under your environment rules. This is why you must choose goals and scoring that reflect what you care about.

A simple goal: “keep the pet healthy and entertained.” Translate that into metrics. The most direct metric is episode return (sum of rewards across the episode). But it helps to track additional counters to understand why return changes: how many steps hunger was High, how often energy was Low, how many times the pet played, and how many episodes ended early due to failure. These are not used directly for learning (initially), but they are essential for debugging and safe improvement.

When you later build a Q-table, you will be updating estimates of “how good it is to take action A in state S.” Success then looks like: (1) the average episode return trends upward; (2) failure episodes become rare; (3) the pet’s behavior becomes stable and sensible (e.g., it usually eats when hungry, sleeps when tired). Charting is your friend: plot episode return over time and optionally a moving average to smooth noise. A flat line means either the pet is not learning or the environment does not provide learnable feedback.
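A sketch of that bookkeeping, assuming episode returns are collected in a plain list (the numbers here are synthetic, and plotting is left to whatever charting tool you prefer):

```python
# Track episode returns and smooth them with a simple moving average.
def moving_average(values, window=10):
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

returns = [2, 3, 1, 5, 4, 6, 7, 6, 8, 9]   # pretend episode scores
smoothed = moving_average(returns, window=3)
# An upward-trending smoothed curve suggests learning; a flat one suggests
# the pet is not learning or the feedback is not learnable.
```

The moving average matters because single-episode returns are noisy; judging progress from one episode is like judging a diet from one meal.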

Common scoring mistake: using only penalties (all rewards negative). This can work, but beginners often find it hard to interpret. Another mistake: giving a huge reward for one event (like eating once) and tiny penalties for everything else, which can lead to reward hacking (the pet repeats the big-reward trick even if it harms long-term health). Aim for balanced signals where the best long-term strategy clearly wins.

Practical outcome: you can state your goal, define the episode score, and list 2–4 diagnostic counters you would chart to verify improvement.

Section 1.6: Common misconceptions (what RL is not)

RL is powerful, but beginners often expect the wrong things. First, RL is not “the agent understands the world.” The pet is not reasoning with human concepts like “health” unless you encode those ideas into state and reward. If the pet learns to alternate Eat and Sleep, it’s not because it cares—it’s because that sequence yields higher reward in your rules.

Second, RL is not guaranteed to produce “nice” behavior. The agent optimizes the reward you wrote, not the reward you intended. If there’s a loophole, the pet will find it. For example, if you reward “fun” without any hunger penalty, the pet may play forever. If you end episodes early when hunger is High, the agent may learn to intentionally trigger early termination if that avoids larger future penalties (depending on how rewards are assigned). These are design issues, not moral failures by the agent.

Third, RL is not always the right tool. If you already know the correct rules (“if hungry then eat”), a simple scripted policy may be better. RL shines when you can define feedback but don’t want to hand-code all decisions, or when there are trade-offs and delayed effects that are hard to tune manually.

Finally, RL is not only “run training longer.” If learning stalls, the fix is often to adjust the environment: refine states, reshape rewards, or change training settings (like how often the agent explores) in a controlled way. Make one change at a time and keep notes, because RL systems can change behavior dramatically from small tweaks.

Practical outcome: you can describe an RL problem without math—agent, environment, actions, rewards, episodes, and success metrics—and you can predict at least one way a poorly designed reward might create an unintended habit.

Chapter milestones
  • Meet the virtual pet: what it can do and what it should learn
  • The RL loop: observe, act, get reward, repeat
  • Actions, rewards, and goals: how behavior is shaped
  • Your first tiny environment: rules you can explain in one minute
  • Checkpoint: describe an RL problem without math
Chapter quiz

1. Which description best matches how the virtual pet learns in reinforcement learning (RL) in this chapter?

Correct answer: It learns by trying actions, seeing outcomes, and remembering what led to better rewards
The chapter frames RL as learning through trial, feedback (rewards), and repetition—not instructions or imitation.

2. In the RL loop described in the chapter, what happens immediately after the agent takes an action?

Correct answer: The agent receives a reward (good or bad) and then continues the loop
The loop is: observe, act, get reward, repeat; reward is the feedback right after acting.

3. When describing an RL problem without math, which set of items does the chapter say you should identify?

Correct answer: Agent, environment, actions, rewards, and what “better” means measurably
The chapter emphasizes clear problem formulation: define the agent, environment, actions, rewards, and measurable success.

4. What is the main engineering warning the chapter gives about RL being a “feedback system”?

Correct answer: Poorly designed states or rewards can cause the pet to learn weird habits quickly and confidently
Because RL follows the feedback you design, bad rewards or state design can reliably train undesired behavior.

5. Why does the chapter emphasize starting with a tiny environment whose rules you can explain in one minute?

Correct answer: So behavior and rewards are easy to measure, chart, and debug before adding complexity
The course prioritizes simple, explainable setups so you can predict and debug how actions and rewards shape behavior.

Chapter 2: Build the Pet World (States, Actions, Rewards)

Before you can “train” a virtual pet, you must build the world it lives in. In reinforcement learning (RL), this world is called the environment. Your pet is the agent. The environment provides the agent with a state (what the pet’s situation looks like right now). The agent chooses an action (what to do). Then the environment returns a reward (a score that signals how good that choice was) and a new state (how the world changed).

This chapter turns a fuzzy idea—“take care of the pet”—into rules a computer can simulate thousands of times. The goal is not realism; the goal is a clean, learnable system. If your rules are inconsistent or too complicated, the agent will learn slowly, learn the wrong thing, or appear random. If your rules are simple and aligned with what you mean by “good care,” the Q-learning in later chapters will have a stable foundation.

We will design three needs (hunger, energy, happiness), define four beginner-friendly actions (feed, play, rest, ignore), and create rewards that push the agent toward balanced care instead of one repetitive habit. You will also simulate one short episode by hand, because stepping through a few timesteps is the fastest way to catch mistakes before you write training code.

  • State: what the pet needs right now (numbers or categories)
  • Action: the choice the agent can make (feed/play/rest/ignore)
  • Reward: immediate score for that action in that state
  • Episode: one run from start until an ending condition

By the end of this chapter you will have a complete environment specification: states, actions, transitions, rewards, and termination rules—enough to plug into a Q-table trainer.

Practice note for each topic in this chapter (designing the pet's needs; defining actions; creating rewards; simulating one episode by hand; the checkpoint): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: Choosing simple state variables

State is the information your agent is allowed to use when making a decision. For a beginner RL project, you want state variables that are (1) few in number, (2) easy to update, and (3) directly related to the behavior you want. Our virtual pet will have three needs: hunger, energy, and happiness. The simplest version is to represent each need as a small integer scale.

A practical choice is a discrete 0–4 scale for each need. For example: hunger 0 means “not hungry,” hunger 4 means “very hungry.” Energy 0 means “exhausted,” energy 4 means “fully rested.” Happiness 0 means “sad,” happiness 4 means “thriving.” Discrete bins keep the Q-table small: with 5 values each, you have 5×5×5 = 125 possible states—easy to store and train.

Engineering judgment: resist the urge to add more variables (cleanliness, thirst, health, boredom) early on. Each new variable multiplies the state space and makes learning slower. Another common mistake is mixing directions (e.g., “hunger” where higher is worse, “energy” where higher is better) and then forgetting which way is which when writing rewards. If you keep the meaning consistent—higher means “more of that property,” even if more hunger is bad—your transition and reward code stays readable.

Define the state explicitly as a tuple: (hunger, energy, happiness). Also decide the starting state for each episode, such as (2, 2, 2) for “average needs.” Fixed starts make debugging easier; later you can randomize starts to improve robustness.
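That spec fits in a few lines. A sketch using the 0-4 scales and the fixed start described above:

```python
from itertools import product

NEEDS = range(5)                              # each need is 0..4
STATES = list(product(NEEDS, NEEDS, NEEDS))   # (hunger, energy, happiness)
assert len(STATES) == 125                     # 5 x 5 x 5, as in the text

START_STATE = (2, 2, 2)                       # fixed "average needs" start
assert START_STATE in STATES
```

The assertion on the state count is a cheap sanity check: if you later add a fourth need or a sixth bin, it fails immediately and reminds you that the Q-table just grew.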

Section 2.2: Discrete actions a beginner can manage

Actions are the buttons your agent can press. Beginner environments work best when the action list is short, meaningful, and always available. We’ll use four actions that map cleanly to pet care: feed, play, rest, and ignore. These are discrete actions: one choice per timestep.

Make actions “atomic.” Each action should represent a single, simple interaction that predictably changes needs. For example, feed should primarily reduce hunger, while rest should primarily increase energy. Play should increase happiness but usually costs energy and may increase hunger. Ignore does nothing helpful and allows needs to drift in a worse direction. The agent learns by comparing these trade-offs.

Two practical design rules help avoid confusion. First, keep the action set constant; don’t remove actions in some states (e.g., “can’t play if energy is 0”) unless you also handle invalid actions consistently. If you want constraints, a common approach is: allow the action but make it ineffective and/or penalized when the pet is too tired, so the agent learns not to waste moves. Second, ensure every action is sometimes useful. If one action is always dominated (always worse than another), it becomes noise and slows learning.

In code later, you’ll map these actions to integers for the Q-table, such as: 0=feed, 1=play, 2=rest, 3=ignore. Write the mapping down now as part of your environment spec so your training loop and your analysis charts interpret actions consistently.
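Writing that mapping down as code keeps the trainer and the analysis charts consistent. A minimal sketch:

```python
# Action mapping from the spec: list index = integer id used by the Q-table.
ACTIONS = ["feed", "play", "rest", "ignore"]
ACTION_ID = {name: i for i, name in enumerate(ACTIONS)}

print(ACTION_ID)   # {'feed': 0, 'play': 1, 'rest': 2, 'ignore': 3}
```

Deriving the dictionary from the list (rather than typing both by hand) means there is exactly one place to edit if the action set ever changes.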

Section 2.3: Reward design that encourages good care

Reward design is where you translate “good pet owner” into a score signal the agent can optimize. A good reward is aligned with your intention, not just with a single metric that can be exploited. If you only reward happiness, the agent may spam play even when hunger and energy are critical. If you only punish hunger, it may over-feed and ignore happiness. Balanced care is the goal, so reward should reflect balance.

A practical starting pattern is: (1) small living penalty each step to encourage efficiency, (2) positive reward for improving a need that is currently bad, and (3) negative reward for letting any need reach dangerous levels. For example:

  • Step cost: −1 each timestep (prevents endless stalling)
  • Need improvement bonus: +2 if an action moves the most critical need toward safety (e.g., feeding when hunger is high)
  • Critical penalties: −10 if hunger hits 4 (starving), −10 if energy hits 0 (collapse), −10 if happiness hits 0 (very unhappy)
  • Care bonus: +5 if all needs are in a “good band,” such as hunger ≤1, energy ≥3, happiness ≥3
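One possible encoding of these rules, assuming a (hunger, energy, happiness) state tuple on a 0-4 scale. The function shape and the way "most critical need" is computed are assumptions for illustration:

```python
# A minimal sketch of the reward rules above; thresholds match the text,
# but the (hunger, energy, happiness) encoding is an assumption.
def reward(prev, curr):
    hunger, energy, happiness = curr
    r = -1                                            # step cost each timestep
    # Which need was most critical before acting? Hunger is dangerous when
    # high; energy and happiness are dangerous when low.
    danger = {"hunger": prev[0], "energy": 4 - prev[1], "happiness": 4 - prev[2]}
    worst = max(danger, key=danger.get)
    improved = {"hunger": hunger < prev[0],
                "energy": energy > prev[1],
                "happiness": happiness > prev[2]}
    if improved[worst]:
        r += 2                                        # need-improvement bonus
    for hit in (hunger >= 4, energy <= 0, happiness <= 0):
        if hit:
            r -= 10                                   # critical penalty
    if hunger <= 1 and energy >= 3 and happiness >= 3:
        r += 5                                        # care bonus: good band
    return r
```

Note that the function looks mostly at the resulting state, not the action label, which is exactly the outcome-focused structure discussed next.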

Notice the structure: the reward is mostly about outcomes (state after action), not the action label itself. This helps avoid “reward hacking” where the agent learns to press a certain button regardless of state. Another common mistake is making rewards too large or too sparse. If you only reward at the end of an episode, learning is slow because the agent doesn’t know which decisions mattered. If you make every tiny change worth huge points, training becomes unstable and the agent may chase short-term swings.

Keep numbers small and comparable (single digits to low tens). You can tune later, but start with something you can reason about by hand. In the next section we’ll define when episodes end, because termination interacts strongly with reward: if bad states end the episode, the agent experiences those outcomes more sharply.

Section 2.4: Terminal conditions (when an episode ends)

An episode is one “life segment” of your pet: a sequence of timesteps where the agent tries to keep things going. Terminal conditions define when the episode ends. In RL terms, a terminal state is absorbing: once reached, the run stops and the environment resets. Clear terminal rules prevent confusing edge cases and let you measure progress with episode length and total reward.

For a virtual pet, natural terminal events are severe neglect. Choose a small set of endings that are easy to detect from the state. For example, end the episode if any of these happens:

  • Starvation: hunger == 4
  • Exhaustion: energy == 0
  • Misery: happiness == 0
  • Time limit: t reaches a max, e.g., 30 steps (prevents infinite episodes)
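These four endings translate directly into a small check function. A sketch, again assuming a (hunger, energy, happiness) tuple on the 0-4 scale:

```python
MAX_STEPS = 30  # time limit from the list above

def is_terminal(state, t):
    """Return (done, reason) for a (hunger, energy, happiness) state at step t."""
    hunger, energy, happiness = state
    if hunger >= 4:
        return True, "starvation"
    if energy <= 0:
        return True, "exhaustion"
    if happiness <= 0:
        return True, "misery"
    if t >= MAX_STEPS:
        return True, "time limit"
    return False, None
```

Returning the reason string is optional, but it makes debugging and episode logs much easier to read.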

The time limit is important even if you have failure endings. Once the agent gets good, it might maintain safe values for a long time, and you still want training to cycle through many episodes. Also, a fixed horizon makes learning curves easier to compare.

Design choice: do you end immediately on a critical state, or allow recovery? Ending immediately makes the consequences of neglect sharp and simple. Allowing recovery is more realistic but requires careful reward shaping and can create confusing loops (the agent repeatedly “almost fails” to harvest some reward). For a first project, immediate termination on critical thresholds is safer.

Finally, decide what reward happens on termination. A typical approach is to apply a strong negative terminal reward (e.g., −20) in addition to any step reward, so the agent clearly prefers policies that avoid ending conditions. Write this down explicitly so your implementation is deterministic.

Section 2.5: Transition rules (how actions change the state)

Transition rules are the physics of your pet world: given the current state and an action, how do hunger, energy, and happiness change? These rules must be consistent, bounded, and easy to compute. A reliable beginner pattern is to apply two layers each step: (1) the action effect, then (2) a small “natural drift” that happens every timestep (pets get a bit hungrier over time, for example). This prevents the agent from finding a static state where nothing changes.

Here is a concrete, beginner-friendly deterministic transition specification. Clamp all variables to their allowed ranges after updates (hunger 0–4, energy 0–4, happiness 0–4):

  • Action: feed → hunger −2, happiness +0, energy +0
  • Action: play → happiness +2, energy −1, hunger +1
  • Action: rest → energy +2, happiness −1, hunger +1
  • Action: ignore → happiness −1, hunger +1, energy −1
  • Natural drift (after action) → hunger +1 each step (pets get hungry over time)
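The specification above can be sketched as a single step function that applies the action effect, then the drift, then the clamp, in that fixed order. The table of deltas is copied from the list; the function name is an assumption:

```python
def clamp(x):
    return max(0, min(4, x))  # all needs live on the 0-4 scale

# Per-action deltas as (hunger, energy, happiness), matching the table above.
EFFECTS = {
    "feed":   (-2,  0,  0),
    "play":   (+1, -1, +2),
    "rest":   (+1, +2, -1),
    "ignore": (+1, -1, -1),
}

def step(state, action):
    """Apply the action effect, then natural drift (hunger +1), then clamp."""
    dh, de, dp = EFFECTS[action]
    hunger, energy, happiness = state
    hunger, energy, happiness = hunger + dh, energy + de, happiness + dp
    hunger += 1  # natural drift: the pet gets hungrier every step
    return (clamp(hunger), clamp(energy), clamp(happiness))
```

Because the order (action, drift, clamp) is fixed inside one function, the common bug of applying drift in two different places cannot occur.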

With these numbers, each action has a cost and a benefit, and none is always best. Feeding is powerful against hunger but doesn’t directly build happiness. Playing improves happiness but drains energy and increases hunger. Rest restores energy but can reduce happiness (the pet gets bored) and increases hunger. Ignoring accelerates decline and should rarely be chosen once the agent learns.

Common mistakes: forgetting to clamp values (leading to hunger becoming −3 or 99), applying drift twice, or applying drift before action in one part of code and after action elsewhere. Choose one order and document it. Determinism is also helpful early: add randomness later, after the Q-learning loop works, because randomness makes debugging harder.

Section 2.6: Testing the environment before training

Before you train anything, test the environment like an engineer. The fastest test is to simulate one short episode by hand and check that the numbers behave as expected. This catches reward mistakes (“playing gives points even when the pet is starving”), transition bugs (hunger moving the wrong way), and termination issues (episodes never end, or end instantly).

Start from (hunger=2, energy=2, happiness=2) with a 30-step time limit. Suppose you take actions: play → feed → rest → play. Apply the transition rules and drift consistently:

  • t0 state (2,2,2). Action play: happiness +2 → 4, energy −1 → 1, hunger +1 → 3; drift hunger +1 → 4. New state (4,1,4) → this triggers starvation terminal immediately if your rule is hunger==4 ends. That tells you something important: with drift, play may be too risky when hunger is moderate.

This is exactly why manual simulation matters. You can respond in several ways: (1) apply the hunger drift only every other step, (2) reduce play's hunger increase, or (3) expand the hunger scale and raise the terminal threshold to hunger == 5. There is no single "correct" choice—your goal is a learnable environment where good policies exist and failures are avoidable with reasonable play.

Also run “sanity checks” without math: if the pet is starving, does feeding reliably help? If energy is low, does rest help but not magically fix hunger and happiness too? If you ignore repeatedly, does the episode end quickly with negative total reward? These checks ensure the reward and termination rules produce the story you intend.

Checkpoint deliverable: write a one-page environment spec containing (a) state variables and ranges, (b) action list and integer mapping, (c) transition table including drift and clamping, (d) reward rules including terminal rewards, (e) terminal conditions and time limit, and (f) starting state. Once this spec is stable, you’re ready to implement the environment and begin Q-table training in the next chapter.

Chapter milestones
  • Design the pet’s needs: hunger, energy, and happiness
  • Define actions: feed, play, rest, and ignore
  • Create rewards: what gets points and what loses points
  • Simulate one episode by hand to test your rules
  • Checkpoint: a complete environment spec
Chapter quiz

1. In Chapter 2, what sequence describes the basic interaction loop between the pet (agent) and the world (environment)?

Show answer
Correct answer: Environment gives a state → agent chooses an action → environment returns a reward and a new state
The chapter defines the RL loop as state from the environment, action from the agent, then reward and next state from the environment.

2. Why does the chapter emphasize keeping the pet-world rules simple and consistent?

Show answer
Correct answer: Because inconsistent or overly complex rules can make the agent learn slowly, learn the wrong thing, or look random
The chapter argues the goal is a clean, learnable system; bad rule design harms learning behavior.

3. Which set correctly lists the three needs designed as part of the pet’s state in this chapter?

Show answer
Correct answer: Hunger, energy, happiness
The chapter specifies three needs to represent the pet’s situation: hunger, energy, and happiness.

4. Which best describes the purpose of creating rewards in this environment?

Show answer
Correct answer: To signal how good an action choice was and push the agent toward balanced care instead of repeating one habit
Rewards are immediate scores intended to guide behavior, specifically toward balanced care.

5. Why does the chapter recommend simulating one short episode by hand before writing training code?

Show answer
Correct answer: Because stepping through a few timesteps is a fast way to catch mistakes in states, transitions, and rewards
The chapter frames hand simulation as a quick validation step to find rule issues early.

Chapter 3: The Learning Notebook: Q-Tables from Scratch

In Chapter 2 you turned a “virtual pet” into something a computer can interact with: a small world with clear states, a small set of actions, and a reward signal. Now we give the pet a learning notebook. This notebook is not magical—it’s a table of numbers that gets edited after every experience. Over time, those edits make the pet less random and more sensible.

This chapter focuses on Q-tables (a classic, beginner-friendly reinforcement learning tool). You will learn what a Q-value means in everyday terms, how to set up a Q-table for the pet world, how to update it from one step of experience, how repeated episodes make the table “fill in,” and what it looks like when the pet starts making better choices.

As you read, keep one engineering mindset: the table is a memory of “what tended to work,” not a guarantee. Your job is to shape the pet’s memory so it learns stable habits and doesn’t get stuck in silly loops.

Practice note for What a Q-value means using everyday language: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Create a Q-table for the pet world: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Update the table: learn from one step of experience: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Run repeated episodes and watch the table change: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Checkpoint: the pet starts making better choices: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: From “try things” to “remember what worked”

Imagine your pet has a tiny notebook. For every situation it can recognize (a state), it writes down how good each possible choice (an action) seems. Each note is a Q-value: “If I’m in this situation and I do this action, how good should I expect the outcome to be?” In everyday language, Q-values are expected usefulness scores.

At the start, the notebook is empty—so we usually fill it with zeros. That does not mean “everything is equally good in real life.” It means “the pet has no evidence yet.” Then the pet begins to act, receive rewards, and adjust the numbers. After enough experiences, the Q-values become a practical guide: the pet can look at a state, compare the action scores, and pick the best one most of the time.

This is the key shift from “try things” to “remember what worked.” Random trying is still important early on (exploration), but the notebook allows the pet to cash in what it has learned later (exploitation).

  • State: what the pet can observe right now (e.g., Hungry vs. Not Hungry).
  • Action: what it can do (e.g., Feed, Play, Rest).
  • Reward: an immediate score from the environment (e.g., +5 if feeding reduces hunger, -3 if playing while hungry).
  • Q-value: the pet’s running estimate of how good an action is in a state, considering both immediate and future outcomes.

A common mistake is to treat Q-values like rules (“always do X”). They are not rules; they are estimates based on experience. If your rewards are noisy or your states are too vague, the notebook will reflect that—and the pet will learn confusing habits.

Section 3.2: The Q-table layout (rows, columns, meaning)

A Q-table is literally a table: rows are states, columns are actions, and each cell contains a Q-value. The layout forces you to be explicit about what the agent can sense and what it can do—this is why Q-learning is such a good teaching tool.

Let’s define a simple pet world you can actually implement. Suppose the pet tracks two needs: hunger and energy. To keep the table small for beginners, we discretize them into binary categories:

  • Hunger: Hungry or Full
  • Energy: Tired or Rested

That gives 4 states total: (Hungry,Tired), (Hungry,Rested), (Full,Tired), (Full,Rested). Now define three actions: Feed, Play, Rest. Your Q-table is a 4×3 grid.

Each cell Q(s,a) answers: “If I am in state s and choose action a, how good is that choice expected to be?” In code, you can represent the table as:

  • a 2D array indexed by state_id and action_id, or
  • a dictionary keyed by (state, action) pairs.
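Both representations can be sketched in a few lines. The state and action names below mirror the 4x3 example; the variable names themselves are illustrative assumptions:

```python
# Sketch of the 4x3 Q-table for the two-need pet world.
STATES = [("Hungry", "Tired"), ("Hungry", "Rested"),
          ("Full", "Tired"), ("Full", "Rested")]
ACTIONS = ["Feed", "Play", "Rest"]

# Option 1: a 2D list indexed by state_id and action_id, initialized to zeros.
q_array = [[0.0] * len(ACTIONS) for _ in STATES]

# Option 2: a dictionary keyed by (state, action) pairs.
q_table = {(s, a): 0.0 for s in STATES for a in ACTIONS}
```

The array is compact and fast; the dictionary is easier to read when printing and debugging. For a table this small, either works.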

Engineering judgment: keep the state space small at first. Beginners often add too many state details (“mood,” “cleanliness,” “time of day,” “toy type”) and end up with a table that never gets enough experiences per cell to learn reliable values. If you want richer behavior later, you can expand the state gradually, but first get the learning loop working end-to-end.

Practical outcome: once your Q-table exists, you can print it after each episode. Watching numbers change is one of the clearest ways to “see” reinforcement learning happening.

Section 3.3: The update rule (learning rate explained simply)

The heart of Q-learning is the update rule: after the pet takes one action and sees what happens, it revises the notebook entry for that (state, action). The simplest form looks like:

Q(s,a) ← Q(s,a) + α × (target − Q(s,a))

This reads like common sense editing: “new value = old value + (a fraction of the mistake).” The learning rate α (alpha) is that fraction. If α is 0.1, you correct 10% of the gap between what you predicted (Q) and what you just observed (the target). If α is 1.0, you overwrite the old value completely with the new target.

In everyday terms, α controls how quickly your pet changes its mind:

  • Small α (e.g., 0.05 to 0.2): slow, steady learning; more stable but takes more episodes.
  • Large α (e.g., 0.5 to 1.0): learns quickly from recent events; can become jumpy or unstable if rewards vary a lot.

Common mistake: setting α high to “learn faster” and then being surprised when the pet’s behavior swings wildly between episodes. If your environment has randomness (e.g., sometimes play gives extra happiness), high α can cause the Q-values to chase noise.

Practical workflow: start with α around 0.1 or 0.2 for small toy problems. If learning is painfully slow, increase slightly. If the Q-table oscillates or never settles, decrease α and ensure rewards are not excessively large.

Section 3.4: Future rewards (discount factor in plain terms)

So far, it sounds like we only care about the immediate reward. But good pet behavior often means taking a small short-term cost for a better future. For example: resting now might enable play later, or feeding now prevents a large hunger penalty later. Q-learning handles this by including future rewards in the target.

The target typically is:

target = r + γ × max_a′ Q(s′, a′)

Here, r is the immediate reward you just received, s′ is the next state after your action, and the max term represents “the best expected value from the next situation onward.” The discount factor γ (gamma) is how much the pet cares about the future.

  • γ near 0: the pet is short-sighted (“only today matters”). It will chase immediate rewards even if that causes trouble later.
  • γ near 1: the pet is far-sighted (“long-term wellbeing matters”). It will accept small costs if they lead to better future outcomes.

Plain-language interpretation: γ is like patience. A patient pet learns routines that set it up for success; an impatient pet learns quick fixes.

Engineering judgment: for episodic tasks (your pet has short “days” or episodes), γ is often set around 0.9 to 0.99. But if your rewards include ongoing living costs (like a small negative reward each step to encourage efficiency), a high γ can be helpful to value finishing the episode sooner. A common mistake is γ=1.0 with no safeguards; this can make learning unstable in environments where loops are possible and rewards don’t naturally end.

Practical outcome: once γ is in place, your Q-table no longer reflects just “what feels good now,” but “what tends to lead to a good life for the pet over multiple steps.”

Section 3.5: A worked example update (numbers you can follow)

Let’s do one concrete update so the rule becomes mechanical rather than mysterious. Assume:

  • States: (Hungry,Rested), (Full,Rested), etc.
  • Actions: Feed, Play, Rest
  • α = 0.2 (learning rate)
  • γ = 0.9 (discount factor)

Suppose the pet is in state s = (Hungry,Rested). It chooses action a = Feed. The environment responds:

  • Immediate reward r = +4 (feeding is good when hungry)
  • Next state s′ = (Full,Rested)

Assume the notebook currently has:

  • Q((Hungry,Rested), Feed) = 1.0
  • In the next state, the best action value is max_a′ Q((Full,Rested), a′) = 3.0 (maybe Play is currently valued there)

Compute the target:

target = r + γ × maxQ(next) = 4 + 0.9 × 3.0 = 4 + 2.7 = 6.7

Now update the old Q-value toward the target:

newQ = oldQ + α × (target − oldQ) = 1.0 + 0.2 × (6.7 − 1.0)

newQ = 1.0 + 0.2 × 5.7 = 1.0 + 1.14 = 2.14

Interpretation: the pet previously thought feeding while hungry/rested was only mildly good (1.0). After seeing that feeding leads to a strong immediate reward and a promising next situation, it raises that estimate to 2.14. It did not jump all the way to 6.7 because α was only 0.2—it is learning cautiously.
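The same update can be checked in code. The helper name and constants below are assumptions chosen to match the worked example:

```python
ALPHA, GAMMA = 0.2, 0.9  # learning rate and discount from the example

def q_update(old_q, reward, max_next_q):
    """One Q-learning edit: move old_q a fraction ALPHA toward the target."""
    target = reward + GAMMA * max_next_q
    return old_q + ALPHA * (target - old_q)

new_q = q_update(old_q=1.0, reward=4, max_next_q=3.0)
# target = 4 + 0.9 * 3.0 = 6.7; new_q = 1.0 + 0.2 * 5.7 = 2.14
```

Wrapping the rule in one small function makes it easy to unit-test with hand-computed numbers like these before running full training.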

Run repeated episodes and these values will keep moving. Early on, you’ll see many zeros and small numbers. Later, the table develops “structure”: in hungry states, Feed becomes high; in tired states, Rest becomes high; and the pet begins to make better choices even before it has explored every possible sequence perfectly.

Section 3.6: Keeping learning stable (basic guardrails)

Once your update rule works, the next challenge is preventing your pet from learning brittle or chaotic policies. Stability is not only about math—it’s also about sensible engineering choices: rewards, exploration, episode length, and monitoring.

Guardrail 1: Exploration vs. exploitation. If the pet always picks the current best Q-value, it may get stuck in a “good enough” habit and never discover better routines. Use an ε-greedy policy: with probability ε choose a random action (explore), otherwise choose the best-known action (exploit). Start ε relatively high (e.g., 0.3) and decay it slowly over episodes (e.g., toward 0.05) so the pet explores early and behaves reliably later.

Guardrail 2: Reward scaling and signs. Rewards that are too large can create huge Q-values and unstable learning. Rewards that are mostly negative can also work, but you must be consistent and ensure the pet can still find improvement. A common mistake is giving a big reward for one action (like +100 for Play) and tiny penalties elsewhere; the pet may spam that action regardless of state.

Guardrail 3: Episode design. Define a clear episode boundary (a “day” for the pet). Without boundaries, loops can dominate learning. If episodes are too short, the pet cannot experience long-term consequences; too long, and learning may become slow and noisy.

Guardrail 4: Track progress. Keep a simple per-episode score: total reward. Plot it or print a rolling average. If the average reward improves over time and variance decreases, learning is becoming stable. If it collapses suddenly, suspect a bug in the update, state transitions, or reward function.
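A rolling average is a one-liner worth having from day one. A minimal sketch, assuming you append one total-reward number per episode:

```python
def rolling_average(scores, window=20):
    """Average total reward over the most recent `window` episodes."""
    recent = scores[-window:]
    return sum(recent) / len(recent)

episode_scores = []
# After each episode: episode_scores.append(total_reward)
# then print(rolling_average(episode_scores)) to watch the trend.
```

If this number climbs and then flattens at a high value, learning has likely stabilized; a sudden collapse points to a bug rather than bad luck.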

Checkpoint: when these guardrails are in place, you should be able to print the Q-table after training and see that the highest values align with common sense—feed when hungry, rest when tired, and play when needs are met. That is the moment the virtual pet stops acting like a coin flip and starts acting like it is learning.

Chapter milestones
  • What a Q-value means using everyday language
  • Create a Q-table for the pet world
  • Update the table: learn from one step of experience
  • Run repeated episodes and watch the table change
  • Checkpoint: the pet starts making better choices
Chapter quiz

1. In this chapter, what is a Q-table meant to represent for the virtual pet?

Show answer
Correct answer: A memory of how well different actions tended to work in different states
The chapter frames the Q-table as a simple “learning notebook” that stores what tended to work, not a perfect guarantee.

2. Which description best matches a Q-value in everyday language?

Show answer
Correct answer: A number that estimates how good an action is in a specific situation
A Q-value is a numeric guess of how useful an action is from a given state, based on experience.

3. To create a Q-table for the pet world, what must its rows and columns correspond to?

Show answer
Correct answer: Rows = states, columns = actions, with a number in each state-action cell
A Q-table is organized by state and action so each entry stores a value for a particular state-action choice.

4. What is the purpose of updating the Q-table after one step of experience?

Show answer
Correct answer: To edit the table based on what just happened so the pet becomes less random over time
The chapter emphasizes that the table is edited after every experience, and those edits gradually shape better behavior.

5. Why do repeated episodes help the Q-table “fill in” and the pet make better choices?

Show answer
Correct answer: More experiences provide more chances to update values across different state-action pairs
Running many episodes creates many updates, spreading learned value information through the table and improving decisions.

Chapter 4: Exploration vs. Exploitation (Teaching Good Habits)

In the last chapter, you built a Q-table: a simple memory that estimates how good each action is in each state. That immediately raises a tempting idea: “If we already know the best action, why not always take it?” This chapter is about why that instinct can quietly ruin learning—and how to teach your virtual pet the good habit of trying new things without becoming reckless.

Reinforcement learning is a loop: the agent (your pet brain) observes a state from the environment (hunger, boredom, energy, time-of-day, etc.), chooses an action (eat, sleep, play, train), and receives a reward (good or bad). A Q-table only improves if it sees enough variety: different states and different actions, including the “boring” ones that initially look worse. Exploration is how the agent collects that evidence.

But exploration must be controlled. Too little exploration and the pet gets stuck in a habit loop (like always sleeping because it once got a small reward). Too much exploration and the pet looks random forever, never settling into a reliable routine. The key is balancing exploration (try actions to learn) with exploitation (use what you’ve learned to score well). In practice, you implement this balance with an epsilon-greedy action selection policy and tune a few training settings safely.

This chapter will give you an engineering workflow: (1) implement epsilon-greedy, (2) schedule epsilon over training, (3) keep beginner-safe hyperparameter ranges for learning rate and discount, (4) measure progress across many episodes (not one), and (5) use simple logs to spot bad patterns like loops and persistent randomness. By the end, your checkpoint is consistent improvement across episodes—not perfection, but a trend you can trust.

Practice note for Why “always pick the best known action” can fail: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Epsilon-greedy choices: controlled curiosity: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Tune training settings: learning rate, discount, exploration: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Spot bad learning patterns: loops and random behavior: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Checkpoint: consistent improvement across episodes: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: The exploration problem (kid-in-a-new-playground analogy)

Imagine dropping a kid into a brand-new playground. If they sprint to the first slide they see and then refuse to try anything else, they might miss the swings, the climbing wall, or the hidden tunnel that’s actually the most fun. “Always pick the best known thing” fails because the “best known thing” is based on incomplete experience.

Your virtual pet has the same issue. Early in training, most Q-values start at 0 (or some neutral default). If the pet happens to get a small positive reward from one action early—maybe “sleep” reduces hunger slightly in your simplified rules—it may repeatedly exploit that action. The Q-table then gets lots of updates for sleep, and almost none for play, eat, or train. The pet looks consistent, but it’s consistently under-trained.

This is how bad habits form in RL: the agent over-commits before it has evidence. You’ll commonly see two failure modes:

  • Premature exploitation: the pet repeats a mediocre action because it got lucky early.
  • Local optimum loops: the pet cycles between two actions/states that provide “okay” reward but block access to higher reward paths (for example, “eat → sleep → eat → sleep” while never playing, so happiness stays low).

Exploration is your mechanism to make the agent behave like the curious kid: try multiple pieces of equipment long enough to learn what’s truly best. The tricky part is doing this without turning training into chaos. That’s exactly what epsilon-greedy gives you: curiosity with a dial.

Section 4.2: Epsilon-greedy step by step

Epsilon-greedy is a simple rule for choosing actions: with probability ε (epsilon), explore by picking a random action; with probability 1 − ε, exploit by picking the action with the highest Q-value in the current state. It’s popular because it’s easy to implement and easy to reason about.

Here is the step-by-step workflow you should follow each time the pet must act:

  • Step 1: Observe state (e.g., hunger=high, energy=low, boredom=medium).
  • Step 2: Generate a random number r in [0, 1).
  • Step 3: If r < ε, choose a random action (explore).
  • Step 4: Else, choose argmax_a Q(state, a) (exploit).
  • Step 5: Apply action in the environment, get reward and next state.
  • Step 6: Update Q-table using your Q-learning update rule, then continue.

Two practical details matter a lot for beginners. First, define how you break ties when multiple actions share the same max Q-value. If you always pick the first action, you accidentally bias behavior and reduce exploration even when ε is low. A common fix is: if several actions are tied for max, pick randomly among the tied actions.
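Steps 2-4 plus the random tie-break fit in a few lines. A sketch, assuming the Q-values for the current state arrive as a list with one entry per action:

```python
import random

def epsilon_greedy(q_row, epsilon):
    """q_row: Q-values for the current state, one per action. Returns an action id."""
    if random.random() < epsilon:                 # Step 3: explore
        return random.randrange(len(q_row))
    best = max(q_row)                             # Step 4: exploit
    tied = [a for a, q in enumerate(q_row) if q == best]
    return random.choice(tied)                    # break ties randomly, not by index
```

Breaking ties with random.choice matters most early in training, when every entry is still 0.0 and all actions are tied.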

Second, ensure “random action” really means uniform over valid actions in that state. If some actions are invalid (e.g., “eat” when no food is available), either remove them from the action set for that state or assign a clear negative reward. Silent invalid actions can create misleading Q-values and make the agent look random for the wrong reason.

Epsilon-greedy won’t magically solve everything, but it gives you controlled curiosity. The pet still learns from rewards, but it also gathers enough diverse experience to stop confusing luck with skill.

Section 4.3: Scheduling epsilon (start curious, end focused)

Using a fixed epsilon is workable, but it’s rarely ideal. Early training should be adventurous because the Q-table is mostly empty. Later training should be focused because you want consistent behavior. This motivates an epsilon schedule: start with higher exploration, then reduce it over episodes.

A practical mental model: in episodes 1–50, you want the pet to “try everything and see what happens.” In episodes 200–500, you want it to “mostly do what works, occasionally double-check.” The schedule is how you encode that shift.

Common schedules that are beginner-friendly:

  • Linear decay: ε decreases by a fixed amount each episode until it hits a minimum. Easy to debug.
  • Exponential decay: ε = max(ε_min, ε_start * decay^episode). Often smooth and stable.
  • Step decay: reduce ε in chunks (e.g., drop after every 50 episodes). Good when you want predictable phases.
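The three schedules above can be sketched as small functions. Parameter names are illustrative; all three clip at a floor so epsilon never reaches zero.

```python
def linear_decay(eps_start, eps_min, decay_per_episode, episode):
    """Linear: subtract a fixed amount each episode, clipped at the floor."""
    return max(eps_min, eps_start - decay_per_episode * episode)

def exponential_decay(eps_start, eps_min, decay, episode):
    """Exponential: multiply by `decay` once per episode, clipped at the floor."""
    return max(eps_min, eps_start * decay ** episode)

def step_decay(eps_start, eps_min, drop, every, episode):
    """Step: drop by `drop` once per `every` episodes, clipped at the floor."""
    return max(eps_min, eps_start - drop * (episode // every))
```

A usage example: `linear_decay(0.5, 0.05, 0.01, episode)` starts at 0.5 and reaches the 0.05 floor after 45 episodes.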

Keep a non-zero floor, like ε_min = 0.01 or 0.05. Why? Because environments can be stochastic, and your learned policy can drift into a brittle pattern that only works under certain random outcomes. A tiny amount of exploration acts like ongoing quality assurance.

Engineering judgment: decay epsilon based on evidence, not superstition. If your episode rewards are still volatile and trending upward, you may be decaying too quickly. If rewards plateau early and the pet repeats a loop, you may be decaying too slowly or your reward design may be encouraging the wrong habit. A useful checkpoint is to compare behavior at ε=0.3 versus ε=0.05: the first should look curious; the second should look purposeful.

Section 4.4: How hyperparameters change behavior (beginner safe ranges)

In Q-learning, three settings strongly shape behavior: learning rate (α), discount factor (γ), and exploration (ε). If your pet seems stuck, random, or overly cautious, these are the first knobs to check—after confirming your state and reward definitions make sense.

Learning rate α (alpha) controls how quickly new experiences overwrite old beliefs. High α makes the pet adapt quickly but can cause wobbling when rewards are noisy; low α makes learning stable but slow. Beginner-safe range: 0.1 to 0.5. If your environment is very noisy, lean lower (0.1–0.2). If it’s deterministic and simple, you can use 0.3–0.5.

Discount factor γ (gamma) controls how much future rewards matter. A pet that should plan ahead (eat now to have energy to play later) needs a fairly high γ. Beginner-safe range: 0.85 to 0.99. Lower γ makes the pet short-sighted, often chasing immediate small rewards and forming loops. Higher γ encourages long-term routines but can slow learning if rewards are sparse.

Exploration ε (epsilon) controls trial behavior, not memory updates. Typical starting ε: 0.2 to 0.5. Typical minimum ε: 0.01 to 0.05. If the pet looks random late in training, your ε may be too high (or your state space is too large for the number of episodes). If the pet locks into a habit too early, ε may be too low or decaying too fast.

Common beginner mistake: changing all three at once. Instead, change one parameter, run multiple training runs (next section), and compare average episode reward curves. This isolates cause and effect. Another mistake is using extreme rewards to “force” behavior; that can create brittle strategies and unstable Q-values. Prefer moderate, well-shaped rewards and adjust hyperparameters gradually.
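For reference, here is how α and γ enter the standard Q-learning update, as a sketch using a dict-based Q-table and defaults drawn from the safe ranges above:

```python
def q_update(q_table, state, action, reward, next_state, actions,
             alpha=0.2, gamma=0.95):
    """One Q-learning update step.

    alpha scales how far the old estimate moves toward the new target;
    gamma discounts the best next-state value. Defaults sit inside the
    beginner-safe ranges discussed above.
    """
    old_q = q_table.get((state, action), 0.0)
    best_next = max(q_table.get((next_state, a), 0.0) for a in actions)
    target = reward + gamma * best_next          # immediate + discounted future
    q_table[(state, action)] = old_q + alpha * (target - old_q)
    return q_table[(state, action)]
```

Notice that ε never appears here: exploration only affects which (state, action) pairs get updated, not how the update itself works.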

Section 4.5: Handling randomness (repeat runs and averages)

Reinforcement learning is noisy by nature. Even with the same code, two training runs can diverge because early random actions lead to different experiences, which produce different Q-values, which produce different future actions. If you judge progress from a single run, you can be fooled by luck—good or bad.

To make conclusions you can trust, adopt a simple evaluation habit: repeat runs and average results. For example, train the same setup with 5–20 different random seeds. Track the episode reward for each run, then compute the mean (and ideally the standard deviation). This tells you whether your improvement is consistent or fragile.

Also use moving averages within a run. Episode rewards often bounce because exploration intentionally injects “bad” actions. A 20-episode moving average smooths that noise so you can see the trend. When your chapter checkpoint says “consistent improvement across episodes,” it usually means the moving average rises over time, not that every single episode is better than the last.
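A moving average takes only a few lines of Python. This sketch shortens the window at the start of a run instead of waiting for `window` episodes to accumulate:

```python
def moving_average(values, window=20):
    """Rolling mean over the last `window` values (shorter at the start)."""
    out = []
    for i in range(len(values)):
        # Use whatever history exists until a full window is available.
        chunk = values[i - window + 1 : i + 1] if i >= window - 1 else values[: i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```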

Two practical patterns to watch for:

  • Looping with stable mediocre reward: the curve plateaus early and stays flat. This often indicates premature exploitation, poor reward shaping, or too-low γ.
  • Persistent randomness: the curve never rises and variance stays high. This often indicates ε staying high, α too high (unstable learning), or states that are too detailed for the amount of training.

When you compare settings, keep evaluation separate from training. A clean method is: train with epsilon-greedy, then run a short “test” phase with ε=0 (pure exploitation) to see what policy the pet actually learned. This prevents exploration noise from hiding whether the Q-table is improving.
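A sketch of that separate test phase, assuming an environment with `reset()` returning a state and `step(action)` returning `(next_state, reward, done)` (an illustrative interface, not a fixed API):

```python
def evaluate_policy(env, q_table, actions, episodes=20, max_steps=200):
    """Run pure-exploitation episodes (epsilon = 0) and return mean episode reward."""
    totals = []
    for _ in range(episodes):
        state = env.reset()
        total = 0.0
        for _ in range(max_steps):
            # Greedy choice only: no exploration noise during evaluation.
            state_q = {a: q_table.get((state, a), 0.0) for a in actions}
            action = max(state_q, key=state_q.get)
            state, reward, done = env.step(action)
            total += reward
            if done:
                break
        totals.append(total)
    return sum(totals) / len(totals)
```

Because no Q-updates happen here, evaluation measures what the pet has learned without changing it.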

Section 4.6: Debugging with simple logs (what to print and why)

When behavior looks wrong, resist the urge to guess. Add a few lightweight logs that tell you what the agent believed and why it acted. You don’t need fancy dashboards; a consistent text log can reveal loops, mis-specified rewards, and broken exploration in minutes.

At minimum, print or record these fields for a small sample of episodes (for example, the first 2 episodes, then every 50th):

  • episode, step (to locate where problems happen)
  • state (the exact state key used in the Q-table)
  • epsilon (so you know how exploratory the agent was)
  • action chosen and whether it was explore or exploit
  • reward and next_state
  • Q(s,a) before and after update
  • best_action and max Q(s,·) (to see what the agent thinks is best)
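One way to keep those fields consistent is a single helper that formats every step the same way. The field names are illustrative; the point is that every log line has the same shape, so loops and stuck Q-values jump out visually.

```python
def log_step(episode, step, state, epsilon, action, explored,
             reward, next_state, q_before, q_after, best_action, best_q):
    """Format and print one consistent log line per step.

    Call this only for a sample of episodes (e.g., the first 2, then every
    50th) to keep training fast.
    """
    line = (f"ep={episode} step={step} state={state} eps={epsilon:.2f} "
            f"action={action} ({'explore' if explored else 'exploit'}) "
            f"reward={reward:+.2f} next={next_state} "
            f"Q={q_before:.3f}->{q_after:.3f} best={best_action} maxQ={best_q:.3f}")
    print(line)
    return line
```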

With that information, you can diagnose specific bad learning patterns. If the pet is stuck in a loop, the log will show repeated state-action pairs and often a reward that is accidentally positive (or not negative enough) for the looping behavior. If the pet looks random late in training, check whether epsilon is actually decaying, whether tie-breaking is biasing choices, and whether Q-values are staying near zero (a sign the pet isn’t receiving informative rewards).

A practical debugging routine: pick one problematic episode, replay it with logs at every step, and ask three questions: (1) Was the action chosen due to exploration or exploitation? (2) Did the reward match your intended “good habit” rules? (3) Did the Q-value update move in the expected direction? If any answer surprises you, you’ve found where to fix the environment rules, reward shaping, or training settings. This is how you turn “my pet is weird” into a concrete, correctable engineering issue.

Chapter milestones
  • Why “always pick the best known action” can fail
  • Epsilon-greedy choices: controlled curiosity
  • Tune training settings: learning rate, discount, exploration
  • Spot bad learning patterns: loops and random behavior
  • Checkpoint: consistent improvement across episodes
Chapter quiz

1. Why can “always pick the best known action” quietly ruin learning in a Q-table agent?

Show answer
Correct answer: It reduces exploration, so the agent doesn’t gather enough evidence about other actions and states
A Q-table improves by seeing varied state–action outcomes; always exploiting can lock in a mediocre habit before better options are tried.

2. In an epsilon-greedy policy, what does increasing epsilon primarily do?

Show answer
Correct answer: Makes the agent choose random actions more often to explore
Epsilon controls the exploration rate: higher epsilon means more random choices to gather information.

3. Which training outcome best signals too little exploration?

Show answer
Correct answer: The pet gets stuck repeating a small-reward habit loop (e.g., always sleeping)
Too little exploration can trap the agent in a loop based on early, limited experience.

4. Which workflow best matches the chapter’s recommended approach to balancing exploration and exploitation?

Show answer
Correct answer: Implement epsilon-greedy, schedule epsilon over training, keep safe hyperparameter ranges, and use logs to spot loops or randomness
The chapter emphasizes controlled exploration via epsilon-greedy, tuning, and diagnosing behavior with episode-level trends and logs.

5. What is the chapter’s checkpoint for success during training?

Show answer
Correct answer: Consistent improvement across many episodes (a trustworthy trend, not perfection)
Progress should be judged by a stable improvement trend across episodes rather than a single run or instant perfection.

Chapter 5: Measure Progress and Improve the Rewards

Up to now, your virtual pet has been learning through trial and error: it takes an action, the environment changes, and it gets a reward (or penalty). That loop is the heart of reinforcement learning—but it is also where beginners get misled. A pet can “look” smarter in a few episodes just because it got lucky. Or it can truly improve while still having occasional bad days. In this chapter you will add simple, reliable ways to measure progress, and you will refine the reward rules so the agent learns faster without learning the wrong thing.

The workflow is: (1) define success metrics that match your real goal (like average score and survival time), (2) visualize training with basic charts and moving averages, (3) adjust rewards carefully (reward shaping) to guide learning, (4) watch for reward hacking—when the agent exploits your rules rather than solving the intended task—and (5) make the environment slightly harder in a controlled way to create a stronger, more reliable pet agent.

Remember: when you change rewards or states you are not “fixing the agent,” you are rewriting the problem. That’s normal. The skill is doing it safely, making one change at a time, and measuring whether the change really helps.

Practice note for Define success metrics: average score and survival time: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Visualize learning: simple charts and moving averages: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Reward shaping: guide the pet without “cheating”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Prevent reward hacking: when the pet exploits your rules: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Checkpoint: a stronger, more reliable pet agent: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Sections in this chapter
Section 5.1: What to measure (and what not to trust)

Before you tune anything, decide what “good” means. For a virtual pet, two beginner-friendly success metrics are average score per episode and survival time (how many steps the pet stays alive or above “game over” thresholds). Score summarizes the reward your rules produce; survival time captures whether the pet avoids catastrophic states. If your pet can survive for a long time but earns low score, it may be playing too safely. If score is high but survival is low, it may be taking risky actions for short-term reward.

Track metrics at the episode level: after each episode ends, log total reward (episode return) and steps survived. Then compute an average over many episodes. A single episode is noisy; reinforcement learning has randomness from exploration, environment dynamics, and initial state. Trust trends, not individual wins.

  • Good to measure: average return, median return (more robust than mean), survival steps, % episodes that reach a “healthy” target, and the distribution (min/max) of returns.
  • Easy to misread: the last episode only, the best episode ever, and training-time reward without comparing to evaluation-time reward.
  • Engineering habit: separate training runs (with exploration on) from evaluation runs (epsilon near 0) so you can tell whether the Q-table actually learned a stable policy.
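These episode-level metrics are straightforward to compute with the standard library. The `healthy_target` threshold is an illustrative assumption; set it to whatever return you consider a "healthy" episode.

```python
from statistics import mean, median

def summarize_runs(episode_returns, survival_steps, healthy_target=0.0):
    """Episode-level summary: average/median return, spread, survival, % healthy."""
    return {
        "avg_return": mean(episode_returns),
        "median_return": median(episode_returns),   # more robust than the mean
        "min_return": min(episode_returns),
        "max_return": max(episode_returns),
        "avg_survival": mean(survival_steps),
        "pct_healthy": 100.0 * sum(r >= healthy_target for r in episode_returns)
                       / len(episode_returns),
    }
```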

A common mistake is to optimize a metric that is too close to your current reward definition. If your reward gives +1 for “eating” and -1 for “being hungry,” the pet may learn to trigger “eating” repeatedly even if it ignores sleep. Survival time helps you detect these imbalances. Another mistake is changing two things at once: if you adjust rewards and also change epsilon decay, you won’t know which change caused the improvement.

Section 5.2: Training curves (reading them like a story)

Once you log episode return and survival steps, plot them. The simplest visualization is a line chart with episode number on the x-axis and total reward on the y-axis. For beginners, the key technique is a moving average, such as a rolling mean over the last 50 or 100 episodes. The raw curve will look jagged because exploration forces occasional “bad choices.” The moving average tells the real story: is the pet improving overall?

Read training curves like a narrative with chapters:

  • Early phase: low and chaotic returns. The pet is mostly exploring. Survival time might be short and variable.
  • Learning phase: moving average rises. Variance may remain high, but the baseline improves.
  • Plateau: the curve flattens. This can mean the problem is solved, or it can mean learning is stuck due to weak state representation, poor rewards, or not enough exploration.
  • Collapse: performance drops after improving. Often caused by too aggressive learning rate, changing environment rules mid-run, or a reward hack that later backfires.

Use two charts: one for average return and one for survival time. If return improves but survival does not, your reward is likely pushing “point scoring” rather than robustness. If survival improves but return does not, the agent may be avoiding both good and bad outcomes. That’s a cue to revisit reward balance.

Practical tip: plot evaluation curves every N episodes. For example, after every 200 training episodes, run 20 evaluation episodes with epsilon set to 0.01 (almost greedy). Plot the evaluation average separately. This prevents you from confusing “learning” with “exploration noise.”

Section 5.3: Reward shaping basics (small hints vs. big bribes)

Reward shaping means adding extra reward signals to guide the agent toward good behavior sooner. Done well, shaping is like giving hints. Done poorly, it is like bribing the pet to do something unrelated to the real goal.

A safe mindset is: keep the main objective reward intact (for example, +10 for staying healthy at episode end, or -10 for “game over”), and add small, local shaping rewards that point in the same direction. For a virtual pet, examples include: a small penalty each step hunger is high, a small bonus when hunger decreases, or a small penalty when energy hits a critical threshold. These make learning less “blind” than only rewarding at the end.

  • Small hints: -0.1 per step of critical hunger; +0.2 when hunger moves from “high” to “medium.”
  • Big bribes (usually risky): +5 every time the pet eats, regardless of context; +3 every time it sleeps, even if it is already fully rested.

Why are big bribes risky? Because Q-learning maximizes expected return; it will gladly farm a repeatable reward loop even if the pet’s overall wellbeing does not improve. Keep shaping rewards smaller than your terminal success/failure rewards so the agent still cares most about survival and long-term health.

Engineering judgment: introduce shaping one piece at a time, and watch how the training curves change. If learning gets faster but evaluation performance gets worse, you may have created a shortcut that only works during exploration or only works under specific random conditions.

Section 5.4: Sparse vs. dense rewards (trade-offs for beginners)

Reward designs fall on a spectrum from sparse to dense. Sparse rewards are rare: the pet might get 0 most steps, then +10 at the end if it survives, and -10 if it fails. Dense rewards give frequent feedback: small bonuses and penalties almost every step.

Sparse rewards are conceptually clean and harder to exploit, because there are fewer “reward tokens” to game. The downside is that learning can be slow: the agent must stumble into good long-term behavior before it receives any signal that it was good. Beginners often interpret this slow progress as “my Q-table is broken,” when the real issue is that the agent rarely experiences success.

Dense rewards speed up learning by giving gradient-like hints. But they increase the risk of teaching the wrong lesson. If you reward “eating” frequently, the pet may overeat. If you reward “moving” to explore, it may pace endlessly even when it should rest.

A practical compromise for a first virtual pet is a hybrid:

  • Keep a clear terminal signal: big negative for failure, big positive for a healthy episode outcome.
  • Add a few small dense terms tied to state quality (hunger, energy, mood) and state changes (improving vs. worsening), not just actions.
  • Include a tiny per-step penalty (like -0.01) to encourage efficiency and prevent stalling behaviors that “wait out” the episode.
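A sketch of such a hybrid reward, assuming states are small dicts with bucketed hunger levels (the encoding 0 = low, 2 = high, and every number here, are illustrative and tunable):

```python
def pet_reward(state, next_state, done, survived):
    """Hybrid reward: clear terminal signal + small dense terms + tiny step cost."""
    if done:
        # Big terminal signal dominates total return.
        return 10.0 if survived else -10.0
    reward = -0.01                          # tiny per-step cost against stalling
    if next_state["hunger"] == 2:
        reward -= 0.1                       # small penalty for critical hunger
    if next_state["hunger"] < state["hunger"]:
        reward += 0.2                       # small bonus for an improving state
    return reward
```

Note that the dense terms are tied to state quality and state changes, not to the "eat" action itself, which keeps the action-farming hack discussed in the next section less attractive.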

When you move from sparse to dense rewards, expect the average score chart to change scale. This is normal. Compare policies by evaluation behavior and survival time, not only by the raw total reward number.

Section 5.5: Reward hacking examples and fixes

Reward hacking happens when the agent finds a way to earn reward that violates your intent. It is not “cheating” in a human sense; it is exactly what you asked for mathematically. Your job is to notice it early and adjust the environment rules or rewards.

Common virtual-pet reward hacks:

  • Action farming: you give +1 for “eat,” so the pet eats every step, even when not hungry.
  • Oscillation: you reward improvement (hunger decreasing), so the pet intentionally gets hungry, then eats, repeating the cycle to harvest improvement rewards.
  • Stalling: you reward survival time but have no per-step cost, so the pet enters a safe low-reward loop and never pursues higher wellbeing.
  • Boundary abuse: if states are bucketed (e.g., hunger “medium” vs “high”), the pet hovers at the boundary to trigger repeated “improvement” events due to noise.

Fixes should be targeted and minimal:

  • Reward state quality (being in a healthy range) more than actions (pressing “eat”). For example, reward “hunger is low” rather than “took eat action.”
  • Use cooldowns or diminishing returns: eating twice in a row gives less benefit, or is disallowed for a few steps.
  • Add a small step cost so infinite loops are not attractive.
  • Cap shaping rewards so one behavior cannot dominate total return (for example, max +2 shaping per episode).
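For example, diminishing returns on repeated eating can be a tiny helper. The names and numbers are illustrative, not from the course code:

```python
def eating_bonus(steps_since_last_eat, base_bonus=0.5, cooldown=3):
    """Diminishing returns: eating again within the cooldown earns less.

    Returns the full bonus once `cooldown` steps have passed; otherwise
    scales it down in proportion to how soon the action is repeated.
    """
    if steps_since_last_eat >= cooldown:
        return base_bonus                  # full benefit after the cooldown
    return base_bonus * steps_since_last_eat / cooldown
```

This makes the "eat every step" loop strictly less rewarding than spaced-out eating without banning the action outright.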

After a fix, rerun evaluation episodes. A good sign is when survival time and evaluation return improve together, and the pet’s behavior looks stable across many random starts—not just one lucky scenario.

Section 5.6: Making the environment slightly harder safely

Once the agent is improving, it is tempting to “upgrade” the world dramatically: more states, more actions, more randomness. But big changes can erase what you learned about debugging and can destabilize training. Instead, make the environment slightly harder in controlled steps, and use checkpoints to keep a stronger, more reliable pet agent.

Safe ways to increase difficulty:

  • Add mild randomness: hunger decreases at slightly variable rates, or food sometimes restores a bit less energy. This tests robustness.
  • Start-state randomization: begin episodes with different hunger/energy levels so the Q-table learns recovery skills, not just maintenance.
  • Longer episodes: increase max steps so the agent must manage resources over time, not just sprint to the end.
  • Small new constraint: add a “cost” to actions (e.g., playing reduces energy) so trade-offs matter.

Do this with a checkpoint routine: save your Q-table and training settings when evaluation performance reaches a new best. Then introduce one environment change, retrain, and compare against the checkpoint using the same evaluation protocol. If performance drops, you can roll back and try a smaller change.

Also consider training settings as part of “safe difficulty.” When the environment becomes harder, you may need slower epsilon decay (more exploration), a slightly lower learning rate for stability, or more episodes. The goal is not just a higher chart—it is a pet that behaves well across varied situations. That is the checkpoint for this chapter: reliable improvement you can measure, explain, and reproduce.

Chapter milestones
  • Define success metrics: average score and survival time
  • Visualize learning: simple charts and moving averages
  • Reward shaping: guide the pet without “cheating”
  • Prevent reward hacking: when the pet exploits your rules
  • Checkpoint: a stronger, more reliable pet agent
Chapter quiz

1. Why does Chapter 5 emphasize using metrics like average score and survival time instead of judging progress from a few episodes?

Show answer
Correct answer: Because a few episodes can look good due to luck, while metrics show more reliable trends
The chapter warns that short-term performance can be misleading; success metrics provide a more dependable view of learning progress.

2. What is the main purpose of using moving averages when visualizing training?

Show answer
Correct answer: To smooth noisy results so the overall learning trend is easier to see
Moving averages reduce short-term fluctuations, making it easier to interpret whether learning is improving.

3. Which statement best describes reward shaping as presented in this chapter?

Show answer
Correct answer: Adjusting rewards carefully to guide learning faster without teaching the wrong behavior
Reward shaping is about refining reward rules to guide learning while still aligning with the intended task.

4. What is reward hacking in the context of this chapter?

Show answer
Correct answer: The agent exploits the reward rules to get high reward without actually solving the intended task
Reward hacking happens when the agent finds loopholes in your reward design rather than learning the desired behavior.

5. According to the chapter, what is the safest way to improve the system when you change rewards or states?

Show answer
Correct answer: Make one change at a time and measure whether it truly helps
The chapter frames reward/state changes as rewriting the problem and recommends controlled, measurable adjustments.

Chapter 6: Package Your First RL Project (and What’s Next)

Up to this point, you’ve trained a virtual pet by iterating quickly: change a reward, rerun training, glance at scores, repeat. That is exactly how learning RL feels at the beginning. Now you’ll turn that notebook-style work into a small, clean project you can run again, share with others, and extend safely.

This chapter focuses on engineering judgment: separating responsibilities (environment vs. agent vs. training loop), making results reproducible, saving what the pet learned, and running a fair evaluation that distinguishes “trained behavior” from “tested behavior.” You’ll also look ahead: how to scale your pet to bigger worlds without getting overwhelmed, and how to recognize when Q-learning with a Q-table is reaching its limits.

By the end, you should be able to present a simple demo: start from scratch, train, save, reload, test, and show a small chart or printed metrics that prove your pet improved.

Practice note for Refactor into a clean project: environment, agent, training loop: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Save and load what the pet learned: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Run a final evaluation: training vs. testing: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Next steps: bigger worlds and deeper methods (no overwhelm): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Final checkpoint: present your virtual pet RL demo: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Sections in this chapter
Section 6.1: Clean structure (separating concerns simply)

The fastest way to make an RL project painful is to mix everything together: environment rules, Q-table updates, epsilon-greedy action selection, plotting, and printing all in one file. It works once, then becomes hard to trust and harder to extend. The goal is not “enterprise architecture.” The goal is to separate concerns so you can change one part without breaking the rest.

A clean beginner structure usually has three pieces:

  • Environment: defines states, available actions, transition rules, rewards, and when an episode ends.
  • Agent: holds the Q-table and the learning rule (update step), plus action selection (exploration vs. exploitation).
  • Training loop: runs episodes, logs scores, manages epsilon schedule, and decides when to save.

Practically, this can be three files: env.py, agent.py, train.py. Your environment should expose methods like reset() (return initial state) and step(action) (return next_state, reward, done, info). Keep rendering (printing the pet mood, hunger, etc.) optional; it’s helpful for demos but slows training.

Your agent should not know “pet rules.” It should only know how to map (state, action) to Q-values and update them: Q[s,a] = Q[s,a] + alpha * (reward + gamma * max_a' Q[s',a'] - Q[s,a]). A common mistake is letting the agent call environment internals directly (“if hunger high then…”). That turns learning into hard-coded behavior and hides bugs.

In the training loop, keep a clear episode boundary: reset, iterate steps until done, record total reward. This makes it straightforward to add saving/loading later and to compare training vs. testing without accidental differences.
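A training loop with that clean episode boundary might look like the sketch below. It assumes only the interfaces described above — an environment with reset()/step() and an agent with choose()/update() and an epsilon attribute; the epsilon-schedule parameters are illustrative:

```python
def train(env, agent, episodes=200, eps_start=1.0, eps_end=0.05, eps_decay=0.99):
    """Run episodes with a clear reset -> step-until-done -> record boundary."""
    history = []  # per-episode total reward, for logging and plotting
    agent.epsilon = eps_start
    for _ in range(episodes):
        state = env.reset()
        total, done = 0.0, False
        while not done:
            action = agent.choose(state)
            next_state, reward, done, _ = env.step(action)
            agent.update(state, action, reward, next_state, done)
            state = next_state
            total += reward
        history.append(total)
        agent.epsilon = max(eps_end, agent.epsilon * eps_decay)  # decay exploration
    return history
```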

Section 6.2: Reproducibility (seeds and consistent results)


RL is noisy. Two runs with the “same” code can learn different behaviors because randomness changes which experiences the agent sees early on. Beginners often interpret this as “my code is broken,” when the real issue is that the experiment isn’t controlled. Reproducibility doesn’t remove randomness—it makes it manageable.

Start by choosing a single seed and applying it everywhere you use randomness:

  • Your language RNG (e.g., Python’s random.seed)
  • Numerical RNG (e.g., numpy.random.seed)
  • Environment RNG (e.g., a dedicated rng = np.random.default_rng(seed) stored inside the environment)
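Assuming a Python project with NumPy installed, those three bullets might be wrapped in one helper (the function name seed_everything is my own convention):

```python
import random

import numpy as np

def seed_everything(seed):
    """Seed every source of randomness used in the project."""
    random.seed(seed)                    # Python's built-in RNG
    np.random.seed(seed)                 # legacy NumPy global RNG
    return np.random.default_rng(seed)   # dedicated RNG to store in the environment

rng = seed_everything(42)
print("run seed: 42")  # print the seed at startup so every run is traceable
```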

Then store the seed in your run configuration and print it at startup. If you later discover a bug, you can rerun the exact same experience stream to confirm the fix. This is especially important when you change rewards or states: without fixed seeds, you can’t tell whether performance changed because of your design choice or because the run “got lucky.”

Next, make training settings explicit: number of episodes, max steps per episode, learning rate (alpha), discount (gamma), exploration schedule (starting epsilon, ending epsilon, decay). Put these in a small config dict or a simple JSON file. A common mistake is tweaking constants scattered across files, then forgetting what produced a “good run.”
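One simple way to keep every knob in one place is a single dict that can round-trip through JSON; the values below are illustrative defaults, not recommendations:

```python
import json

CONFIG = {
    "seed": 42,
    "episodes": 500,
    "max_steps": 50,
    "alpha": 0.1,           # learning rate
    "gamma": 0.9,           # discount factor
    "epsilon_start": 1.0,   # exploration schedule
    "epsilon_end": 0.05,
    "epsilon_decay": 0.995,
}

def save_config(path, config=CONFIG):
    with open(path, "w") as f:
        json.dump(config, f, indent=2)

def load_config(path):
    with open(path) as f:
        return json.load(f)
```

Saving the config next to each run's results is what lets you answer, weeks later, exactly which settings produced a “good run.”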

Finally, log results consistently. At minimum, store per-episode total reward and length. If you chart anything, chart the same thing each run (for example, a moving average of total reward). Reproducibility is not bureaucracy—it’s what lets you learn from your own experiments instead of guessing.
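A moving average of per-episode reward needs no plotting library; a minimal sketch:

```python
def moving_average(values, window=20):
    """Trailing moving average for smoothing noisy per-episode rewards."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1) : i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```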

Section 6.3: Evaluation episodes (fair testing rules)


Training metrics can lie. During training, the agent is exploring (taking random actions some of the time), and the environment might also be randomized. If you only look at training rewards, you can’t tell whether the learned policy is actually good or just occasionally lucky.

A fair workflow is to split your runs into two modes:

  • Training: epsilon-greedy with exploration enabled; Q-table updates occur.
  • Testing (evaluation): exploration disabled (epsilon = 0, or “always exploit”); Q-table is frozen (no updates).

Evaluation should be done over multiple episodes (for example, 20–100) and summarized with an average reward and a spread (min/max or standard deviation). One episode is not evidence. Also keep evaluation rules consistent: same max steps, same environment difficulty settings, and ideally a different seed from training so you measure general behavior, not memorization of one random sequence.
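An evaluation helper can enforce those rules by construction: it takes a frozen policy (any callable mapping state to action) and never performs updates. This is a sketch under that assumption:

```python
import statistics

def evaluate(env, policy, episodes=50):
    """Run a frozen policy: no exploration, no Q-table updates."""
    totals = []
    for _ in range(episodes):
        state = env.reset()
        total, done = 0.0, False
        while not done:
            state, reward, done, _ = env.step(policy(state))  # greedy action only
            total += reward
        totals.append(total)
    return {
        "mean": statistics.mean(totals),
        "stdev": statistics.stdev(totals) if len(totals) > 1 else 0.0,
        "min": min(totals),
        "max": max(totals),
    }
```

Because the agent's update method is never passed in, it is impossible to keep learning during “testing” by accident.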

This is also where saving and loading matters. A strong check is: train, save the Q-table to disk, start a fresh process, load it, and then evaluate. If performance collapses, you likely depended on some hidden state in memory (for example, forgetting that you were still updating Q-values during “testing”).
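Saving a Q-table to JSON has one wrinkle: JSON keys must be strings, so (state, action) tuple keys can't be dumped directly. One workaround, assuming states and actions are themselves JSON-serializable (ints, strings), is to store entries as a list:

```python
import json

def save_q(path, q_table):
    """Serialize a {(state, action): value} dict as a JSON list of triples."""
    entries = [[s, a, v] for (s, a), v in q_table.items()]
    with open(path, "w") as f:
        json.dump(entries, f)

def load_q(path):
    """Rebuild the {(state, action): value} dict from disk."""
    with open(path) as f:
        return {(s, a): v for s, a, v in json.load(f)}
```

Note that composite states (e.g. tuples of bins) come back from JSON as lists, so you would need to convert them back to tuples on load for dict keys to match.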

Common mistakes include evaluating with epsilon still > 0 (“my agent is worse than before!”—it’s just exploring), comparing runs with different episode lengths, or changing reward definitions between training and testing. Treat evaluation like a small scientific experiment: fixed rules, frozen policy, repeated trials.

Section 6.4: Common extensions (more states, more actions)


Once your project is structured, extending it becomes safer. The virtual pet is a great sandbox because small changes immediately create new learning challenges. The trick is to extend in a controlled way so you can still diagnose what changed.

Typical extensions fall into three categories:

  • More states: add “energy,” “cleanliness,” “boredom,” or “time of day.” If your state is discrete, define bins (e.g., hunger 0–2 instead of 0–100). This keeps the Q-table manageable.
  • More actions: add play, sleep, clean, pet, or explore. Ensure each action has a clear intended effect and a plausible cost (time, energy, risk).
  • Richer rewards: reward “healthy” behavior (balanced hunger/energy), not just a single goal. Add small penalties for neglect (e.g., hunger too high) and small step costs to discourage endless loops.
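The binning idea from the first bullet can be sketched in a few lines; the threshold values and the two-variable state are illustrative:

```python
def bin_value(value, thresholds):
    """Map a raw 0-100 value to a coarse bin index (0, 1, 2, ...)."""
    return sum(value > t for t in thresholds)

# Hunger/energy 0-100 collapsed into 3 bins: low (0), medium (1), high (2)
BINS = (33, 66)

def discretize(hunger, energy):
    """Combine per-variable bins into one hashable state for the Q-table."""
    return (bin_value(hunger, BINS), bin_value(energy, BINS))
```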

Engineering judgment: change one thing at a time. If you add two new state variables and three new actions and rewrite rewards, you won’t know what caused learning to improve or fail. A practical pattern is: add one feature, rerun training with the same seeds and settings, then evaluate. If learning breaks, roll back and simplify.

Also watch for reward loopholes. If “sleep” gives a positive reward and has no downside, the agent may sleep forever. A simple fix is a step penalty (e.g., -0.01 each step) or making actions trade off (sleep restores energy but increases hunger). Your environment should model trade-offs so the agent must learn priorities instead of exploiting a free reward.

Section 6.5: When Q-tables stop working (the “too many states” issue)


Q-tables are perfect for learning the basics because they are transparent: you can print values and see what the agent “believes.” But they have a hard limit: the table grows with #states × #actions. If your pet state includes multiple variables with many possible values, the number of combinations explodes.

For example, suppose you discretize: hunger (10 bins), energy (10), cleanliness (10), mood (10), and location (20). That’s 10×10×10×10×20 = 200,000 states. With 8 actions, your Q-table has 1.6 million values. That’s not impossible in memory, but it becomes difficult to learn because the agent must visit many state-action pairs repeatedly to get reliable estimates.
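The arithmetic is worth automating so you can sanity-check a state design before committing to it:

```python
from math import prod

def q_table_size(bins_per_variable, n_actions):
    """Number of Q-values: product of per-variable bin counts times actions."""
    return prod(bins_per_variable) * n_actions

# The example from the text: four 10-bin variables, a 20-value location, 8 actions.
print(q_table_size([10, 10, 10, 10, 20], 8))  # 1600000

# Coarser bins shrink the table dramatically:
print(q_table_size([3, 3, 3, 3, 5], 8))  # 3240
```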

Symptoms you’ve hit the “too many states” wall:

  • Training rewards improve very slowly or not at all, even with many episodes.
  • Small changes in seed cause huge differences in outcome (learning is fragile).
  • Most states are rarely visited; the agent behaves well only in familiar situations.

Before jumping to deep learning, you still have practical options:

  • Coarser discretization: fewer bins (e.g., hunger low/medium/high).
  • Feature pruning: remove state variables that don’t change decisions much.
  • Smarter episode design: start episodes from varied initial states to cover more of the state space.

Saving and loading helps here too. You can train longer across multiple sessions by persisting the Q-table (e.g., JSON, CSV, or a binary format). Just save not only Q-values, but also the mapping of states/actions and your discretization rules. A common mistake is loading a Q-table after changing the state definition—values no longer align, and behavior becomes nonsensical.
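One way to catch that mistake automatically is to store the state/action definitions alongside the Q-values and verify them on load. This is a sketch; the checkpoint layout and function names are my own:

```python
import json

def save_checkpoint(path, q_entries, actions, state_spec):
    """Persist Q-values together with the action list and discretization
    spec, so a later load can detect a mismatched state definition."""
    with open(path, "w") as f:
        json.dump({"state_spec": state_spec, "actions": actions,
                   "q": q_entries}, f)

def load_checkpoint(path, expected_actions, expected_state_spec):
    with open(path) as f:
        data = json.load(f)
    if data["actions"] != expected_actions or data["state_spec"] != expected_state_spec:
        raise ValueError("Checkpoint was trained with a different state/action "
                         "definition; Q-values would not align.")
    return data["q"]
```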

Section 6.6: Roadmap to deeper RL (what to learn next and why)


When the world grows beyond what a table can handle, you replace the table with a function that estimates Q-values. That’s the core idea behind deeper RL: instead of storing Q(s,a) in a lookup table, you approximate it with a model that can generalize across similar states.

A low-overwhelm roadmap from here:

  • Better bookkeeping first: keep your clean project structure, reproducible runs, and separate training/testing. These habits carry into every advanced method.
  • Function approximation basics: learn the idea of using a linear model or small neural network to predict Q-values from state features.
  • DQN (Deep Q-Network): the classic next step for discrete actions. Key concepts to learn: experience replay (store transitions and sample batches) and target networks (stabilize learning).
  • Policy gradients: when actions are continuous or you want a direct policy. Start with REINFORCE conceptually, then look at actor-critic methods.
  • Environment design: reward shaping, curriculum learning (start easy, get harder), and how to avoid teaching “shortcut” behavior.

As a final checkpoint for this course, package your virtual pet demo like a tiny product: one command to train, one command to evaluate, and one option to render a short episode. In your demo, show: (1) training curve (moving average reward), (2) saved Q-table loaded into a fresh run, and (3) evaluation results over multiple episodes with exploration off. If you can do that, you’ve moved from “I ran RL once” to “I built an RL experiment I can trust and extend.”

Chapter milestones
  • Refactor into a clean project: environment, agent, training loop
  • Save and load what the pet learned
  • Run a final evaluation: training vs. testing
  • Next steps: bigger worlds and deeper methods (no overwhelm)
  • Final checkpoint: present your virtual pet RL demo
Chapter quiz

1. Why does Chapter 6 emphasize separating the environment, agent, and training loop into distinct parts?

Correct answer: To separate responsibilities so the project is easier to rerun, share, and extend safely
The chapter focuses on engineering judgment: clean separation makes the RL project reproducible, maintainable, and safer to extend.

2. What is the main purpose of saving and loading what the pet learned?

Correct answer: To reuse a trained policy/Q-values without retraining from scratch and to support reproducible demos
Persisting what was learned lets you reload and demonstrate or continue training without repeating the full training process.

3. What does a fair final evaluation need to distinguish, according to the chapter?

Correct answer: Trained behavior during learning vs. tested behavior after training
The chapter stresses evaluation that separates training performance from testing performance to avoid misleading conclusions.

4. What is the point of making results reproducible in this packaged project?

Correct answer: So you (and others) can rerun the same project and trust comparisons when you make changes
Reproducibility supports reliable iteration and sharing by making it easier to compare runs and confirm improvements.

5. When scaling to bigger worlds, what does Chapter 6 suggest you watch for with Q-learning using a Q-table?

Correct answer: It can reach its limits as the problem grows, signaling a need for deeper methods
The chapter previews that Q-tables can struggle as environments get larger, motivating more advanced approaches.