Reinforcement Learning — Beginner
Create a beginner-friendly AI that learns by trial and reward
This beginner course is a short, book-style journey into reinforcement learning, one of the most exciting areas of AI. Instead of asking a computer to learn from a fixed list of answers, reinforcement learning teaches an AI to improve through practice. It tries actions, gets rewards or penalties, and slowly learns what works best. If you have never studied AI, coding, or data science before, this course is designed for you.
You will start with the core idea in plain language: learning by trial and error. From there, you will build a simple world for your AI, define what it can do, and show it how to improve one step at a time. Each chapter builds on the one before it, so you never have to guess what comes next. By the end, you will have created your first small learning AI and understand why it gets better with practice.
Many reinforcement learning resources assume you already know programming, advanced math, or machine learning terms. This course does the opposite. It explains every concept from first principles and uses small examples to make abstract ideas feel concrete. You will not be rushed into difficult theory. Instead, you will build confidence through simple progress.
Throughout the course, you will create a tiny learning agent that makes decisions inside a simple environment. First, you will define the rules of that environment. Then, you will decide what counts as a good result by assigning rewards. After that, you will guide the AI through repeated practice rounds so it can improve its choices over time.
You will also learn how a value table works, why an AI sometimes needs to try new actions, and how to tell whether learning is really happening. These ideas are the foundation of reinforcement learning, and they will help you understand more advanced systems later on.
This course is organized like a short technical book with six chapters. Chapter 1 introduces the basic idea of AI that learns through experience. Chapter 2 helps you build the world your AI will explore. Chapter 3 explains how the AI starts making better choices. Chapter 4 turns that logic into a simple learning algorithm. Chapter 5 helps you improve results and fix common mistakes. Chapter 6 brings everything together in a small project you can complete and share.
Because the structure is progressive, every chapter feels useful and connected. You are not just reading definitions. You are building understanding in the right order, with each piece supporting the next.
Reinforcement learning is used in game-playing systems, robotics, recommendation engines, and decision-making tools. While this course keeps things simple, the basic ideas you learn here are the same ones behind much larger AI systems. Understanding rewards, actions, and learning loops gives you a strong foundation for future study.
If you want a practical and friendly first step into AI, this course is a great place to begin. It focuses on understanding, not memorization, and helps you build something real from the start.
By the end of this course, you will be able to explain reinforcement learning clearly, read simple agent logic, design rewards, and build a small AI that improves with practice. You will also know how to test your results and how to continue learning after the course ends.
If you are ready to begin, register for free and start building your first learning AI. You can also browse all courses to explore more beginner-friendly AI topics.
Machine Learning Educator and AI Systems Specialist
Sofia Chen teaches artificial intelligence to beginners through simple, hands-on learning experiences. She has designed practical AI lessons for students, career changers, and teams who need clear explanations without heavy math or jargon.
Reinforcement learning is one of the most intuitive ways to think about artificial intelligence because it mirrors a familiar human pattern: try something, observe what happens, and adjust next time. In this course, you will build your first reinforcement learning AI by starting from that simple idea rather than from heavy math. The goal of this chapter is to help you form a clean mental model before you write code. If you understand the moving parts clearly, the code in later chapters will feel logical instead of mysterious.
At its core, reinforcement learning is about decision-making through experience. An AI system is placed in a situation, makes a choice, receives feedback, and gradually learns which choices tend to lead to better outcomes. This makes reinforcement learning different from many other types of AI. In image classification, for example, the model is usually shown correct answers directly during training. In reinforcement learning, the system often has to discover good behavior by interacting with a world, sometimes making many poor choices before it improves.
This chapter introduces that idea in practical, everyday terms. You will see how trial-and-error learning works in daily life, what makes reinforcement learning distinct, and how the four core parts of a learning system fit together: agent, environment, action, and reward. You will also preview a tiny learning problem that we can solve with beginner-friendly code. Finally, you will set expectations for the simple tools used in this course so you can focus on understanding rather than setup confusion.
As you read, keep one engineering mindset in view: reinforcement learning is not magic. It is a structured loop. The AI observes, acts, gets feedback, and updates what it will do next. Good results depend on good problem framing, clear rewards, and patient testing. A small example that works cleanly teaches more than a flashy example you do not fully understand.
By the end of this chapter, you should be able to explain reinforcement learning in simple everyday language, identify the four main parts of a learning system, and read a small piece of code with a clear sense of what it is trying to achieve. That foundation matters because every later chapter builds on this loop of actions, outcomes, and improvement.
Practice note for each lesson in this chapter (seeing how trial-and-error learning works in daily life, understanding what makes reinforcement learning different, naming the four core parts of a learning system, and setting up the simple tools used in this course): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
To learn from experience means improving behavior based on the results of past actions. This sounds obvious for people, but it is a powerful idea for AI. Imagine touching a hot pan once and then becoming much more careful around the stove. You did not need a spreadsheet of labeled examples. You acted, got feedback, and updated your future behavior. Reinforcement learning uses that same pattern in a computational form.
In practice, a reinforcement learning system does not begin with common sense. It starts with little or no knowledge of which choices are good. It tries actions in some environment, receives rewards or penalties, and slowly builds a preference for decisions that lead to better long-term results. The important phrase here is long-term. Sometimes a choice feels good immediately but causes problems later. A learning agent must discover that difference through repeated interaction.
This is one reason reinforcement learning can be challenging and interesting. The system often learns from delayed consequences, not just instant feedback. If a robot takes three correct turns and then reaches a goal, which turn mattered most? If a game agent collects a coin but falls into a trap later, was the coin worth it? These are practical questions about credit assignment, and they appear even in simple beginner projects.
From an engineering perspective, learning from experience is a loop: observe the current situation, choose an action, measure the outcome, and update the policy or value estimate. A policy is simply a rule for choosing actions, and a value estimate is the agent's current guess at how much reward a situation or action will lead to. At the beginning, the policy may be nearly random. Over time, it becomes more informed. A common beginner mistake is expecting the agent to improve after just a few attempts. Real learning usually requires many repetitions, even in toy problems.
A useful habit is to think of the agent as gathering evidence. Each attempt adds a little information. One success does not prove an action is always good, and one failure does not prove it is always bad. Better learning comes from repeated trials, careful reward design, and checking whether behavior improves consistently rather than accidentally.
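The evidence-gathering habit described above can be sketched as a running average of the rewards seen for each action. This is only an illustration under stated assumptions: the action names, the reward history, and the helper `update_average` are invented for the example and are not taken from the course code.

```python
# Hypothetical sketch: keep a running average of the reward seen for each action.
# The action names and the reward history below are made up for illustration.

def update_average(old_estimate: float, count: int, reward: float) -> float:
    """Return the new average after one more reward.

    `count` is the number of rewards seen so far, including this one.
    """
    return old_estimate + (reward - old_estimate) / count


estimates = {"left": 0.0, "right": 0.0}
counts = {"left": 0, "right": 0}

# Invented history of (action, reward) pairs from repeated attempts.
history = [("right", 1.0), ("left", 0.0), ("right", 0.0), ("right", 1.0), ("left", 0.0)]

for action, reward in history:
    counts[action] += 1
    estimates[action] = update_average(estimates[action], counts[action], reward)

print(estimates)
```

Notice that the single failure for "right" lowers its estimate without erasing the two successes. Each attempt shifts the estimate a little, which is the sense in which the agent gathers evidence.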
Daily life is full of reinforcement learning patterns. A child learning to ride a bicycle adjusts balance after each wobble. A commuter experiments with different routes to work and eventually learns which path is fastest at certain times of day. A person trying to cook better learns from taste, timing, and small mistakes. In all these cases, there are choices, outcomes, and some form of reward. The reward might be explicit, like arriving earlier, or informal, like satisfaction from a smoother ride.
These examples matter because they show that reinforcement learning is not a strange technical trick. It is a formal way to model adaptive decision-making. The system does not need a teacher to explain every correct move in advance. Instead, it uses feedback from the world. That feedback can be positive, negative, immediate, delayed, noisy, or incomplete. In real engineering work, dealing with imperfect feedback is often more important than writing the update formula itself.
Consider a phone keyboard suggesting words. If a user frequently accepts certain suggestions and ignores others, the system can learn preferences. Or imagine training a vacuum robot. It receives positive feedback for covering new floor efficiently and negative feedback for getting stuck or wasting battery. The interesting part is that no single step explains the full task. The robot must learn sequences of useful actions.
When designing beginner projects, simple rewards are best. If the reward rule is unclear, the agent may learn surprising or useless behavior. For example, if you reward a game agent for moving but forget to reward reaching the goal, it may learn to wander endlessly. This is a classic reinforcement learning lesson: the agent follows the reward you actually define, not the intention you had in mind.
A practical takeaway is to describe behavior in terms of measurable outcomes. Ask: what should count as success, what should count as failure, and what should be neutral? Clear answers make later coding far easier. Ambiguous rewards create confusing training results and make debugging much harder than necessary.
Every reinforcement learning system can be understood through four core parts. First is the agent, the learner or decision-maker. This is the AI that chooses what to do. Second is the environment, the world the agent interacts with. It could be a game board, a grid of rooms, a robot simulator, or any system that reacts to actions. Third are the actions, the possible choices the agent can make at each step. Fourth is the reward, the feedback signal that tells the agent whether an outcome was desirable.
These four pieces form the central loop of reinforcement learning. The environment presents a situation. The agent selects an action. The environment changes state and returns a reward. The agent then uses that information to improve future decisions. If you understand this loop, you understand the backbone of the field.
Let us make it concrete with a simple maze example. The agent is a small character. The environment is the maze grid. The actions are move up, down, left, or right. The reward might be +10 for reaching the exit, -5 for hitting a trap, and -1 for each step to encourage shorter paths. Even this tiny setup can produce meaningful learning. Over repeated episodes, the agent discovers which actions tend to lead to the exit efficiently.
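To see how those maze numbers add up, here is a small sketch of the reward rule as a function. The cell labels ("exit", "trap", "empty") and the two paths are hypothetical, and the sketch assumes that the move into the exit earns only the +10 with no extra step cost, which the description above leaves open.

```python
# Illustrative reward rule for the maze described above.
# Cell labels and paths are hypothetical names chosen for this sketch.
# Assumption: the move into the exit earns +10 with no additional step cost.

EXIT_REWARD = 10
TRAP_PENALTY = -5
STEP_COST = -1


def maze_reward(cell: str) -> int:
    """Score the cell the agent has just moved into."""
    if cell == "exit":
        return EXIT_REWARD
    if cell == "trap":
        return TRAP_PENALTY
    return STEP_COST


# Each list names the cells entered, in order, along one made-up path.
short_path = ["empty", "empty", "exit"]
long_path = ["empty", "empty", "empty", "empty", "empty", "exit"]

short_total = sum(maze_reward(cell) for cell in short_path)
long_total = sum(maze_reward(cell) for cell in long_path)
print(short_total, long_total)  # 8 5
```

Both paths reach the exit, but the shorter one ends with a higher total because it pays the step cost fewer times. That is how the -1 per step encourages shorter routes.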
Good engineering judgment starts with naming these parts clearly before coding. Write down the agent, the environment, the action set, and the reward rule in plain language. Many beginner bugs happen because one of these definitions is vague. For example, if an action should be impossible at a wall, what exactly happens? Does the agent stay still? Does it get penalized? Does the episode end? Those choices affect learning behavior.
Another common mistake is treating reward as the same thing as success. Reward is a numeric signal used during learning, not a philosophical truth. You must design it carefully. Small negative step costs, large goal rewards, and penalties for failure are common patterns, but each changes what the agent learns to optimize. A useful mindset is that rewards are instructions written in numbers.
A reinforcement learning agent improves over time because it updates its behavior using evidence from repeated interaction. At first, the agent may act randomly or nearly randomly. This is not a flaw. It is part of the learning process. The agent must explore enough to discover which actions lead to good outcomes. If it never tries new actions, it may get stuck with mediocre behavior. If it explores forever without using what it has learned, it never settles into strong performance. Balancing these two needs is one of the central ideas in reinforcement learning.
This balance is often described as exploration versus exploitation. Exploration means trying actions to gather information. Exploitation means using the best-known action so far. Suppose an agent in a tiny game has learned that moving right often leads to a reward. It may choose right most of the time, but occasionally trying another direction could reveal an even better path. In beginner systems, this is often handled with a simple rule called epsilon-greedy: the agent usually picks the best-known action but, with a small probability called epsilon, picks a random action instead.
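A minimal sketch of the epsilon-greedy rule might look like the following. The function name `epsilon_greedy`, the example action values, and the choice of epsilon = 0.1 are assumptions made for illustration.

```python
import random

# Hypothetical sketch of epsilon-greedy action selection.
# The action values and epsilon = 0.1 are illustrative assumptions.


def epsilon_greedy(action_values: dict, epsilon: float, rng: random.Random) -> str:
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(action_values))         # explore
    return max(action_values, key=action_values.get)   # exploit


rng = random.Random(0)  # fixed seed so the run is repeatable
values = {"left": 0.1, "right": 0.7}

picks = [epsilon_greedy(values, epsilon=0.1, rng=rng) for _ in range(1000)]
print("fraction choosing right:", picks.count("right") / len(picks))
```

With these settings the agent chooses "right" most of the time but still tries "left" now and then. Raising epsilon means more exploration, and lowering it means more exploitation.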
The reason improvement happens is that experience changes the agent's estimates. It starts assigning higher values to actions or states associated with better rewards. Over many trials, useful patterns become stronger. Bad actions become less preferred. This can be stored in a simple table in small problems, which is ideal for learning the basics before moving to larger neural-network methods.
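To make the idea of a changing value table concrete, here is a sketch using one common tabular update, the Q-learning rule. The text above does not name a specific algorithm, so treat this choice, along with the learning rate `alpha`, the discount `gamma`, and the five-position layout, as assumptions for illustration.

```python
# One common tabular update (the Q-learning rule), used here as an assumption:
# the surrounding text does not say which update rule the course adopts.
# alpha (learning rate) and gamma (discount) are illustrative values.

ACTIONS = ["left", "right"]


def q_update(q, state, action, reward, next_state, done, alpha=0.5, gamma=0.9):
    """Nudge q[state][action] toward reward plus the discounted best next value."""
    best_next = 0.0 if done else max(q[next_state].values())
    target = reward + gamma * best_next
    q[state][action] += alpha * (target - q[state][action])


# Table for five positions (0 to 4), with every value starting at zero.
q = {state: {action: 0.0 for action in ACTIONS} for state in range(5)}

# The agent moves right from 3 into a goal at 4 and receives +1.
q_update(q, state=3, action="right", reward=1.0, next_state=4, done=True)
# Later it moves right from 2 to 3 with no immediate reward.
q_update(q, state=2, action="right", reward=0.0, next_state=3, done=False)

print(q[3]["right"], q[2]["right"])
```

After these two updates the stored values are 0.5 for moving right from position 3 and 0.225 for moving right from position 2. The second value is positive even though that move earned no reward itself, because it leads to a position the table already rates well.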
One practical caution is that progress is rarely smooth. Training curves may go up, then down, then up again. Randomness, sparse rewards, and incomplete exploration can make early learning look unstable. Beginners often assume something is broken when the agent performs poorly in the first few episodes. Sometimes it is broken, but often the solution is simply more episodes, clearer rewards, or better tracking of results.
To judge improvement, measure behavior over many runs, not just one lucky success. Log average reward, success rate, or average steps to goal. Reinforcement learning becomes much easier to reason about when you replace vague impressions with simple metrics. Even in a tiny project, measurement is part of learning engineering, not an optional extra.
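The three metrics mentioned above can be computed with a few lines of standard Python. The episode records below are invented for illustration, and `summarize` is a hypothetical helper name.

```python
from statistics import mean

# Made-up episode records for illustration: (total_reward, steps, reached_goal).
episodes = [
    (-1.0, 2, False),
    (1.0, 6, True),
    (1.0, 4, True),
    (-1.0, 4, False),
    (1.0, 2, True),
]


def summarize(records):
    """Return average reward, success rate, and average steps over successful episodes."""
    rewards = [reward for reward, _, _ in records]
    goal_steps = [steps for _, steps, reached_goal in records if reached_goal]
    return {
        "average_reward": mean(rewards),
        "success_rate": len(goal_steps) / len(records),
        "average_steps_to_goal": mean(goal_steps) if goal_steps else None,
    }


print(summarize(episodes))
```

Printing a summary like this every few hundred episodes is usually enough to tell whether behavior is improving or just fluctuating.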
Before building large systems, it helps to start with a tiny learning problem that is small enough to understand fully. Imagine a one-dimensional world with five positions in a row. The agent starts near the left side. It can choose only two actions: move left or move right. The goal is at the far right and gives a reward of +1. Each non-goal step gives 0, or perhaps a small negative reward like -0.01 to encourage efficiency. This may sound almost too simple, but it teaches the core loop cleanly.
In this problem, you can easily reason about what good behavior should look like. Moving right repeatedly should eventually reach the goal. Moving left is usually unhelpful. But the agent does not know that at the beginning. It needs to try actions, experience the result, and update its internal estimates. This is exactly the kind of small environment where tabular reinforcement learning shines. You can represent knowledge as a table of state-action values without introducing heavy frameworks.
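A table of state-action values for this world can be as plain as a list of rows, one per position. The numbers below are filled in by hand to show the layout and how it is read. They are not the result of any training run, and `greedy_action` is a hypothetical helper.

```python
# A hand-filled value table for the five-position world.
# The numbers are invented to show the layout, not produced by training.
ACTIONS = ("left", "right")

value_table = [
    # left, right
    [0.00, 0.10],  # position 0
    [0.05, 0.30],  # position 1
    [0.10, 0.50],  # position 2
    [0.20, 0.90],  # position 3
    [0.00, 0.00],  # position 4 (the goal, so no further action is needed)
]


def greedy_action(table, state):
    """Return the action name with the highest stored value in this state."""
    row = table[state]
    return ACTIONS[row.index(max(row))]


for state in range(4):
    print(state, greedy_action(value_table, state))
```

Reading the table row by row gives "right" for every non-goal position, which matches the behavior you would expect a trained agent to reach in this world.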
When you later read code for this kind of problem, focus on a few key pieces. How is the current state represented? How are legal actions listed? Where is the reward assigned? When does an episode end? And how does the agent update what it has learned after each action? These questions let you read reinforcement learning code as a workflow instead of as isolated syntax.
A common beginner mistake is to jump directly into a complex game and lose track of the learning logic. Start with tiny worlds where you can print every state and action if needed. If the agent fails, you should be able to inspect exactly what happened. Simplicity improves your debugging ability. It also teaches a valuable engineering lesson: if a method does not work in a tiny, controlled setting, it is unlikely to work in a bigger one.
By the end of this course, you should be comfortable creating and testing such small learning problems yourself. That skill transfers well. Once you can define states, actions, and rewards cleanly in simple systems, scaling up becomes a matter of complexity, not confusion.
The tools for this course should stay as light as possible. For your first reinforcement learning projects, you do not need an advanced research stack. A basic Python environment, a code editor, and a way to run scripts are enough. In many cases, standard Python plus simple libraries like NumPy will carry you through the early chapters. The point is not to build the most powerful agent immediately. The point is to understand the loop of observation, action, reward, and update with confidence.
Your setup should support experimentation. That means you can edit a file, run the program, print intermediate values, and make small changes quickly. If your environment is complicated or fragile, you will spend your energy fighting tools instead of learning reinforcement learning. This is why beginner-friendly engineering favors clear scripts and tiny environments over heavy abstractions.
A practical workflow for the course looks like this: first, run a tiny baseline agent, even if it acts randomly. Second, inspect the environment and verify rewards are given when expected. Third, add a simple learning rule. Fourth, track metrics such as total reward per episode. Fifth, tune one thing at a time, such as the learning rate or exploration amount. This step-by-step habit is more valuable than rushing to a final result.
There are also mindset tools you will need. Expect the first version to be imperfect. Expect bugs in reward design, episode termination, and state updates. Expect to print tables and inspect logs. Reinforcement learning often rewards patience and careful observation more than clever code. The beginners who improve fastest are usually the ones who test small pieces and reason about them clearly.
Your learning path in this book is designed to build confidence. You will move from everyday intuition to small code examples, from small code examples to simple agents, and from simple agents to evaluation and improvement. If you keep your focus on the core loop and on measurable behavior, you will not just run reinforcement learning code. You will understand what it is doing and why it improves.
1. What is the basic idea behind reinforcement learning in this chapter?
2. How is reinforcement learning different from many other types of AI described in the chapter?
3. Which set names the four core parts of a reinforcement learning system introduced in the chapter?
4. According to the chapter, what should learners focus on before writing code?
5. What does the chapter say leads to good reinforcement learning results?
Before an AI can learn, it needs a world to live in. In reinforcement learning, that world is called the environment. The environment is not just a background scene. It defines what the agent can see, what it is allowed to do, what happens after each choice, and how success or failure is measured. If Chapter 1 introduced the main cast of reinforcement learning, this chapter is where we build the stage they act on.
A beginner-friendly way to think about an environment is as a small game with rules. The game does not need graphics or complexity. In fact, simpler is better. A tiny grid, a line of numbered positions, or a room with a goal square is enough to teach the core ideas. The agent starts somewhere, takes actions such as moving left or right, and receives rewards or penalties based on what happens. Over time, those signals guide learning.
The quality of your environment matters a lot. If the rules are confusing, the rewards are inconsistent, or the stopping conditions are unclear, your AI may learn strange habits or fail to learn at all. This is an important engineering lesson: when reinforcement learning goes wrong, the model is not always the first thing to blame. Often the world you designed is sending mixed messages.
In this chapter, we will build a very small environment from first principles. We will define a clear goal, choose the actions the AI can take, decide what information counts as the current state, and track rewards, penalties, and outcomes. We will also run manual practice rounds ourselves before writing code. That last step may feel old-fashioned, but it is one of the most effective habits in machine learning engineering. If you cannot simulate the environment by hand, its rules are probably not yet precise enough to turn into code.
To keep the ideas concrete, imagine a simple one-dimensional world: five positions in a row, numbered 0 through 4. The agent starts at position 2. Position 4 is the goal and gives a positive reward. Position 0 is a bad outcome and gives a penalty. At each step, the agent may move left or right. The episode ends when it reaches either end. This small setup is enough to teach how environments work, how rewards guide behavior, and how trial and error can improve decisions over time.
As you read, pay attention to the design choices. Why not allow three actions instead of two? Why give a small step penalty or choose not to? Why end the episode at the goal instead of letting the agent keep moving? These are not cosmetic decisions. They shape what the AI can discover. Reinforcement learning is as much about designing the right problem as it is about solving it.
By the end of this chapter, you should be able to look at a beginner RL problem and describe it clearly: what the agent notices, what actions are possible, what outcomes matter, and when an attempt starts and stops. That clarity is the foundation for writing correct code and for understanding why a learning agent behaves the way it does.
Practice note for each lesson in this chapter (creating a simple environment with rules and goals, defining the actions your AI can take, and tracking rewards, penalties, and outcomes): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Reinforcement learning becomes much easier to understand when you turn a real task into a small game. A game has a start, a set of rules, legal moves, and a way to win or lose. That is exactly what an RL environment needs. The trick is not to capture every detail of the real world. The trick is to capture only the details that matter for learning the behavior you care about.
Suppose we want an AI to learn how to move toward a goal while avoiding a bad outcome. We can model that with a line of five positions: 0, 1, 2, 3, and 4. The agent begins at position 2. If it reaches 4, it succeeds. If it reaches 0, it fails. On each turn, it can move one step left or one step right. This is small enough to reason about, but still rich enough to demonstrate trial and error, consequences, and improvement.
Good environment design starts with a few practical questions. What counts as success? What counts as failure? What choices are realistic? What information does the agent need? If you cannot answer these simply, the task is probably too vague. A common beginner mistake is adding complexity too early, such as random teleports, too many positions, or many overlapping reward rules. That usually hides the core lesson instead of teaching it.
Engineering judgment matters here. Simplicity is not weakness. A tiny world is a test bench. It lets you verify that the learning loop works before you scale up. In professional work, teams often create a minimal environment first, confirm the reward logic, and only then add more states, obstacles, or uncertainty. If learning fails in the tiny version, adding complexity will not fix it. It will only make debugging harder.
So our first environment will have clear goals, clear boundaries, and very few moving parts. That makes it ideal for reading code later and for spotting mistakes quickly.
The state is the information the environment gives the agent at a particular moment. In our simple world, the most useful state is just the agent's current position: 0, 1, 2, 3, or 4. That single number tells the agent where it is and helps it decide what to do next. If it is at 3, moving right is attractive because the goal is nearby. If it is at 1, moving left is risky because failure is close.
A strong beginner habit is to ask: what is the smallest state that still supports good decisions? In this toy world, position alone is enough. We do not need the full history of moves, the time of day, or a description of all past rewards. Adding unnecessary details makes learning slower and harder to interpret. State design should be informative, but also compact.
There are two common mistakes here. The first is giving the agent too little information. If the state hid the current position entirely, the agent would have no basis for choosing left or right. The second is giving too much irrelevant information, which can distract the learner and expand the problem size for no benefit. In larger projects, poor state design is one of the biggest reasons RL systems perform badly.
Another practical point: state should describe the world before the action is chosen. After the action, the environment updates to a new state. This separation is important for clean code and clean thinking. At each step: observe state, choose action, apply rules, receive reward, move to next state. That loop is the heartbeat of reinforcement learning.
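The observe, choose, apply, reward, move loop can be sketched in a few lines for the five-position world. The agent here picks actions at random, so nothing is learned yet. The reward values of +1 at position 4 and -1 at position 0 are assumed for this sketch, and `step` is a hypothetical helper name.

```python
import random

# Sketch of the step loop for the five-position world (start at 2, ends at 0 or 4).
# Assumption: +1 for reaching 4, -1 for reaching 0, 0 otherwise.


def step(position, action):
    """Apply the rules and return (next_position, reward, done)."""
    next_position = position + (1 if action == "right" else -1)
    if next_position == 4:
        return next_position, 1.0, True
    if next_position == 0:
        return next_position, -1.0, True
    return next_position, 0.0, False


rng = random.Random(0)  # fixed seed so the run is repeatable
state = 2               # observe the starting state
done = False
trace = []

while not done:
    action = rng.choice(["left", "right"])           # choose an action
    next_state, reward, done = step(state, action)   # apply rules, receive reward
    trace.append((state, action, reward, next_state))
    state = next_state                               # move to the next state

for row in trace:
    print(row)
```

Each printed row shows the state before the action, the action, the reward, and the state afterward, which keeps the "before" and "after" views of the world clearly separate.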
For our environment, we will track at least these pieces of information internally: current position, whether the episode is done, and the latest reward after an action. The visible state can remain just the current position. That keeps the learning problem simple while still letting the environment manage outcomes correctly.
Actions are the choices the agent can make at each step. In our world, the action set is deliberately tiny: left or right. This may seem almost too simple, but that simplicity is useful. It isolates the central RL idea that choices lead to consequences. When the action space is small, you can focus on whether the rewards and transitions make sense instead of getting lost in too many possibilities.
Defining actions means deciding what the AI is allowed to control. That is an engineering decision, not just a coding detail. If actions are too weak, the agent may be unable to solve the task. If they are too broad or unrealistic, the task may become trivial or unstable. For example, if we allowed the agent to jump directly to any position, it would bypass the whole point of learning step-by-step movement. If we allowed a "stay still" action, the agent might exploit it depending on the reward design.
Every action should have a clear effect. In our environment, moving left subtracts 1 from the position unless the episode is already over. Moving right adds 1. The environment then checks whether the new position is terminal. This is a good pattern: define actions in concrete, deterministic terms before you add randomness. Beginner projects benefit from predictable transitions because they are easier to test.
A common mistake is forgetting to define what happens at boundaries. What if the agent is at 0 and chooses left? In our design, that situation should never occur during active play because the episode ends at terminal states. That keeps the rules clean. Another option would be to keep the agent at 0 and apply a penalty, but that changes the learning signals. The important point is consistency. The environment should not improvise.
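One way to hold these rules together is a small class that tracks position, whether the episode is done, and the latest reward. This is a sketch rather than the course's own implementation: the class name `LineWorld`, the `reset` and `step` method names, and the reward values of +1, -1, and 0 are assumptions. It enforces the boundary rule by refusing to step once the episode has ended.

```python
class LineWorld:
    """Five positions in a row (0 to 4); reaching 0 or 4 ends the episode.

    Hypothetical sketch: names and reward values are illustrative assumptions.
    """

    START, FAILURE, GOAL = 2, 0, 4

    def __init__(self):
        self.reset()

    def reset(self):
        """Begin a new episode and return the starting state."""
        self.position = self.START
        self.done = False
        self.last_reward = 0.0
        return self.position

    def step(self, action):
        """Move one step and return (state, reward, done)."""
        if self.done:
            raise RuntimeError("Episode is over; call reset() first.")
        if action not in ("left", "right"):
            raise ValueError(f"Unknown action: {action!r}")
        self.position += 1 if action == "right" else -1
        if self.position == self.GOAL:
            self.last_reward, self.done = 1.0, True
        elif self.position == self.FAILURE:
            self.last_reward, self.done = -1.0, True
        else:
            self.last_reward = 0.0
        return self.position, self.last_reward, self.done


env = LineWorld()
print(env.reset())        # 2
print(env.step("right"))  # (3, 0.0, False)
print(env.step("right"))  # (4, 1.0, True)
```

Raising an error after the episode ends is a deliberate choice here. It turns a silent rule violation, such as moving left from position 0, into a visible bug instead of letting the environment improvise.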
By keeping the action set small and explicit, you make the first learning problem understandable. Later chapters can expand the choices, but the core workflow begins here.
Rewards are how you communicate priorities to the agent. They do not tell it exactly what action to take. Instead, they score what happened after an action. In our environment, we can use a simple reward scheme: +1 for reaching position 4, -1 for reaching position 0, and 0 for all non-terminal moves. This is enough to teach the basic idea that some outcomes are good and others are bad.
Reward design sounds easy, but it is one of the hardest practical parts of reinforcement learning. The AI will optimize what you reward, not what you meant. If your rewards are vague or conflicting, the agent may learn surprising shortcuts. For example, if we accidentally gave +1 for every move instead of just reaching the goal, the agent might prefer wandering forever rather than finishing. That is not the AI being foolish. It is the AI following the scoring system you wrote.
You can also shape behavior with smaller signals. Some environments use a small step penalty like -0.01 on every move to encourage faster solutions. That can be helpful, but beginners should use it carefully. Extra reward terms make the system more expressive, but also easier to mis-specify. Start with the simplest reward that reflects success and failure. Add shaping only when you have a clear reason.
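A little arithmetic shows how much the reward scheme changes what counts as a good episode. The sketch below scores three hand-written paths under three schemes: the simple one from above, a version with a -0.01 step penalty, and the accidental "+1 for every move" loophole. The paths, the scheme names, and the helper `episode_return` are invented for illustration. The sketch assumes the step reward also applies on the final move, and that the loophole scheme leaves the -1 failure penalty unchanged.

```python
# Compare total reward for hand-written paths under three reward schemes.
# Assumptions: the step reward applies on every move, including the last one,
# and the loophole scheme keeps the -1 failure penalty.


def episode_return(positions, step_reward, goal_reward, fail_reward):
    """Sum the rewards along a list of visited positions (start not included)."""
    total = 0.0
    for position in positions:
        total += step_reward
        if position == 4:
            total += goal_reward
        elif position == 0:
            total += fail_reward
    return total


paths = {
    "direct": [3, 4],                             # 2 moves, reaches the goal
    "detour": [3, 2, 3, 4],                       # 4 moves, reaches the goal
    "wandering": [3, 2, 3, 2, 3, 2, 3, 2, 3, 2],  # 10 moves, never finishes
}

schemes = {
    "sparse": dict(step_reward=0.0, goal_reward=1.0, fail_reward=-1.0),
    "step penalty": dict(step_reward=-0.01, goal_reward=1.0, fail_reward=-1.0),
    "loophole": dict(step_reward=1.0, goal_reward=0.0, fail_reward=-1.0),
}

results = {
    scheme: {name: episode_return(path, **params) for name, path in paths.items()}
    for scheme, params in schemes.items()
}

for scheme, totals in results.items():
    print(scheme, totals)
```

Under the sparse scheme the direct path and the detour score the same, so nothing favors speed. The step penalty makes the direct path slightly better, and the loophole scheme rates aimless wandering highest of all.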
In code, each action should produce a reward and an outcome. Outcome tracking means recording whether the agent reached the goal, hit a failure state, or is still in progress. This is useful not only for learning, but also for debugging and evaluation. When training later, you will often inspect logs such as total reward per episode, number of successful episodes, and average steps to finish.
The practical rule is simple: reward what truly matters, avoid accidental loopholes, and keep the signals understandable enough that you can explain them out loud.
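As a rough sketch, the reward scheme and outcome tracking described in this section could be written as one small function. The name `reward_and_outcome`, the outcome labels, and the optional `step_reward` parameter are assumptions for illustration; the default of 0 matches the simple scheme, and passing -0.01 shows the step-penalty variant.

```python
GOAL, FAILURE = 4, 0


def reward_and_outcome(new_position, step_reward=0.0):
    """Score the position reached by a move and label the outcome.

    step_reward is the reward for non-terminal moves (0.0 in the simple
    scheme; a small negative number adds a step penalty).
    """
    if new_position == GOAL:
        return 1.0, "success"
    if new_position == FAILURE:
        return -1.0, "failure"
    return step_reward, "in_progress"


print(reward_and_outcome(4))                     # (1.0, 'success')
print(reward_and_outcome(0))                     # (-1.0, 'failure')
print(reward_and_outcome(3))                     # (0.0, 'in_progress')
print(reward_and_outcome(3, step_reward=-0.01))  # (-0.01, 'in_progress')
```

Keeping the reward logic in one place makes it easier to spot accidental loopholes, because every rewarded situation is visible in a single function.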
An episode is one complete attempt from start to finish. In our world, an episode begins when the agent starts at position 2 and ends when it reaches either terminal state: 4 for success or 0 for failure. Episodes matter because reinforcement learning is organized around repeated attempts. The agent tries, receives rewards, stops, resets, and tries again.
Clear stopping points are essential. Without them, the environment can drift into endless loops, making training noisy and evaluation confusing. In some tasks, the stopping point is obvious, like reaching a goal square. In others, you also add a maximum step limit, such as ending the episode after 10 moves even if no terminal state has been reached. That is a useful safety feature. It prevents badly designed policies from running forever and gives you a consistent structure for analysis.
Goals should be explicit and measurable. "Do well" is not a goal. "Reach position 4 before reaching position 0" is a goal. Good RL environments define success in a way that can be checked by code every single step. This sounds basic, but it prevents a lot of beginner confusion. If you cannot write the stopping condition as a precise rule, your environment is not ready.
There is also an engineering benefit to terminal states. They create natural data summaries. At the end of each episode, you can report total reward, number of steps, final position, and whether the outcome was success or failure. Those summaries will become your first evaluation metrics. Later, when the AI starts learning, you will compare episodes over time to see whether behavior is improving.
For our environment, the episode logic is straightforward: start at 2, keep stepping until the agent reaches 0 or 4, then reset for the next run. That structure gives us a clean unit of experience for training and testing.
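The episode logic above, together with the optional step limit, might look like the following sketch. The function names, the -1/+1 action encoding, and the summary dictionary are illustrative assumptions; the step limit of 10 is the example value mentioned earlier in this section.

```python
START, GOAL, FAILURE = 2, 4, 0


def step(position, action):
    """Apply an action (-1 for left, +1 for right).

    Returns (new_position, reward, done).
    """
    position += action
    if position == GOAL:
        return position, 1.0, True
    if position == FAILURE:
        return position, -1.0, True
    return position, 0.0, False


def run_episode(choose_action, max_steps=10):
    """Run one episode from START and return a summary of what happened."""
    position, total_reward, steps, done = START, 0.0, 0, False
    while not done and steps < max_steps:
        position, reward, done = step(position, choose_action(position))
        total_reward += reward
        steps += 1
    if position == GOAL:
        outcome = "success"
    elif position == FAILURE:
        outcome = "failure"
    else:
        outcome = "step_limit"  # stopped by the safety limit, not by a terminal state
    return {
        "total_reward": total_reward,
        "steps": steps,
        "final_position": position,
        "outcome": outcome,
    }


print(run_episode(lambda position: +1))  # always move right
print(run_episode(lambda position: -1))  # always move left
```

The summary returned at the end of each episode is the kind of record that later becomes an evaluation metric.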
Before writing or trusting learning code, test the environment manually. This is one of the best practical habits in reinforcement learning. You act as the agent, choose actions step by step, and verify that the world responds exactly as intended. Manual practice rounds reveal logic bugs early: wrong rewards, missing stop conditions, impossible moves, or state updates that happen in the wrong order.
Let us run two sample episodes. Episode A: start at position 2. Choose right, move to 3, reward 0, episode continues. Choose right again, move to 4, reward +1, episode ends with success. Episode B: start at 2. Choose left, move to 1, reward 0, continue. Choose left again, move to 0, reward -1, episode ends with failure. These two tiny traces already validate much of the environment design.
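While testing, write down each step in a small table or notebook, recording for example the step number, the position before the move, the action, the new position, the reward, and whether the episode ended.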
This simple record makes hidden problems visible. If a reward appears before the move instead of after it, you will notice. If the episode keeps going after reaching position 4, you will notice. If reset does not return to position 2, you will notice. Manual testing gives you confidence that when the agent behaves strangely later, the issue is more likely in the learning logic than in the environment itself.
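If you prefer to let a script keep the record, the two sample episodes can be replayed and printed as a small table. This is only a sketch: the `replay` helper and the column names are hypothetical, and the rules are re-stated inline so the example runs on its own.

```python
START, GOAL, FAILURE = 2, 4, 0
MOVES = {"left": -1, "right": +1}


def replay(actions):
    """Replay a fixed list of actions from START and return one row per step."""
    rows = []
    position = START
    for step_number, action in enumerate(actions, start=1):
        new_position = position + MOVES[action]
        if new_position == GOAL:
            reward, done = 1, True
        elif new_position == FAILURE:
            reward, done = -1, True
        else:
            reward, done = 0, False
        rows.append((step_number, position, action, new_position, reward, done))
        position = new_position
        if done:
            break  # nothing may happen after a terminal state
    return rows


for name, actions in [("A", ["right", "right"]), ("B", ["left", "left"])]:
    print(f"Episode {name}")
    print("step  position  action  new_position  reward  done")
    for row in replay(actions):
        print("{:<6}{:<10}{:<8}{:<14}{:<8}{}".format(*row))
```

Comparing the printed rows with the traces you worked out by hand is a quick way to confirm that code and design agree.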
A common beginner mistake is skipping this step because the environment feels "too simple to test." But simple environments are exactly where you should build good habits. In real projects, environment bugs can waste hours or days because the learner will still produce output, just based on broken rules. Hand simulation is a fast reality check.
Once your manual runs match your expectations, you are ready for code. At that point, your environment is no longer an abstract idea. It is a fully specified world with states, actions, rewards, outcomes, and clean episode boundaries. That is the foundation your first learning AI will explore in the next chapter.
1. In reinforcement learning, what is the main role of the environment?
2. Why does the chapter recommend starting with a very simple environment?
3. What problem can happen if an environment has confusing rules or inconsistent rewards?
4. In the chapter's one-dimensional example, when does an episode end?
5. Why does the chapter suggest running manual practice rounds before writing code?
Earlier in the course, you met the basic idea of reinforcement learning: an agent takes actions inside an environment and receives rewards. That idea is simple, but a useful learning system needs one more ingredient: memory. If an AI could not remember which moves helped and which moves hurt, it would behave like a beginner forever. This chapter is about giving your AI a very small but powerful memory so it can improve step by step.
We will use a beginner-friendly tool called a value table. A value table is exactly what it sounds like: a table of scores that estimates how good different choices are. Instead of storing giant amounts of data or using complicated math, we start with a small table that the agent updates after each result. This makes reinforcement learning visible. You can inspect the table, watch values rise or fall, and reason about why the agent starts preferring some actions over others.
That visibility matters. When people first learn reinforcement learning, they often focus too quickly on code or formulas and forget the engineering judgement behind the system. A learning agent is not just “running”; it is collecting evidence. Every move produces information. A reward says, in effect, “this direction seems useful” or “this direction seems costly.” Over time, the AI remembers these signals and turns random choices into informed choices.
This chapter also introduces one of the most important balancing acts in reinforcement learning: sometimes the agent should repeat a move that already seems good, and sometimes it should try something new in case there is an even better option. That tradeoff is called exploration versus exploitation. In practical projects, this balance is where many beginner agents either get stuck too early or wander forever without improving. You will learn how to think about that tradeoff in plain language and how to see it in code and results.
Finally, we will look at learning over many rounds, often called episodes. Reinforcement learning almost never looks impressive after one or two attempts. Improvement appears through repetition. The agent tries, gets feedback, updates its values, and tries again. As these episodes accumulate, the pattern of behavior changes. The AI starts making better decisions not because it was told the perfect answer, but because it learned from trial and error. That is the heart of reinforcement learning, and this chapter turns that idea into something concrete, testable, and readable.
By the end of this chapter, you should be able to read a tiny learning loop, explain what it is doing, and judge whether the agent is actually improving. That combination of concept, workflow, and interpretation is what turns reinforcement learning from a buzzword into an engineering tool.
Practice note for this chapter's three lessons (understand how the AI remembers good and bad moves, use a simple value table to guide decisions, and balance trying new moves with repeating good ones): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
At the beginning of training, a reinforcement learning agent usually knows nothing. If it has four possible moves, each move is just a guess. That is not a flaw; it is the correct starting point. The system has not yet earned any reason to prefer one action over another. In early episodes, behavior may look clumsy, inconsistent, or wasteful. A beginner often sees this and thinks the code is broken. Usually, the agent is simply acting before it has enough experience.
The shift from random behavior to informed behavior happens when the AI starts attaching memory to actions. Imagine a tiny maze game. The agent moves left, right, up, or down. If moving right near the goal often leads to a reward, the AI should gradually remember that “right here is promising.” If moving into a wall produces no progress or a penalty, that move should lose appeal. Over time, the agent is no longer choosing blindly. It is using stored experience.
This is the key practical mindset: reinforcement learning is not magic decision-making, it is evidence-weighted decision-making. The agent tries actions, sees outcomes, and updates what it believes. In a simple beginner project, these beliefs can be stored as numbers in a table. Larger systems use more advanced function approximators, but the learning logic is the same: experience should change future choices.
From an engineering point of view, this means you should not evaluate the agent too early. A single bad move says very little. A pattern across many attempts says much more. Also, you should keep the environment simple enough that useful patterns can be learned. If your state descriptions are confusing or rewards are inconsistent, the agent cannot form reliable preferences. Good beginner design means the AI has a fair chance to connect actions to outcomes.
A common mistake is to think the reward alone is the intelligence. It is not. Reward is the signal; memory is what turns that signal into improved behavior. Without storing and updating some estimate of action quality, the agent just keeps reacting moment by moment. With memory, random choices begin turning into informed choices, which is exactly the transformation we want.
A value table is one of the clearest ways to teach an AI to remember good and bad moves. In simple terms, it is a lookup table that stores a score for actions, or for state-action pairs. The score represents the agent’s current estimate of how useful that choice is. If a move often leads to better results, its value should rise. If it tends to lead to poor results, its value should fall.
Suppose your environment has a few states, like positions on a small grid, and in each state the agent can take a few actions. The table might look like this in concept: for state A, action left has value 0.1, action right has value 0.8, action up has value -0.2, and action down has value 0.0. Those numbers are not facts about the world. They are the AI’s current guesses based on what it has experienced so far. That is an important distinction. Values begin as rough estimates and improve through training.
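The conceptual table above maps naturally onto a nested dictionary. The sketch below uses the example numbers from the text; the names `value_table` and `best_action` are hypothetical.

```python
# Current guesses for state "A", using the example values from the text.
value_table = {
    "A": {"left": 0.1, "right": 0.8, "up": -0.2, "down": 0.0},
}


def best_action(table, state):
    """Return the action with the highest current estimate in this state."""
    actions = table[state]
    return max(actions, key=actions.get)


print(value_table["A"])               # the whole row can be inspected directly
print(best_action(value_table, "A"))  # right
```

Remember that these numbers are estimates, so the "best" action reported here can change as training continues.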
The practical power of a value table is transparency. You can print it. You can inspect it after every episode. You can compare values before and after a reward. That makes debugging much easier than in systems where learning is hidden inside a complex model. If the agent keeps making a poor move, you can ask: did the table ever assign a strong value to that move? If not, maybe the reward signal is too weak, the update rule is wrong, or the state representation is missing useful information.
Engineering judgement matters here too. A value table works best when the number of states and actions is small enough to store directly. For a tiny project, that is ideal. If the environment becomes huge, tables become hard to manage because many state-action combinations may never be visited enough times. For this course, though, the table is the right tool because it helps you understand the learning mechanism without extra complexity.
A common beginner mistake is to treat the highest current value as permanent truth. Early in learning, values are noisy and incomplete. Another mistake is to define states too vaguely, so the same table entry mixes together situations that should be treated differently. A useful value table needs clear states, meaningful rewards, and enough repeated practice to make the stored numbers reflect real patterns.
The table becomes useful only when its values change in response to experience. After each action, the agent observes a result: perhaps a reward, perhaps a penalty, perhaps a transition to a better or worse state. Then it updates the stored value for the action it just took. This is the moment where learning actually happens.
At a practical level, the update process is simple: compare the old estimate with the new evidence, then move the value a little toward what the result suggests. If the move turned out better than expected, increase the value. If it turned out worse than expected, decrease it. The amount of adjustment is often controlled by a learning rate. A high learning rate makes the agent change its mind quickly. A low learning rate makes it more cautious and stable.
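A minimal sketch of that idea follows. It is a deliberately simplified update that uses only the observed result, not the full Q-learning rule; the function name `update_value` and the sample rewards are assumptions for illustration.

```python
def update_value(old_value, observed, learning_rate):
    """Move the old estimate part of the way toward the observed result."""
    return old_value + learning_rate * (observed - old_value)


value = 0.0
for reward in [1.0, 1.0, 1.0]:
    value = update_value(value, reward, learning_rate=0.5)
    print(value)
# Prints 0.5, then 0.75, then 0.875: each step closes half of the remaining gap.
```

With a smaller learning rate, the same three rewards would move the value more slowly, which is the cautious and stable behavior described above.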
Here is the plain-language workflow. First, the agent is in a state. Second, it chooses an action. Third, the environment returns a reward and a new state. Fourth, the agent updates the table entry for the action it took. Fifth, the process repeats. In code, this is often just a few lines inside a loop, but those few lines are doing the main intellectual work of reinforcement learning: converting outcomes into improved future estimates.
Good engineering judgement means choosing update settings that match the problem. If rewards are noisy, changing values too aggressively can make the agent unstable. If updates are too tiny, learning can be painfully slow. You should also check whether the reward arrives immediately or after several steps. If the goal reward is delayed, then the update rule needs to help earlier actions receive some credit for eventually leading to success.
A common mistake is updating the wrong table entry, such as the current state instead of the state-action pair that caused the result. Another is forgetting that one surprising reward should not erase all previous experience. That is why gradual updates are useful. They let the agent learn from each result without overreacting. As more episodes accumulate, the values become less random and more informative, which is exactly what you want in a beginner-friendly learning AI.
Once your AI has a value table, a new question appears: should it always choose the action with the highest value? Sometimes yes, but not always. If the agent only repeats the current best-looking move, it may miss a better option it has not tested enough. On the other hand, if it keeps trying random actions forever, it may never benefit from what it has already learned. This is the exploration versus exploitation tradeoff.
Exploitation means using current knowledge. If the table says action right has the highest value in this state, exploitation means choosing right. Exploration means trying something else occasionally, even if it does not look best at the moment, because the agent needs more information. In everyday terms, exploitation is ordering your favorite meal because you know you like it. Exploration is trying a new dish because it might be even better.
A common beginner strategy is epsilon-greedy choice. Most of the time, the agent picks the action with the highest current value. A small percentage of the time, controlled by epsilon, it picks a random action instead. This method is popular because it is easy to implement and easy to reason about. If epsilon is too high, learning can look chaotic because the agent keeps ignoring its own knowledge. If epsilon is too low too early, the agent may settle for a mediocre policy and stop discovering alternatives.
Good engineering judgement often means changing exploration over time. Early in training, higher exploration makes sense because the AI knows very little. Later, you usually reduce exploration so the agent can make stronger use of what it has learned. This gradual reduction is often called epsilon decay. It reflects a sensible workflow: first gather information widely, then act more confidently.
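One simple way to reduce exploration over time is to multiply epsilon by a constant factor each episode and stop at a minimum value. The exponential schedule, the function name `decayed_epsilon`, and the default numbers below are assumptions chosen for illustration; other schedules are equally valid.

```python
def decayed_epsilon(episode, start=1.0, floor=0.05, decay_rate=0.99):
    """Shrink epsilon a little every episode, never dropping below floor."""
    return max(floor, start * decay_rate ** episode)


for episode in (0, 50, 100, 500):
    print(episode, round(decayed_epsilon(episode), 3))
```

Keeping a small floor means the agent never stops exploring entirely, which can help it recover if its early estimates were misleading.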
The most common mistake here is assuming that randomness means failure. In reinforcement learning, controlled randomness is a tool. Another mistake is leaving exploration unchanged forever and then wondering why performance never stabilizes. A practical agent needs both curiosity and discipline. The art is not choosing one over the other, but balancing them so the AI can discover better moves and then reliably use them.
Reinforcement learning improves through repetition, not through a single perfect run. That is why we train over many practice episodes. An episode is one complete attempt, such as starting at the initial state of a gridworld, taking actions until the goal is reached or a step limit is hit, then resetting and trying again. Each episode gives the agent another chance to collect experience and update its value table.
When you run many episodes, learning patterns begin to emerge. At first, rewards may be low and paths may be inefficient. Then the agent starts stumbling into better actions more often. Later, if the reward setup and update rule are sensible, the behavior becomes more consistent. This progression is hard to notice in one episode but easy to see across dozens or hundreds.
The practical workflow is straightforward. Reset the environment. Let the agent interact step by step. After every action, record the reward and update the value table. Continue until the episode ends. Then store summary information such as total reward, number of steps, and whether the goal was reached. Repeat. Those summaries are important because they let you evaluate whether the agent is actually improving instead of merely changing.
Engineering judgement matters in deciding how many episodes to run and how long each episode should be. Too few episodes and the agent has not seen enough to learn. Episodes that are too long can waste time if the agent is stuck wandering. A sensible step limit prevents endless loops and keeps training efficient. You also need to keep reward design stable during a training run. If you keep changing the rules midway, the values in the table may become hard to interpret.
Common mistakes include stopping training after the first sign of improvement, ignoring average performance in favor of one lucky episode, and failing to reset the environment correctly between runs. A clean training loop is one of the most practical skills in reinforcement learning. It lets you watch learning happen over many rounds and gives you the evidence needed to improve the agent systematically.
Training an agent is only half the job. The other half is reading the results correctly. Beginner reinforcement learning projects often produce messy signals: one episode looks great, the next looks poor, and values keep shifting. That does not necessarily mean the system is failing. It means you need to evaluate progress with the right lens.
Start by looking at trends, not isolated episodes. Is the average reward increasing over time? Is the agent reaching the goal more often? Is the number of steps to success decreasing? These are practical indicators that the value table is becoming more useful. Also inspect the table directly. If successful actions in key states are getting higher values, that supports the idea that learning is happening for the right reason.
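A moving average is one straightforward way to look at trends instead of single episodes. In the sketch below, the helper name `moving_average` is hypothetical and the reward list is made-up data used only to show the calculation.

```python
def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    if window <= 0:
        raise ValueError("window must be positive")
    averages = []
    for end in range(window, len(values) + 1):
        chunk = values[end - window:end]
        averages.append(sum(chunk) / window)
    return averages


# Made-up per-episode rewards, for illustration only.
episode_rewards = [-1, -1, 1, -1, 1, 1, 1, 1]
print(moving_average(episode_rewards, window=4))
# [-0.5, 0.0, 0.5, 0.5, 1.0]
```

A rising sequence like this one suggests improvement, while any single episode in the raw list tells you very little.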
It also helps to compare behavior during training and behavior during evaluation. During training, exploration introduces randomness, so performance may look uneven. During evaluation, you can reduce or remove exploration and let the agent mostly exploit what it has learned. If results improve under evaluation conditions, that is a good sign that the table contains useful knowledge rather than accidental luck.
Engineering judgement is especially important when results are disappointing. If progress is flat, ask structured questions. Are rewards too sparse, so the agent rarely receives useful feedback? Is exploration too high or too low? Is the learning rate causing instability or stagnation? Are the states defined clearly enough for the value table to distinguish good situations from bad ones? These questions help you debug methodically instead of guessing.
A common mistake is to declare success because the agent solved the task once. Another is to declare failure because improvement is noisy. Reinforcement learning usually sits between those extremes. Real progress looks like a gradual shift in average outcomes and increasingly sensible value estimates. If you can read those signs, you can test, improve, and evaluate a beginner-friendly learning agent with confidence. That ability is what turns simple trial and error into disciplined reinforcement learning practice.
1. Why does a reinforcement learning agent need memory in this chapter’s approach?
2. What is the main purpose of a value table?
3. What does exploration versus exploitation mean?
4. Why does the chapter emphasize learning over many episodes?
5. If an agent’s stored values for helpful actions rise over time, what is the best interpretation?
In the earlier parts of this course, you learned the core idea of reinforcement learning: an agent tries actions, sees what happens, receives rewards, and slowly improves through experience. In this chapter, we turn that idea into code. This is an important step, because reinforcement learning can sound abstract until you see the loop running line by line. Once you understand the loop, the rest of beginner-friendly reinforcement learning becomes much easier to read, debug, and extend.
The main goal of this chapter is to build a tiny learning system using Q-learning. Q-learning is a classic reinforcement learning method that stores a score for each state-action pair. That score estimates how useful an action is in a given situation. If an action tends to lead to better rewards later, its score goes up. If it tends to lead to poor results, its score stays low or drops relative to better alternatives. The agent does not need to be told the best action ahead of time. It improves by trial and error.
Think of this like learning to move through a small maze. At each step, the agent is in one state, chooses one action, moves to a new state, and receives some reward. The code keeps repeating this process. Over time, the agent fills in a table of what seems promising. That table becomes its memory. In this chapter, you will learn how to express that memory in a simple data structure, how to choose actions during learning, how to update stored values after each move, and how to run the full training process from start to finish.
There is also an engineering mindset to develop here. A reinforcement learning algorithm is not just a formula. It is a workflow. You need to initialize values clearly, handle randomness carefully, update values in the right order, and reset the environment at the start of each episode. Small implementation mistakes can make the agent appear broken even when the idea is correct. A strong beginner learns to reason through the learning cycle step by step: what state am I in, what action did I choose, what reward did I get, what next state did I reach, and how should that change my stored estimate?
By the end of this chapter, you should be able to read a basic Q-learning loop and understand what each part is doing. You should also be able to build a small training script that runs many episodes, updates the Q-table after each action, and then checks whether the learned behavior is improving. This is one of the most useful milestones in a first reinforcement learning project, because now the system is no longer only a concept. It is a working learner.
Keep your focus on clarity rather than cleverness. A tiny, readable learning algorithm is more valuable at this stage than a more advanced version that hides the main idea. Once you can trace the loop with confidence, you will be ready to improve it in later chapters.
Practice note for this chapter's three lessons (turn the learning idea into simple step-by-step code, build a basic Q-learning loop, and store and update values after each action): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
The heart of reinforcement learning is the learning loop. No matter how large or advanced a system becomes, the basic cycle remains recognizable: observe the current state, choose an action, apply that action to the environment, receive a reward, move to the next state, and update what the agent has learned. Then repeat. For a beginner, it helps to see this not as a mathematical object first, but as a practical program loop.
A very small version looks like this in plain language. First, reset the environment so the agent starts fresh. Next, keep running steps until the episode ends. At each step, ask the agent to choose an action for the current state. Send that action to the environment. The environment returns a next state, a reward, and a signal saying whether the episode is finished. Then update the agent's stored value for the action it just took. Finally, move the current state pointer to the next state and continue.
This loop matters because it captures both acting and learning. A common beginner mistake is to think the agent first learns somehow and then acts. In reinforcement learning, the agent learns while acting. Experience is the training data. Every action creates the next example the agent learns from. That is why correct ordering matters. If you update the wrong state, forget to move into the next state, or fail to reset at the start of a new episode, your code may run but your agent will not learn the intended behavior.
From an engineering point of view, keep the loop explicit and readable. Use variable names such as state, action, reward, next_state, and done. These names make the logic easy to inspect. If learning fails, you can print these values and follow the flow. In beginner projects, transparency beats compact code.
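To make the loop concrete, here is a hedged sketch that runs one episode and logs the five variables named above. The `LineWorld` class is a hypothetical stand-in for the small line environment from earlier in the course, and `run_logged_episode` is an illustrative name; no learning happens yet, the agent simply picks random actions.

```python
import random


class LineWorld:
    """Stand-in environment: positions 0..4, start at 2 (illustrative)."""

    def reset(self):
        self.position = 2
        return self.position

    def step(self, action):
        # action 0 = left, action 1 = right (an assumed encoding)
        self.position += 1 if action == 1 else -1
        if self.position == 4:
            return self.position, 1.0, True
        if self.position == 0:
            return self.position, -1.0, True
        return self.position, 0.0, False


def run_logged_episode(env, choose_action, max_steps=20):
    """Run one episode and record every (state, action, reward, next_state, done)."""
    log = []
    state = env.reset()
    for _ in range(max_steps):
        action = choose_action(state)
        next_state, reward, done = env.step(action)
        # A learning agent would update its stored value for (state, action) here.
        log.append((state, action, reward, next_state, done))
        state = next_state  # move on only after the step has been recorded
        if done:
            break
    return log


random.seed(0)
for state, action, reward, next_state, done in run_logged_episode(
    LineWorld(), lambda s: random.choice([0, 1])
):
    print(f"state={state} action={action} reward={reward} "
          f"next_state={next_state} done={done}")
```

Reading the printed lines from top to bottom, each `next_state` should equal the `state` on the following line; if it does not, the loop is advancing in the wrong order.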
A practical mental model is this: each pass through the loop answers one question for the agent, which is, "Was that action in that situation better or worse than I expected?" The update step adjusts the expectation. Over many repetitions, the values become more accurate, and the agent begins choosing stronger actions more often.
The Q-table is the simplest useful memory structure for Q-learning. It stores a number for each combination of state and action. That number is the agent's current estimate of how good it is to take that action in that state. If your environment has a small number of states and actions, a table is easy to understand and easy to debug. This makes it perfect for a first learning algorithm.
Suppose your environment has 16 states and 4 possible actions. Then your Q-table can be a 16 by 4 grid of numbers. At the start of training, you usually fill it with zeros. That means the agent begins with no strong expectations. As it explores and receives rewards, some entries increase because they lead to better outcomes. Others remain low because they do not help much.
In code, this is often created with a two-dimensional array. Conceptually, q_table[state][action] gives the current estimated value. If you are using Python with NumPy, this setup is compact, but the main idea is independent of language. The key is that your state and action spaces must map cleanly to indexes. If the environment returns a state identifier, make sure it matches the row index you expect. If action numbers start at zero, your columns should follow the same order.
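As a small sketch, the 16 by 4 table from the example can be built with plain Python lists, which keeps the example free of dependencies. The variable names are illustrative; with NumPy, `np.zeros((n_states, n_actions))` would create an equivalent array.

```python
n_states, n_actions = 16, 4

# One row per state, one column per action, all starting at zero.
# The list comprehension creates 16 independent rows.
q_table = [[0.0] * n_actions for _ in range(n_states)]

print(len(q_table), len(q_table[0]))  # 16 4

# Pretend experience has raised the estimate for action 2 in state 5.
q_table[5][2] = 0.5
print(q_table[5])  # [0.0, 0.0, 0.5, 0.0]

# The greedy action for a state is the column with the largest value.
best = max(range(n_actions), key=lambda a: q_table[5][a])
print(best)  # 2
```

Writing `[[0.0] * n_actions] * n_states` instead would make every row the same list object, so one update would appear in all states; the comprehension avoids that trap.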
A common mistake is to misunderstand what the table stores. It does not store the reward from the last action only. It stores an estimate of long-term usefulness. That is why a move with a small immediate reward can still become valuable if it often leads to a much better future state. The table gradually captures this longer view through repeated updates.
Use engineering judgment here as well. A Q-table works well only when the number of states and actions is manageable. For toy grids, small games, and beginner environments, it is excellent. For very large or continuous environments, other methods are needed later. But for a first project, the Q-table gives you something concrete: you can print it, inspect it, and watch learning happen numerically.
Once the Q-table exists, the agent needs a rule for choosing actions. This is where reinforcement learning introduces one of its most important practical ideas: the balance between exploration and exploitation. Exploitation means choosing the action that currently looks best according to the Q-table. Exploration means sometimes trying something else, even if it does not look best yet. Without exploration, the agent may get stuck repeating early guesses. Without exploitation, it may wander forever and never use what it has learned.
The standard beginner method is epsilon-greedy action selection. With probability epsilon, the agent picks a random action. With probability 1 minus epsilon, it picks the action with the highest Q-value for the current state. This simple rule works well because it guarantees some experimentation while still allowing the agent to increasingly favor better actions.
For example, if epsilon is 0.2, then about 20 percent of the time the agent explores, and about 80 percent of the time it exploits what it knows so far. Early in training, a higher epsilon is often useful because the table contains almost no real knowledge yet. Later, epsilon is often reduced so the agent relies more on learned values. This gradual reduction is called epsilon decay.
One practical detail matters: when multiple actions have the same value, especially at the start when many values are zero, your code may always choose the first one unless you handle ties carefully. That can accidentally reduce exploration. Random tie-breaking or explicit epsilon-based exploration helps avoid this bias.
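A possible implementation of epsilon-greedy selection with random tie-breaking is sketched below. The function name `epsilon_greedy` and the optional `rng` parameter are assumptions for this example; `q_row` stands for the list of Q-values for the current state.

```python
import random


def epsilon_greedy(q_row, epsilon, rng=random):
    """Pick an action index for one state's row of Q-values."""
    if rng.random() < epsilon:
        # Explore: any action, chosen uniformly at random.
        return rng.randrange(len(q_row))
    # Exploit: choose among all actions that share the highest value,
    # so ties are not always resolved in favor of the first action.
    best_value = max(q_row)
    best_actions = [a for a, v in enumerate(q_row) if v == best_value]
    return rng.choice(best_actions)


print(epsilon_greedy([0.0, 0.9, 0.1, 0.2], epsilon=0.0))  # 1

rng = random.Random(0)
picks = {epsilon_greedy([0.0, 0.0, 0.0, 0.0], 0.0, rng) for _ in range(200)}
print(sorted(picks))  # all four actions appear, even with epsilon set to 0
```

Passing a seeded `random.Random` instance makes runs repeatable, which helps when you are comparing two versions of a training script.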
Another beginner mistake is assuming random action selection means failure. In fact, random moves are part of learning. The agent cannot discover better paths if it only repeats its current favorite action. Good engineering judgment means accepting short-term messiness for long-term improvement. During training, some episodes will look inefficient. That is normal. What matters is whether the action choices become better on average as the Q-table improves.
The update step is where learning actually happens. After the agent takes an action and sees the reward and next state, it adjusts the Q-value for the state-action pair it just used. The purpose is to bring the old estimate closer to a better estimate based on new experience. In Q-learning, that better estimate combines the immediate reward with the best expected future value from the next state.
The standard update rule is: the new value equals the old value plus the learning rate times the difference between the target and the old value. The target is the reward plus the discount factor times the maximum Q-value in the next state, taken over all actions available there. Written compactly, it is often shown as Q(s,a) = Q(s,a) + alpha * (reward + gamma * max over a' of Q(next_state, a') - Q(s,a)). Even if the formula looks intimidating at first, its meaning is intuitive. If the result was better than expected, the value increases. If it was worse than expected, the value is corrected downward, which lets other actions become relatively stronger.
The learning rate, often called alpha, controls how quickly the table changes. A high alpha means the agent updates aggressively. A low alpha means it changes slowly and averages more experience. The discount factor, gamma, controls how much future rewards matter. A gamma near zero makes the agent short-sighted. A gamma closer to one makes it care more about future outcomes.
Common implementation mistakes happen here. One is using the wrong next state when computing the target. Another is forgetting that terminal states need special handling. If an episode ends, there may be no future value to add, so the target may just be the final reward. A third mistake is updating the table before capturing the values returned by the environment step. Keep the data flow clean and ordered.
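The update and its terminal-state special case can be sketched in a few lines. This is an illustration under assumptions: the Q-table is a list of rows indexed by integer state, done is a flag supplied by the environment, and the function name q_update and the default values for alpha and gamma are made up for the example.

```python
def q_update(q, state, action, reward, next_state, done, alpha=0.1, gamma=0.9):
    """Apply one Q-learning update in place and return the new value."""
    # At the end of an episode there is no future value to add.
    future = 0.0 if done else max(q[next_state])
    target = reward + gamma * future
    # Move the old estimate a fraction alpha of the way toward the target.
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


q = [[0.0, 0.0], [1.0, 2.0]]  # two states, two actions, made-up values
print(q_update(q, state=0, action=1, reward=1.0, next_state=1, done=False, alpha=0.5))
```

In the usage line the target is 1.0 + 0.9 * 2.0 = 2.8, and with alpha set to 0.5 the old value of 0.0 moves halfway there, to 1.4.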
This section is where theory becomes practical. Every action produces a tiny correction. One correction does almost nothing. Thousands of corrections produce learning. If you remember that the update is just repeated expectation adjustment, Q-learning becomes much less mysterious.
One episode is rarely enough for meaningful learning. The agent needs many chances to try actions, make mistakes, and revise its Q-values. That is why training is organized across episodes. An episode begins with an environment reset and ends when the task reaches a terminal condition, such as reaching a goal, falling into a failure state, or hitting a maximum number of steps.
The outer loop of your program runs for a chosen number of episodes. Inside that outer loop sits the step-by-step learning loop. This structure is simple but powerful. Each episode gives the agent another opportunity to improve its estimates. Over time, the Q-table reflects patterns that repeat across many runs rather than the noise of one lucky or unlucky sequence.
In practice, you should track useful metrics while training. A beginner-friendly choice is total reward per episode. Another is number of steps until the episode ends. If the task is to reach a goal efficiently, you might hope to see rewards increase and steps decrease over time. Logging these values helps you tell whether learning is happening or whether something in the algorithm needs fixing.
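The outer and inner loops, together with the two metrics mentioned above, might look like the sketch below. The environment is a hypothetical five-cell corridor invented for this example (start at cell 0, goal at cell 4, a cost of -1 per step, +10 at the goal), and all hyperparameter values are arbitrary starting points rather than recommendations.

```python
import random

N_STATES, GOAL, MAX_STEPS = 5, 4, 50  # hypothetical corridor: cells 0..4


def step(state, action):
    """Move left (action 0) or right (action 1); return (next_state, reward, done)."""
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    done = next_state == GOAL
    return next_state, (10.0 if done else -1.0), done


def train(episodes=300, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    rewards, steps = [], []
    for _ in range(episodes):  # outer loop: one pass per episode
        state, total, done, t = 0, 0.0, False, 0  # reset at the start of each episode
        while not done and t < MAX_STEPS:  # inner loop: one pass per step
            if rng.random() < epsilon:
                action = rng.randrange(2)
            else:
                best = max(q[state])
                action = rng.choice([a for a in range(2) if q[state][a] == best])
            next_state, reward, done = step(state, action)
            future = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * future - q[state][action])
            state, total, t = next_state, total + reward, t + 1
        rewards.append(total)  # metric 1: total reward per episode
        steps.append(t)        # metric 2: steps until the episode ended
    return q, rewards, steps


q, rewards, steps = train()
print("mean steps, first 10 episodes:", sum(steps[:10]) / 10)
print("mean steps, last 10 episodes:", sum(steps[-10:]) / 10)
```

Comparing the two printed averages is a quick first check: in a task like this one, the later episodes would be expected to finish in fewer steps.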
This is also where hyperparameters matter in a practical sense. If epsilon never decreases, the agent may keep behaving too randomly. If alpha is too large, learning may become unstable. If gamma is too low, the agent may ignore important future rewards, and if it is too high, the agent may overvalue distant possibilities. You do not need perfect settings at the start, but you should expect to test and adjust them.
A useful engineering habit is to keep training code deterministic when possible by setting random seeds. This does not remove all variability, but it makes debugging much easier. If the same code gives wildly different outcomes every run, it is hard to tell whether a change helped. Reproducibility is not just for large research projects. It is extremely valuable even in tiny beginner experiments.
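One simple way to apply this habit in Python is to give each run its own seeded generator from the standard random module. The function name sample_actions below is a stand-in for "any code that makes random choices during training" and is purely illustrative.

```python
import random


def sample_actions(seed, count=10, n_actions=4):
    """Draw a sequence of random actions from a generator seeded once per run."""
    rng = random.Random(seed)  # private generator, independent of other code
    return [rng.randrange(n_actions) for _ in range(count)]


print(sample_actions(seed=42) == sample_actions(seed=42))  # same seed, same sequence
```

Passing the generator into your training function, rather than relying on the shared global one, keeps results repeatable even if other parts of the program also draw random numbers.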
After training, you need a simple way to inspect what the agent has learned. Without evaluation, a learning algorithm is just a loop that produced numbers. The first thing to view is the Q-table itself. In small environments, printing the table can reveal clear patterns. States near the goal may show larger values for actions that move toward success. Bad states may show lower values or highlight escape actions.
Another simple evaluation method is to run the agent without exploration. Set epsilon to zero, reset the environment, and let the agent always choose the highest-valued action. Then watch the path it takes and record the total reward. This gives you a clean picture of the policy implied by the learned Q-table. If the training worked, the behavior should usually look more purposeful than it did at the beginning.
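A greedy evaluation run can be sketched as follows. The corridor environment and the Q-values are made up for illustration (they stand in for a table produced by training), and greedy_rollout is a hypothetical helper name.

```python
N_STATES, GOAL = 5, 4  # hypothetical corridor: cells 0..4, goal at cell 4


def step(state, action):
    """Move left (action 0) or right (action 1); return (next_state, reward, done)."""
    next_state = max(0, min(N_STATES - 1, state + (1 if action == 1 else -1)))
    done = next_state == GOAL
    return next_state, (10.0 if done else -1.0), done


def greedy_rollout(q, start_state=0, max_steps=50):
    """Follow the highest-valued action with no exploration; return (path, total_reward)."""
    state, total, path = start_state, 0.0, [start_state]
    for _ in range(max_steps):
        action = max(range(len(q[state])), key=lambda a: q[state][a])
        state, reward, done = step(state, action)
        path.append(state)
        total += reward
        if done:
            break
    return path, total


# Made-up values standing in for a trained table: "right" (index 1) is best everywhere.
q = [[3.1, 4.6], [3.1, 6.2], [4.6, 8.0], [6.2, 10.0], [0.0, 0.0]]
path, total = greedy_rollout(q)
print(path, total)  # [0, 1, 2, 3, 4] 7.0
```

The max_steps cap is a safety measure: a poorly trained table can send a purely greedy agent back and forth between two cells forever.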
You can also summarize results numerically. Average reward over the last 100 episodes is a helpful signal. So is success rate if the environment has a clear win condition. For a very small project, even a printed list of rewards across episodes can tell a story. You do not need advanced charts to begin reasoning about performance.
Be careful not to overinterpret one good run. Reinforcement learning often includes randomness, so evaluate across several episodes. If the agent succeeds once but fails often, the learning is not yet reliable. Also remember that a non-perfect policy can still represent real progress. In beginner environments, the practical goal is not elegance. It is demonstrating that the agent improved through trial and error.
The most important outcome of this chapter is not only that you can run a full agent from start to finish. It is that you can explain what each part is doing: how the table stores knowledge, how the action rule balances exploration and exploitation, how rewards drive updates, and how repeated episodes turn isolated experiences into useful behavior. That ability to read, reason about, test, and improve a simple reinforcement learning agent is the foundation for everything that comes next.
1. What is the main purpose of Q-learning in this chapter?
2. Which sequence best matches the reinforcement learning loop described in the chapter?
3. Why does the chapter describe the Q-table as the agent's memory?
4. According to the chapter, what is an important implementation detail at the start of each episode?
5. What learning goal should you reach by the end of this chapter?
By this point, you have a small reinforcement learning agent that can take actions, receive rewards, and slowly improve through trial and error. That is a major step. But building a working agent is only the beginning. In practice, beginner agents often learn too slowly, behave inconsistently, or appear to improve for a short time and then get worse. This chapter is about recognizing those patterns and fixing them with simple, practical changes rather than complex theory.
Reinforcement learning can feel unpredictable because the agent is not following a fixed set of instructions. It is discovering behavior from experience. That means small choices in reward design, training settings, and evaluation can lead to very different outcomes. If your agent is wandering randomly for too long, repeating bad actions, or showing unstable performance from episode to episode, those are not signs that reinforcement learning is broken. They are signs that your setup needs better guidance.
A useful way to think about improvement is to separate the problem into four parts. First, make sure the rewards encourage the behavior you actually want. Second, choose simple training settings that give learning a fair chance to work. Third, measure progress with numbers instead of guesses. Fourth, remove common sources of noise and waste so the agent becomes more reliable and efficient. These are engineering decisions as much as they are learning decisions.
As you work through this chapter, keep a practical mindset. Change one thing at a time. Record the effect. If performance improves, keep the change. If it does not, undo it and try another. Beginners often make many changes at once, then cannot tell which one helped. Good reinforcement learning workflow is less about clever tricks and more about disciplined testing.
We will look at reward design, training settings, metrics, beginner mistakes, efficiency improvements, and the very important question of when to stop tuning. In a simple project, the goal is not perfect behavior. The goal is steady, explainable improvement that you can measure and trust. If you can tell whether the agent is truly getting better, and you can make it better with deliberate adjustments, then you are already thinking like a reinforcement learning practitioner.
Remember the overall picture: the agent interacts with an environment, takes actions, receives rewards, and updates its choices based on what happened. When results are poor, one of those links is usually weak. Either the reward is sending the wrong message, the settings are making learning unstable, or the evaluation is too shallow to reveal what is really happening. This chapter helps you diagnose those weak links and improve them without adding unnecessary complexity.
Practice note for Spot when learning is too slow or unstable: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Adjust rewards and settings for better behavior: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Measure whether the AI is truly improving: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Make the agent more reliable and efficient: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Reward design is one of the strongest tools you have, and one of the easiest ways to accidentally create bad behavior. In reinforcement learning, the agent does not understand your intention. It only learns from the reward signal you provide. If you reward the wrong thing, even slightly, the agent may become very good at behavior you did not want.
Imagine a grid world where the goal is to reach an exit. If you give a reward only when the exit is reached and nothing else during the episode, learning may be very slow because the agent gets useful feedback only rarely. On the other hand, if you give a small negative reward for every step, the agent has a reason to finish sooner. That often improves learning because the reward gives more frequent direction. But if the step penalty is too large, the agent may prefer ending quickly in a bad state rather than exploring long enough to find the goal. This is why reward design can both help and hurt.
A practical rule is to start with the simplest reward that matches the task. Reward success clearly. Penalize obvious waste, such as unnecessary extra steps, but keep penalties modest. If you add many reward terms too early, you may create confusing signals. For a beginner agent, clear and consistent rewards usually work better than clever reward shaping.
Watch for reward loopholes. A loophole happens when the agent finds a high-reward pattern that does not match the real objective. For example, if you reward movement without caring where it leads, the agent may learn to move in circles forever. If you reward collecting points but not finishing the episode, it may delay completion just to gather more easy rewards. When behavior looks odd, ask a direct question: is the agent failing to learn, or is it learning exactly what my reward system asked for?
Engineering judgment matters here. If learning is too slow, consider more frequent feedback. If behavior becomes unnatural, simplify the reward. The best beginner reward systems are understandable enough that you can explain why each reward exists and what behavior it should encourage.
Many beginner problems come from training settings rather than from the core idea of reinforcement learning. A small agent can still fail if the learning rate is too high, exploration stays random for too long, or episodes are too short for the task. You do not need advanced tuning. You need sensible defaults and a simple workflow.
Start by choosing a moderate learning rate. If it is too high, the agent may overreact to new experiences and become unstable. You may see performance jump up and down instead of gradually improving. If the learning rate is too low, the agent may technically learn but so slowly that progress is hard to notice. For a tiny project, prefer a value that changes estimates gradually rather than aggressively.
Next, think about exploration. Early in training, the agent should try different actions so it can discover useful behavior. Later, it should rely more on what it has learned. A common beginner strategy is to begin with a higher exploration rate and slowly reduce it over time. If you never reduce exploration, your agent may keep making random mistakes even after it knows better. If you reduce exploration too quickly, it may settle too early on a weak strategy.
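One common way to reduce exploration gradually is an exponential schedule with a floor, sketched below. The starting value, floor, and decay factor are arbitrary illustrative numbers, and epsilon_at is a hypothetical helper name.

```python
def epsilon_at(episode, start=1.0, floor=0.05, decay=0.995):
    """Exploration rate for a given episode: exponential decay, never below the floor."""
    return max(floor, start * decay ** episode)


for episode in (0, 100, 500, 1000):
    print(episode, round(epsilon_at(episode), 3))
```

A decay factor closer to 1 keeps the agent exploring for longer, while the floor guarantees that a small amount of exploration always remains during training.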
Episode length also matters. If the task requires 20 steps to reach the goal but your episodes stop after 10 steps, the agent never gets a fair chance to succeed. On the other hand, extremely long episodes can waste time if the agent is trapped in unproductive behavior. Set a maximum number of steps that is long enough for success but not so long that learning becomes inefficient.
A clean beginner workflow looks like this: choose one baseline setting for learning rate, one exploration schedule, and one episode limit. Run training. Record results. Then adjust one setting at a time. This avoids confusion and helps you understand cause and effect.
Simple settings often beat complicated ones because they are easier to reason about. Your goal is not to find mathematically perfect values. Your goal is to choose values that make the learning process understandable, stable, and easy to improve step by step.
If you only watch one or two episodes, reinforcement learning can look misleading. A lucky episode may make the agent seem better than it is. A random bad episode may hide real progress. This is why measurement matters. To know whether the AI is truly improving, track simple metrics over many episodes.
The first useful metric is win rate, or success rate. In a task with a clear goal, count how often the agent reaches that goal. This tells you whether it is solving the task more consistently over time. The second metric is the number of steps per episode. If success is increasing and steps are decreasing, the agent is likely becoming more efficient. In many environments, a good agent not only wins more often but also wastes fewer actions.
The third metric is average reward. This is often the main training signal, but it must be interpreted carefully. A rising average reward usually suggests improvement, but not always. If your reward design has loopholes, average reward can increase while true task performance stays weak. That is why average reward should be combined with direct task measures like wins and step count.
Use moving averages to smooth noisy results. Instead of looking at each episode alone, compute the average over the last 20, 50, or 100 episodes, depending on your project size. Smoothed curves make trends easier to see. This helps you spot whether learning is rising steadily, flattening out, or becoming unstable.
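A moving average and a success rate need only the standard library. The sketch below assumes you already store one number per episode in a list; the function names are illustrative.

```python
def moving_average(values, window=100):
    """Average of the most recent `window` values at each position."""
    smoothed, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]  # drop the value that left the window
        smoothed.append(running / min(i + 1, window))
    return smoothed


def success_rate(outcomes):
    """Fraction of episodes that reached the goal (outcomes is a list of booleans)."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


print(moving_average([1, 2, 3, 4], window=2))    # [1.0, 1.5, 2.5, 3.5]
print(success_rate([True, False, True, True]))   # 0.75
```

For the first few episodes the window is not yet full, so the function averages over however many values exist so far.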
Keep training and evaluation separate when possible. During training, exploration adds randomness. During evaluation, reduce or remove exploration so you can measure what the agent has actually learned. This distinction is important. An agent may perform poorly during training because it is still exploring, while evaluation reveals that its learned policy is already decent.
When these metrics agree, you can trust the direction of progress. When they disagree, investigate. For example, if average reward rises but success rate does not, revisit the reward design. Good RL engineering depends on measurement that is simple, repeatable, and tied to the real objective.
Most beginner reinforcement learning problems are not mysterious. They come from a small set of repeat mistakes. Learning to recognize them quickly will save you a lot of time. One common mistake is assuming that more training always fixes everything. If the reward is wrong or the settings are unstable, extra training may only reinforce poor behavior. Before increasing episode count, check whether the setup makes sense.
Another common mistake is changing too many things at once. A beginner may modify rewards, learning rate, exploration, and episode length in a single experiment. Then if results improve or worsen, there is no clear reason why. A better approach is controlled testing: change one variable, run the agent, compare the metrics, and write down what happened.
A third mistake is judging performance from raw episode behavior alone. Because RL includes randomness, short-term results can be noisy. You need repeated runs or averaged results. Otherwise, you may call something a bug when it is just normal variance.
Code-level mistakes also matter. A frequent one is updating values with the wrong state or action. Another is forgetting to reset the environment correctly between episodes. Some agents keep learning from stale state information or carry over variables that should start fresh. These bugs create confusing behavior that looks like a learning problem but is really a program logic problem.
There is also the mistake of exploring forever. If the exploration rate stays too high, the agent may never look reliable even if it has learned a decent policy. Reduce exploration gradually and test the learned policy separately.
The fix for most beginner issues is disciplined troubleshooting. Observe the agent. Print or log key values. Run small test cases. If behavior is strange, ask whether the problem comes from rewards, settings, metrics, or code. This process makes reinforcement learning feel far less mysterious and far more manageable.
Improving efficiency does not always mean using a more advanced algorithm. Very often, you can make a beginner agent learn faster by making the training experience cleaner. The first method is to simplify the environment where possible. If the task contains extra distractions, large state spaces, or unnecessary actions, the agent has more combinations to explore. Reducing that complexity lets it find useful behavior sooner.
Another practical method is to give the agent better feedback density. As discussed earlier, sparse rewards can slow learning because success is too rare. Small, sensible shaping rewards can shorten the path to useful behavior. For example, rewarding progress toward a goal can help, as long as the reward still supports the real objective and does not create loopholes.
You can also improve speed by using smarter training schedules. Train for a fixed number of episodes, evaluate, and stop early if performance has clearly stabilized at an acceptable level. This prevents wasting time on runs that are no longer improving. Likewise, if metrics are flat and poor for a long time, stop and revise the setup instead of hoping for a miracle.
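The train, evaluate, and decide cycle described above can be reduced to a small rule. This is only a sketch: the target of 0.9, the patience of three evaluations, and the tolerance used to call results "flat" are assumed values you would choose for your own project, and training_decision is a hypothetical name.

```python
def training_decision(eval_success_rates, target=0.9, patience=3, tolerance=0.02):
    """Return 'stop: good enough', 'stop: revise setup', or 'continue'."""
    recent = eval_success_rates[-patience:]
    if len(recent) < patience:
        return "continue"            # not enough evaluations to judge yet
    if all(rate >= target for rate in recent):
        return "stop: good enough"   # stable at an acceptable level
    if max(recent) - min(recent) <= tolerance:
        return "stop: revise setup"  # flat and still below target
    return "continue"                # still changing, keep training


print(training_decision([0.2, 0.5, 0.91, 0.93, 0.95]))
print(training_decision([0.10, 0.10, 0.11]))
```

Each entry in the list is the success rate from one evaluation round, measured with exploration turned off.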
Efficiency also comes from better observation. Save metrics every episode, and inspect sample episodes at intervals. This helps you catch failure modes earlier. If the agent is repeatedly getting stuck in the same pattern, fixing that issue quickly is much faster than waiting through a long training run.
Finally, use reproducible experiments when possible. Set a random seed for testing so you can compare changes fairly. This does not remove randomness completely, but it reduces confusion and makes your experiments easier to repeat.
Fast learning is not just about speed. It is about reaching useful, stable behavior with less wasted effort. For a beginner project, the smartest optimization is usually clearer design, not more complicated code.
One of the hardest judgment calls in reinforcement learning is deciding when to stop improving the agent. Beginners often continue tuning forever because the agent is not perfect. But in real projects, perfect is rarely the target. The better question is whether the agent is good enough for the goal you set.
Start by defining what success means in concrete terms. That could be reaching the goal in at least 90% of evaluation episodes, finishing within a certain number of steps, or maintaining stable performance across different random starts. Without a target, you cannot know whether more tuning is necessary or just habit.
Good enough also means reliable, not just occasionally impressive. If your agent has one excellent episode and five poor ones, it is not ready. You want performance that holds across repeated evaluation runs. This is why average success rate and average steps matter more than isolated highlights.
There is also a cost-benefit decision. Suppose your agent already solves the task 92% of the time, and another week of tuning might push it to 95%. For a tiny educational project, that extra improvement may not be worth the added complexity. If the next gains require making the code harder to understand, harder to debug, or harder to explain, stopping may be the smarter engineering choice.
A practical stopping checklist is helpful. Ask these questions: Does the agent meet the task goal consistently? Are the metrics stable over time? Does the learned behavior match the intended behavior? Can you explain why it works? If the answer is yes to those questions, your agent is probably good enough for this stage.
This mindset matters because reinforcement learning is not only about maximizing numbers. It is about building systems you can reason about, evaluate, and trust. A simple agent that performs reliably and teaches you how learning works is often more valuable than a slightly stronger agent that becomes difficult to understand. For your first reinforcement learning AI, good enough means measurable progress, sensible behavior, and a setup you can confidently explain.
1. According to the chapter, what is the best first response when an agent learns too slowly or behaves inconsistently?
2. Which approach does the chapter recommend when trying to improve results?
3. Why does the chapter stress measuring progress with numbers instead of guesses?
4. What is one of the four practical areas the chapter suggests focusing on to improve an agent?
5. If results are poor, what does the chapter say is usually true?
You have reached the most satisfying part of this course: turning the ideas of reinforcement learning into a complete small project that you can actually show to someone else. Up to this point, you have learned the core pieces in simple terms: an agent makes choices, an environment responds, and rewards push learning in better directions over time. Now the goal is to connect those parts into one beginner-friendly project from start to finish. This is where reinforcement learning stops feeling like a collection of terms and starts feeling like a practical workflow.
A good final project for a first reinforcement learning course is not large or flashy. It is small enough to understand, fast enough to train, and clear enough to explain. The best beginner projects often use a tiny grid world, a short path-finding task, or a simple game with a few legal actions. In each case, the value comes from watching the agent improve through trial and error. That improvement is the story you will share: what the agent was trying to do, how you designed the rewards, how training changed behavior, and how you checked whether the learned behavior was actually better.
In this chapter, you will build that story in a structured way. First, you will choose a final mini project with a clear goal. Then you will define the environment rules and shape rewards carefully so the agent has a useful signal to learn from. Next, you will train and test the final learning agent, paying attention to the difference between learning progress and actual performance. After that, you will evaluate the result and think about how to demonstrate it clearly to others. Finally, you will practice explaining what your AI learned in plain language and identify realistic next steps beyond this course.
As you work through this chapter, remember an important piece of engineering judgment: simple projects are not inferior projects. In fact, simplicity is often what makes a reinforcement learning system understandable. A tiny environment lets you inspect state transitions, reward choices, and policy behavior directly. That means you can debug mistakes, spot reward problems, and explain outcomes with confidence. If a project is too complex, beginners often cannot tell whether poor results come from the algorithm, the reward design, the training time, or the environment itself.
By the end of this chapter, you should be able to say: “I built a small reinforcement learning agent, trained it on a defined task, tested its behavior, and can explain why it improved.” That is a meaningful milestone. It means you can move from reading simple reinforcement learning code to shaping your own experiments and evaluating them like a careful builder rather than a passive observer.
Think of this chapter as your first complete reinforcement learning delivery. It is not just about getting code to run. It is about making decisions, checking results, and communicating what happened. Those are the habits that turn a toy experiment into a real learning project, even at a beginner level.
Practice note for Build a complete beginner project from idea to result: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Train and test your final learning agent: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Your final mini project should be something you can finish, understand, and share. For a first reinforcement learning project, that usually means choosing a tiny environment with a small number of states and actions. A grid navigation task is one of the best options. For example, imagine a 5-by-5 board where the agent starts in one corner and must reach a goal square while avoiding a few penalty squares. The actions are simple: up, down, left, and right. The rewards are also easy to understand: a positive reward for reaching the goal, a negative reward for stepping into a bad square, and perhaps a small step cost so the agent prefers shorter paths.
The reason this kind of project works well is that every part of the system is visible. You can describe the environment in one sentence, list the allowed actions, and explain success in ordinary language. That matters because a project is only shareable if another person can understand it quickly. If your project needs too much setup or too many special rules, the learning story gets buried under details.
When choosing your project, use three practical tests. First, can the environment be described clearly? Second, can the agent receive feedback often enough to learn? Third, can you tell whether performance improved? If the answer to any of these is no, simplify the task. Many beginners choose a project that is too ambitious, such as a large game or a continuous control problem. Those can be exciting later, but for now they make it hard to separate algorithm behavior from project complexity.
A strong mini project often has a visible before-and-after difference. At the beginning, the agent wanders randomly or behaves inefficiently. After training, it reaches the goal more reliably and wastes fewer steps. That visible improvement is what makes the project satisfying. It also gives you a clean result to demonstrate. If someone asks what your AI does, you can show a short run before training and a short run after training and let the contrast speak for itself.
Good engineering judgment here means choosing a project that teaches the lesson you want. If your goal is to understand trial-and-error learning, pick a project where repeated episodes matter. If your goal is to understand reward design, choose a project where different reward settings create noticeably different behavior. Your final project does not need to be impressive in size. It needs to be honest, understandable, and complete.
Once you have chosen the mini project, the next step is to define the task so the agent can actually learn. This means being precise about the goal, the environment rules, the available actions, and the reward system. In reinforcement learning, unclear rules produce unclear behavior. The agent is not reading your intention. It is only responding to the signals you build into the environment.
Start by writing the goal in one plain sentence. For example: “The agent should reach the target square in as few moves as possible while avoiding trap squares.” That one sentence already implies several design choices. The agent needs a positive reward for success, some negative reward for traps, and a reason not to wander forever. A common pattern is to assign a large positive reward for reaching the goal, a moderate negative reward for stepping into a trap, and a small negative reward for each step. That small step cost is useful because it encourages efficient paths instead of endless movement.
You also need to define episode-ending conditions. Does the episode stop when the goal is reached? When a trap is hit? After a maximum number of steps? In beginner projects, adding a maximum step limit is a smart choice because it prevents training from stalling on episodes where the agent keeps moving without progress. These practical boundaries make training more stable and easier to analyze.
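The task definition above can be written as a small environment class. Everything specific here is an assumption made for illustration: the 5-by-5 size, the trap positions, the reward values (+10, -5, -0.1), the choice to end the episode when a trap is hit, and the class name GridWorld.

```python
class GridWorld:
    """Tiny grid task. The layout and reward numbers are illustrative assumptions."""

    # action index -> (row change, column change): up, down, left, right
    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

    def __init__(self, size=5, start=(0, 0), goal=(4, 4),
                 traps=((1, 3), (3, 1)), max_steps=100):
        self.size, self.start, self.goal = size, start, goal
        self.traps, self.max_steps = set(traps), max_steps
        self.reset()

    def reset(self):
        """Begin a new episode and return the starting state."""
        self.pos, self.steps = self.start, 0
        return self.pos

    def step(self, action):
        """Apply one move; return (next_state, reward, done)."""
        d_row, d_col = self.MOVES[action]
        row = min(max(self.pos[0] + d_row, 0), self.size - 1)  # walls block movement
        col = min(max(self.pos[1] + d_col, 0), self.size - 1)
        self.pos = (row, col)
        self.steps += 1
        if self.pos == self.goal:
            return self.pos, 10.0, True   # large positive reward for success
        if self.pos in self.traps:
            return self.pos, -5.0, True   # moderate penalty; the trap ends the episode
        # Small step cost; the step limit ends episodes that make no progress.
        return self.pos, -0.1, self.steps >= self.max_steps


env = GridWorld()
state = env.reset()
for action in [1, 1, 1, 1, 3, 3, 3, 3]:  # four moves down, then four moves right
    state, reward, done = env.step(action)
print(state, reward, done)  # (4, 4) 10.0 True
```

Keeping the three ending conditions (goal, trap, step limit) in one place makes it easier to check that every episode actually terminates.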
Reward design requires judgment. If the goal reward is too small, the agent may not care enough about success. If the step penalty is too strong, the agent may learn rushed or strange behavior, such as ending the episode in a bad state rather than paying for a longer safe path. If trap penalties are weak, the agent may cut through dangerous spaces too often. Beginners sometimes assume reward settings are just small details. In reality, rewards define what “good behavior” means. A badly designed reward system can produce behavior that technically maximizes reward but does not match your real intent.
A helpful habit is to think through a few example paths manually. Ask yourself: if the agent takes a short safe path, what total reward does it earn? What about a long wandering path? What about a risky shortcut? This rough arithmetic often reveals reward mistakes before you even train. It also helps you explain later why the agent preferred one behavior over another. Good reinforcement learning projects are easier to debug when the designer has already thought through likely outcomes instead of treating rewards as arbitrary numbers.
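That rough arithmetic is easy to script. The reward values below (-0.1 per step, -5 for a trap, +10 for the goal) and the three example paths are assumptions chosen for illustration, and path_return is a hypothetical helper name.

```python
def path_return(rewards, gamma=1.0):
    """Total (optionally discounted) reward collected along one path."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


STEP, TRAP, GOAL = -0.1, -5.0, 10.0  # assumed reward values

short_safe = [STEP] * 7 + [GOAL]      # 8 moves, the last one reaches the goal
long_wander = [STEP] * 29 + [GOAL]    # 30 moves, the last one reaches the goal
risky_shortcut = [STEP] * 3 + [TRAP]  # 4 moves, the last one hits a trap

for name, path in [("short safe", short_safe),
                   ("long wander", long_wander),
                   ("risky shortcut", risky_shortcut)]:
    print(name, round(path_return(path), 2))
```

With these numbers the short safe path earns about 9.3, the long wander about 7.1, and the shortcut that ends in a trap about -5.3, so the rewards rank the three paths in the order you would want.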
Training is where the project comes alive. Your agent now interacts with the environment across many episodes, collecting rewards and updating its estimates about which actions are useful in different states. In a beginner project, this often means using a simple table-based method such as Q-learning. The code may be short, but the process reflects the central idea of reinforcement learning: the agent improves through repeated trial and error rather than being told the exact correct action in every situation.
During training, it is important to allow exploration. If the agent always picks the current best-known action, it may never discover a better path. That is why many beginner agents use an exploration rule such as epsilon-greedy behavior, where the agent usually chooses the best-known action but sometimes picks a random action. Early in training, more exploration is often useful. Later, you may reduce exploration so the agent can make better use of what it has learned.
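An epsilon-greedy rule with gradually shrinking exploration might look like the sketch below. The starting value, floor, and decay rate are assumed numbers, and the function names are illustrative.

```python
import random


def choose_action(q_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-known one."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.995):
    """Shrink exploration a little each episode, but never below `end`."""
    return max(end, start * decay ** episode)


q_values = {"up": 0.0, "down": 0.2, "left": -1.0, "right": 1.5}

print(choose_action(q_values, epsilon=0.0))  # always the best-known action
print(choose_action(q_values, epsilon=1.0))  # always a random action
print(decayed_epsilon(0), decayed_epsilon(10_000))
```

Keeping a small floor on epsilon means the agent still tries an occasional new action late in training, which is one common choice rather than a requirement.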
Track more than one signal while training. Total reward per episode is useful, but it is not enough by itself. Also monitor whether the agent reaches the goal, how many steps each episode takes, and whether learning is becoming more stable over time. A project is easier to understand when you can say, for example, “At first the agent almost never reached the goal, but after 500 episodes it succeeded most of the time and used fewer steps.” That is a much clearer training story than simply saying the code ran.
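A simple way to track several signals at once is to store one small record per episode and summarize the most recent ones. In the sketch below, the record fields and the summarize helper are assumptions, and the four records are made-up placeholder values, not results from a real training run.

```python
def summarize(history, window=100):
    """Summarize the most recent `window` episode records."""
    recent = history[-window:]
    if not recent:
        raise ValueError("history is empty")
    count = len(recent)
    return {
        "episodes": count,
        "success_rate": sum(e["reached_goal"] for e in recent) / count,
        "avg_reward": sum(e["total_reward"] for e in recent) / count,
        "avg_steps": sum(e["steps"] for e in recent) / count,
    }


# Placeholder records for illustration only (not real training output).
history = [
    {"total_reward": -50, "steps": 50, "reached_goal": False},
    {"total_reward": -50, "steps": 50, "reached_goal": False},
    {"total_reward": 81, "steps": 20, "reached_goal": True},
    {"total_reward": 95, "steps": 6, "reached_goal": True},
]

print(summarize(history, window=4))  # all four episodes
print(summarize(history, window=2))  # only the two most recent
```

Comparing an early window with a late one gives you the raw material for the kind of training story described above.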
Expect imperfections. Some runs may look noisy. Performance may improve, then dip, then improve again. That does not always mean something is broken. Exploration causes randomness, and small environments can still produce uneven learning curves. The key is to look for overall trends rather than demanding smooth progress at every step. If the agent never improves at all, inspect your environment transitions, reward values, update rule, and episode endings. Many beginner bugs come from simple issues such as not resetting the environment correctly, updating the wrong state, or forgetting to stop an episode when the goal is reached.
When training is complete, do not immediately assume the learned policy is good. Training includes randomness and experimentation. To understand what the agent actually learned, you need to test it separately with reduced or zero exploration. That distinction between training behavior and testing behavior is part of practical reinforcement learning engineering, and it matters if you want your final result to be trustworthy.
Evaluation means asking a simple but important question: did the agent truly learn something useful? To answer that, run test episodes separately from training. In testing, remove or sharply reduce exploration so the agent mostly follows its learned policy. Then measure outcomes in a consistent way. For a small project, useful evaluation metrics include goal success rate, average steps to reach the goal, total reward per test episode, and the number of times traps or penalty states are visited.
A beginner-friendly evaluation does not need advanced statistics, but it should be disciplined. Test the agent over multiple runs, not just one lucky episode. If the environment has random starting positions, include that variation in your tests. If the project uses a fixed start, at least run enough episodes to make sure behavior is repeatable. The point is to separate true learned behavior from chance.
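A disciplined test run can be a short loop that always follows the best-known action and cycles through several starting positions. The sketch below uses a toy five-cell corridor and a hand-filled value table as stand-ins so that it runs on its own. The evaluate function, its parameters, and the corridor are illustrative assumptions, and in your project you would pass in your own environment and learned table.

```python
def greedy_action(q_values):
    """Return the best-known action, with no exploration."""
    return max(q_values, key=q_values.get)


def evaluate(q, step_fn, starts, episodes=20, max_steps=50):
    """Run greedy test episodes and report success rate and average steps."""
    successes = 0
    step_counts = []
    for episode in range(episodes):
        state = starts[episode % len(starts)]  # vary the start position
        steps = 0
        while steps < max_steps:
            state, done, reached_goal = step_fn(state, greedy_action(q[state]))
            steps += 1
            if done:
                successes += int(reached_goal)
                break
        step_counts.append(steps)
    return {
        "success_rate": successes / episodes,
        "avg_steps": sum(step_counts) / episodes,
    }


def corridor_step(state, action):
    """Toy stand-in environment: cells 0 to 4 in a row, goal at cell 4."""
    nxt = min(state + 1, 4) if action == "right" else max(state - 1, 0)
    return nxt, nxt == 4, nxt == 4


# Hand-filled table standing in for a learned one: "right" always looks best.
q = {cell: {"left": 0.0, "right": 1.0} for cell in range(5)}

result = evaluate(q, corridor_step, starts=[0, 1, 2], episodes=6)
print(result)
```

Because the table is filled by hand, this example only demonstrates the measuring procedure. It says nothing about how well a real agent would score.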
Demonstration is the shareable part of the project. The easiest way to present your work is as a before-and-after comparison. Show what the agent does before training, when it acts randomly or inefficiently. Then show what it does after training. If possible, print the path it takes through the grid, display the total reward, and mention whether it reached the goal. You do not need fancy graphics. A clear text-based or simple plotted demonstration is enough if it shows improvement plainly.
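A text-based demonstration can be as small as the sketch below, which draws a path on the grid using plain characters. The symbols, the 4x4 layout, and the example path are all hypothetical, and the function expects a path with at least one cell.

```python
def render_path(path, size, goal, traps):
    """Draw the grid as text: S start, * visited, X trap, G goal, . empty."""
    rows = []
    for r in range(size):
        row = []
        for c in range(size):
            cell = (r, c)
            if cell == goal:
                row.append("G")
            elif cell in traps:
                row.append("X")
            elif cell == path[0]:
                row.append("S")
            elif cell in path:
                row.append("*")
            else:
                row.append(".")
        rows.append(" ".join(row))
    return "\n".join(rows)


# Hypothetical path for illustration: start at (0, 0), goal at (3, 3).
path = [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (3, 2), (3, 3)]
picture = render_path(path, size=4, goal=(3, 3), traps={(1, 1)})

print(picture)
print("steps taken:", len(path) - 1)
print("reached goal:", path[-1] == (3, 3))
```

Printing one picture for an untrained agent and one for the trained agent gives you the before-and-after comparison with no graphics library at all.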
Good engineering judgment also means being honest about limitations. Maybe the agent succeeds in the small environment but struggles if you add more traps. Maybe it learns a safe route but not always the shortest route. That is not failure. It is useful evaluation. Reinforcement learning work becomes more credible when you describe both strengths and boundaries.
A common mistake is confusing “the code executed” with “the project succeeded.” A successful project is one where the training setup, test method, and final demonstration all support the same conclusion. For example, if your test success rate is high and the demonstrated path is efficient, you can reasonably say the agent learned a useful policy for this environment. That is exactly the kind of complete beginner project you can show to classmates, friends, or employers as evidence that you understand the full reinforcement learning workflow.
Building a small reinforcement learning project is only part of the achievement. The other part is being able to explain it clearly. If you can describe your agent in plain language, you prove that you understand more than just the code. A strong explanation starts with the task itself: what the environment was, what actions the agent could take, and what counted as success. Then explain the learning signal: which rewards encouraged good decisions and which penalties discouraged bad ones.
For example, you might say: “I built an agent that learned to move through a grid to reach a goal. It could move up, down, left, or right. It earned a positive reward for reaching the goal, lost reward for stepping into trap cells, and had a small penalty each step so it would prefer shorter routes. Over many episodes, it learned which actions produced better long-term outcomes.” That description is simple, correct, and understandable even to someone new to reinforcement learning.
It also helps to explain what the agent did not learn. In most beginner projects, the agent does not “understand” the world like a human. It does not reason with language or intentions. It learns patterns in action and reward. Saying this clearly prevents overclaiming. It also shows maturity. Reinforcement learning is powerful, but in your first project the achievement is learning a useful behavior policy, not creating general intelligence.
When sharing your project, include practical details: how many episodes you trained for, whether you used exploration during training, and how you tested the final policy. Mention one challenge you faced, such as reward tuning or unstable performance early in training. That gives your explanation realism. It turns the project into a genuine learning journey rather than a polished black box.
A useful communication pattern is problem, method, result, lesson. Problem: what task the agent needed to solve. Method: how the environment, actions, and rewards were defined. Result: what changed after training. Lesson: what you learned about reinforcement learning from the project. This structure works well in presentations, portfolio descriptions, and short write-ups. It helps other people understand your AI, and it helps you organize your own thinking about what you built.
Finishing a small shareable project gives you a strong foundation, but it is also the starting line for deeper reinforcement learning work. The next step is not to jump immediately into the most complex methods. Instead, build outward in controlled ways. One useful path is to modify your project. Add more obstacles, vary the starting state, change the reward design, or compare two learning settings. Small controlled changes teach you far more than abandoning the project and starting something enormous.
Another good next step is to compare algorithms. If you used Q-learning, try a related method and see how behavior changes. You can also experiment with hyperparameters such as learning rate, discount factor, and exploration level. This builds the practical skill of tuning an agent rather than treating training as magic. Reinforcement learning often depends as much on careful setup and evaluation as on the algorithm name itself.
You may also want to move from table-based learning to function approximation later, especially when state spaces become too large for a simple table. That opens the door to deep reinforcement learning. But the lesson from this course should stay with you: larger models do not replace clear thinking. Even in more advanced systems, you still need a well-defined environment, sensible rewards, proper evaluation, and an honest explanation of results.
If you want portfolio-worthy next projects, keep them focused. Examples include a slightly larger navigation world, a resource collection game, or a tiny balancing task in a standard RL environment. In each case, preserve the habits you built here: define the task clearly, train and test separately, measure outcomes, and communicate what the agent learned. Those habits scale much better than rushing into complexity.
Most importantly, keep your beginner mindset in the best sense: curious, concrete, and willing to inspect details. Reinforcement learning becomes easier to understand when you treat each project as an experiment. You propose a setup, observe behavior, refine rewards, and evaluate the results. That is exactly what you have done in this chapter. You have not just studied reinforcement learning concepts. You have completed a full small project from idea to result, and that is the right way to grow into the next level.
1. What makes a good first reinforcement learning project in this chapter?
2. Why does the chapter recommend defining rewards carefully?
3. What is the difference between training and testing the final agent?
4. Why can a very simple environment be helpful for beginners?
5. How should you present your finished project according to the chapter?