AI Research & Academic Skills — Beginner
Turn a simple question into a mini AI study you can clearly explain.
This course is a short, book-style guide for absolute beginners who want to move from “I’m curious about AI” to “I ran a small study and I can defend what I did.” You don’t need coding, advanced math, or research experience. You’ll learn a practical, repeatable way to test one small claim about an AI tool or AI-assisted workflow and report what happened in clear, everyday language.
Instead of drowning you in jargon, we start from first principles: what an experiment is, why you need a comparison (a baseline), and how to keep your process fair. By the end, you’ll have a mini study you can share at work, in school, or in a personal portfolio—plus a template you can reuse anytime you want to test another idea.
You will create a small, complete “mini study package.” It includes a testable question, a simple plan, a tiny dataset (often 20–50 examples), a results table, one clear chart, and a one-page report. The goal is not to prove a universal truth. The goal is to produce honest, understandable evidence about a specific question in a specific setting.
We begin by defining what counts as an experiment and how a mini study differs from casual “trying things.” Next, you turn your curiosity into a testable question and write a plan before you start—so you don’t accidentally change your rules halfway through. Then you build a small dataset and learn how to keep it trustworthy with simple documentation and privacy-safe habits.
After that, you run your experiment without coding by using consistent procedures and careful logging. You’ll evaluate results with beginner-friendly measures (counts, averages, time saved) and learn how to avoid common traps like cherry-picking or overclaiming. Finally, you’ll write and present your mini study so a non-technical reader can understand what you did and why your conclusions are reasonable.
This course is for anyone who needs to make decisions about AI with evidence: students, educators, analysts, program staff, product teams, and curious individuals. It’s also a fit if you want to talk about AI more confidently—because you’ll learn the difference between an impressive demo and a fair test.
If you want a guided, beginner-safe path, you can begin right away and build your mini study step by step. Register free to access the course, or browse all courses to find related learning paths.
When you finish, you won’t just “know about AI.” You’ll know how to ask a clear question, test it in a small but honest way, and explain your results with confidence—using a method you can repeat for your next idea.
Learning Scientist and AI Research Methods Instructor
Sofia Chen designs beginner-friendly research training for teams and universities. She specializes in turning complex AI and evaluation ideas into simple, repeatable workflows that produce clear, defensible results.
Beginners often start an AI project with the right energy—curiosity—but the wrong framing: “Let’s try AI and see what happens.” This course treats your curiosity as a starting point, then turns it into a testable mini study with results you can trust. An AI experiment is not “using an AI tool.” It is a structured comparison that tests a claim, under constraints, with defined success criteria.
In this chapter you will build the mindset and the simplest workflow that you’ll reuse throughout the course. You will choose a small real-world problem you care about (Milestone 1), learn to separate “trying AI” from “testing a claim” (Milestone 2), draft a one-sentence study goal and audience (Milestone 3), decide what you will and will not measure (Milestone 4), and finish by creating a personal checklist you’ll follow in later chapters (Milestone 5).
Think of your mini study as a tiny bridge between everyday questions and evidence. The bridge doesn’t need advanced math or code. It needs clear decisions: what you’re testing, what you’re comparing against, what counts as success, and what data you will use. When those decisions are explicit, your results become interpretable—and repeatable.
By the end of Chapter 1 you should have one topic you can safely study, a clear claim you could test, and a plan for what evidence would change your mind.
Practice note for Milestone 1 (choose a small, real-world problem you care about): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (separate "trying AI" from "testing a claim"): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (draft a one-sentence study goal and audience): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (decide what you will and will not measure): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (create your personal study checklist for the course): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Curiosity sounds like: “Can AI help me sort emails?” Evidence sounds like: “Compared with my current method, an AI-assisted method reduces sorting time without increasing mistakes.” The shift from curiosity to evidence is the core skill in this course. Curiosity is open-ended; evidence is bounded by a claim you can test.
Start with Milestone 1: choose a small, real-world problem you care about. The best beginner topics are tasks you already do and can observe: summarizing short articles, categorizing feedback, drafting polite replies, extracting key fields from simple forms, or identifying duplicates in a list. If you can’t describe the task in one sentence, it’s probably too big.
Then apply Milestone 2: separate “trying AI” from “testing a claim.” Trying AI is exploratory and can be useful, but it produces weak conclusions because the goalposts move. Testing a claim forces you to decide what would count as “better” before you see results. A practical rule: if you can’t say what result would disappoint you, you are not running an experiment yet.
Your goal in this section is to treat curiosity as a hypothesis candidate, not as the study itself. Evidence requires pre-commitment: a task, a metric, and a comparison.
In plain language, an experiment is a fair test where you change one thing on purpose and observe what happens, while keeping the rest as consistent as you can. For mini AI studies, the “one thing” is usually the method: a baseline method versus an AI-assisted method. The “what happens” is measured by outcomes you define up front.
Milestone 3 is where this becomes concrete: draft a one-sentence study goal and audience. A useful template is: "For [audience], does [AI approach] improve [outcome] on [task] compared with [baseline]?" Example: "For a student writing weekly reading reflections, does an AI summarizer improve clarity and reduce drafting time compared with writing unaided?"
Milestone 4: decide what you will and will not measure. Beginners often measure vague satisfaction (“it felt good”). Instead, pick observable outcomes: time to complete, number of correct labels, number of missing fields, or a simple quality score rubric you can apply consistently. Also explicitly list what you won’t measure (for example, “I will not claim the summaries are factually perfect; I will measure only whether they capture the top 3 points from the source text”). This is not weakness—it is good scientific boundaries.
If you can name these elements in your own words, you are doing experimental thinking—even without code.
AI is best treated like a new tool in your workflow, not a mysterious brain. Tools have strengths, failure modes, and operating instructions. When you frame AI as a tool, you naturally ask the right questions: What input does it need? What output does it produce? How stable are results if you repeat the same prompt? What kinds of errors does it make?
This mindset prevents a common beginner mistake: confusing a good-looking output with a verified result. A fluent paragraph can still be wrong, inconsistent, or inappropriate for your audience. Your mini study should therefore focus on the task outcome, not the “wow factor.” For example, if the task is classification, the output should be judged by correct categories. If the task is drafting, judge by a rubric (clarity, completeness, tone), not by how “human” it sounds.
Engineering judgment matters here. Decide early what role AI plays: generator, editor, classifier, extractor, or comparator. Then decide where humans must stay in the loop. For beginners, the safest and most realistic claims are about assisting humans, not replacing them: “AI helps me draft faster,” “AI helps me spot missing fields,” or “AI helps me generate options I then choose from.”
When AI is treated as a tool, your experiment becomes about whether the tool improves a defined job—under the same conditions you would actually use it.
A baseline is what you would do without the AI approach. It is not an “anti-AI” position—it is the reference point that makes your results meaningful. Without a baseline, you can’t tell whether the AI approach is better, worse, or simply different. This is why the course outcome includes “Run a basic comparison (baseline vs. AI approach) without coding.” A comparison is the smallest unit of evidence.
Good baselines are simple and realistic. If you currently do the task manually, manual work is your baseline. If you already use templates, then “template-only” can be your baseline. If you already use search, then “search + manual synthesis” can be your baseline. The key is fairness: your baseline should be something a reasonable person would actually use, not an intentionally weak strawman.
In practice, your mini study can look like this: pick 10–30 items (emails, short texts, feedback comments). Do the baseline method on all items, record time and quality. Then do the AI-assisted method on the same items, with the same time budget and the same rubric, record time and quality. If you worry about learning effects (you get faster on the second pass), split items into two sets and alternate: half baseline first, half AI first.
Once you have a baseline, your results can be summarized as deltas: “AI saved 30% time but increased minor errors,” or “AI improved rubric score by 1 point on average.” That is what “counts” as an experiment: a measured comparison.
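All of these deltas can be computed with spreadsheet formulas. If you happen to be comfortable with a few lines of Python, the same summary looks like this; the per-item numbers below are made-up placeholders, not results from a real study.

```python
# Minimal sketch: summarizing a paired baseline-vs-AI comparison as deltas.
# The per-item values are placeholders; replace them with your own measurements.

baseline_minutes = [4.0, 5.5, 3.0, 6.0, 4.5]   # time per item, baseline method
ai_minutes       = [3.0, 3.5, 2.5, 4.0, 3.5]   # time per item, AI-assisted method
baseline_rubric  = [3, 4, 3, 2, 4]             # 1-5 quality scores, baseline
ai_rubric        = [4, 4, 4, 3, 4]             # 1-5 quality scores, AI-assisted

def mean(values):
    return sum(values) / len(values)

time_saved_pct = 100 * (mean(baseline_minutes) - mean(ai_minutes)) / mean(baseline_minutes)
rubric_delta = mean(ai_rubric) - mean(baseline_rubric)

print(f"Average time saved: {time_saved_pct:.0f}%")
print(f"Average rubric change: {rubric_delta:+.1f} points")
```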
“Mini” is not a downgrade; it is a design choice. Constraints force clarity. A beginner-friendly AI experiment should fit into a predictable time window (often 2–6 hours total across planning, data prep, running, and writing). If your idea requires weeks of data collection, specialized annotation, or complicated tooling, it is not mini anymore.
Scope is the number one reason beginner studies fail. People try to test too many things at once: multiple prompts, multiple models, multiple datasets, multiple metrics, multiple audiences. The result is confusing evidence. Your engineering judgment here is to reduce degrees of freedom: one task, one dataset version, one baseline, one AI-assisted method, and a small set of measures.
Use Milestone 4 explicitly: decide what you will and will not measure. A mini study often measures only two things: (1) efficiency (time) and (2) quality (a simple rubric or accuracy). Everything else becomes discussion, not a headline claim. Also write down non-goals: “I will not generalize beyond my dataset,” “I will not claim the model is unbiased,” “I will not claim causality beyond this workflow comparison.” These statements protect you from overstating results.
Mini means you can finish, explain, and repeat the study. If you can’t repeat it next week the same way, your design is too fragile.
Safety and doability are not optional in beginner research—they are part of good method. A safe topic is one where you can collect or create data without violating privacy, confidentiality, or platform rules, and where harm from mistakes is low. A doable topic is one where you can access data, define labels, and evaluate outcomes with your current skills.
A practical way to stay safe is to use one of these data sources: your own non-sensitive writing (notes you are comfortable sharing), public-domain or openly licensed text, synthetic examples you generate yourself, or small datasets you can fully anonymize. Avoid using private messages, client data, medical or legal records, student records, or anything that could identify a real person. Even if you remove names, context can re-identify people. When in doubt, do not use it.
Now apply Milestone 5: create your personal study checklist for the course. Your checklist should include at least: the problem statement, audience, baseline method, AI method, dataset source, privacy check, success metric, and a stop rule (when you’ll stop collecting data or tweaking prompts). This checklist is your guardrail against scope creep and accidental rule-breaking.
When you pick a topic that is safe and doable, you set yourself up for the rest of the course outcomes: a tiny dataset you can document, a fair baseline comparison, and results you can report responsibly.
1. Which description best matches what this chapter calls an AI experiment?
2. What is the key difference between “trying AI” and “testing a claim” in this chapter?
3. Why does the chapter emphasize making decisions about what you will and will not measure?
4. Which statement best reflects the chapter’s stance on what your mini study is trying to show?
5. By the end of Chapter 1, which set of outcomes best matches what you should have?
Most beginner AI studies fail for a simple reason: the “idea” is interesting, but the question is not testable. In this chapter you will turn curiosity into a question you can answer with a small, safe dataset, then create a plan you can execute without coding. Your goal is not to build the best model in the world; your goal is to run a fair comparison (a baseline vs. an AI-assisted approach), record what you did, and end with results you can defend.
Think of your mini study as a short recipe. The recipe needs: (1) a clear question, (2) a simple hypothesis, (3) defined inputs and outputs, (4) a comparison setup you can actually run, and (5) a methods paragraph written before you start so you don’t “move the goalposts” later. You will also state risks and assumptions upfront so readers know your limits.
As you work through the milestones, keep the scope tiny. Ten to fifty examples is often enough for a beginner mini study, as long as you are consistent. Keep your data non-sensitive, document where it came from, and choose evaluation measures that match your question. The best plan is the one you can finish this week.
Practice note for Milestone 1 (turn your topic into a testable question): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (write a beginner-friendly hypothesis): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (define inputs, outputs, and what you will record): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (pick an A vs. B comparison setup you can run): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (pre-write your "methods" paragraph before you start): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Milestone 1 is turning a topic into a testable question. “Can AI help students write better?” is too broad. A testable question is specific (what task, what content, what model/tool), measurable (what metric decides success), and time-bounded (over what period or within what constraints). This structure forces you to make engineering judgments early, before you collect data.
A practical template is: In [context], does [AI approach] improve [measurable outcome] compared to [baseline], on [dataset] within [time/constraints]? Example: “In short customer support emails, does an AI rewrite prompt reduce grammar errors compared to no rewrite, on 30 publicly available emails, in a single session using the same model version?” That question already implies what you will record and how you will judge success.
Common mistakes include hiding multiple questions inside one (“accuracy and tone and speed”), choosing a metric you can’t reliably measure, or letting the dataset drift (“I added a few more examples that looked interesting”). Keep one primary outcome. If you care about secondary outcomes (like time to complete), list them as secondary so they do not overtake the study.
By the end of this section you should have one sentence that a friend could use to run the same test tomorrow.
Milestone 2 is writing a beginner-friendly hypothesis. A hypothesis is not a wish; it is your best guess about what will happen, plus a short reason grounded in how the method works. Keep it in plain language, and tie it directly to your metric from Section 2.1. You are allowed to be wrong—being wrong is still a result if the study was fair.
Use a simple format: We expect A to beat B on metric M because R. Example: “We expect AI-assisted rewriting to reduce grammar errors (M) compared to the original text (B) because the model has been trained on large amounts of edited English and can apply common corrections quickly (R).” If you are comparing two prompts, explain why prompt design should matter (e.g., providing examples, constraints, or a rubric).
Also state what would surprise you. This helps you avoid rationalizing after the fact. For instance: “If the AI rewrite increases errors, we suspect it is over-editing domain terms or changing meaning.” That statement will later guide your error analysis.
Common mistakes: making the hypothesis vague (“AI will be better”), stacking multiple claims (“faster and more accurate and more polite”), or forgetting the baseline. Your hypothesis must reference the comparison setup; otherwise it cannot be tested. Keep one primary hypothesis and, if needed, one secondary hypothesis that you clearly label.
A clear hypothesis makes your results readable: the reader can immediately see whether the evidence supports what you expected.
Milestone 3 is defining inputs, outputs, and what you will record. Instead of using heavy statistics terms, think in three columns: what you change, what you keep the same, and what you measure. This is the heart of a fair mini study.
What changes is your “approach” (Condition A vs. Condition B). Examples: no AI vs. AI; Prompt 1 vs. Prompt 2; model X vs. model Y; before vs. after a checklist. Change only one main thing at a time. If you change the model and the prompt and the dataset, you won’t know what caused any difference.
What stays fixed includes your dataset, evaluation rubric, time window, and any constraints (e.g., “one attempt only” or “no external browsing”). Fixing these reduces hidden bias. A frequent beginner error is letting the AI have extra advantages (multiple retries) while the baseline gets one try. Decide upfront whether both sides get one attempt or both get the same number of retries.
What you measure should map to your question: accuracy, error count, completeness, readability score, or a rubric rating. If humans rate outputs, define the rubric in writing and keep the raters blind to which condition produced which output when possible. Even in a small study, a simple 1–5 rubric with clear anchors (“1 = wrong meaning, 5 = correct and complete”) beats an informal opinion.
This section sets you up to collect a tiny, safe dataset with a documented source and to avoid accidental rule changes halfway through.
Milestone 4 is choosing a comparison setup you can run. You are not trying to prove AI is “good” in general; you are testing whether a specific AI-assisted method beats a baseline on a defined task. Pick a pattern that matches your real situation and your available time.
A/B (baseline vs. AI approach) is the simplest and most common. Condition A is your baseline (e.g., original text, manual rule, existing template). Condition B is the AI approach (e.g., a rewrite prompt, an extraction prompt). Run both on the same items, then compare metrics item-by-item. This reduces noise because each item serves as its own reference.
Before/after is useful when the “intervention” is a process change, such as adding a checklist to your prompt or adding examples. You run “before” with the old prompt, then “after” with the improved prompt, on the same dataset. Be careful: if you edit your dataset between runs, you are no longer testing the prompt change.
Human vs. AI can work without coding if the task is small and you define time limits. Example: a human labels 30 items in 20 minutes; AI labels the same items; you compare accuracy against a reference answer key. Fairness matters: if the human is an expert and the AI is generic, state that. If the human can look things up but the AI cannot browse, state that too.
Common mistakes include comparing outputs from different datasets, changing evaluation rules midstream, or judging with “vibes” instead of a rubric. Your design should make it hard to cheat accidentally. If you can, randomize the order of items so fatigue does not bias one condition.
This section is also where you confirm you can run the whole study without coding: spreadsheets, copy/paste, and a consistent rubric are enough for a first mini study.
Good results require good notes. If you cannot explain exactly how an output was produced, you cannot trust your comparison. In a beginner mini study, logging is your “lab notebook.” It also protects you from unconscious cherry-picking because you can show what you tried and when.
At minimum, log the exact prompt text for each condition, including any examples, formatting instructions, and constraints. If you used a chat tool, copy the full prompt into a document or spreadsheet. Also log model/tool name and version/date (models change), plus any visible settings like temperature, top-p, or “creative vs. precise” toggles. If settings are not visible, state that they were default.
Log dataset details: where each item came from (URL, book page, your own writing), when you collected it, and why it is safe to use (public domain, synthetic, or your own). If you cleaned or edited items, record what you changed and why. For each item, store the input, the baseline output, the AI output, and your score.
This logging directly supports Milestone 5, because your “methods” paragraph should be a readable summary of what you logged. If you log well, writing becomes easy.
Milestone 5 is pre-writing your methods paragraph before you start, and that includes stating risks and assumptions. This is not bureaucracy; it is how you keep your study honest and safe. Your methods paragraph should describe your question, dataset size/source, conditions A and B, evaluation metric, and how you handled ambiguous cases. Writing it first prevents you from changing the plan after seeing results.
State assumptions such as: “We assume the reference answers are correct,” “We assume our 1–5 rubric captures quality,” or “We assume the dataset represents typical inputs.” Then state risks and how you reduce them. Common risks include sensitive data exposure (avoid by using synthetic or public data), unfair comparisons (avoid by equal attempts and same dataset), and tool drift (avoid by recording date/model version).
Also name validity limits. A 30-item dataset cannot prove a universal claim. Say what your results do and do not generalize to: “This applies to short emails in English, not long legal documents.” If humans rate outputs, mention rater bias and whether you blinded the condition. If you used the AI to help create the dataset, disclose that because it can inflate performance.
When you finish this section, you should have a ready-to-run mini study plan and a draft methods paragraph you can paste into your final report. That is the moment your curiosity becomes a real, testable study.
1. What is the main reason many beginner AI studies fail, according to Chapter 2?
2. Which plan best matches the chapter’s goal for a beginner mini AI study?
3. Why does Chapter 2 recommend writing the methods paragraph before you start the study?
4. What is an appropriate scope for a beginner mini study in this chapter?
5. Which set of elements most closely matches the chapter’s “recipe” for a mini study?
Your mini study will only be as credible as the data you build. That might sound intimidating, but you do not need a “big data” pipeline to do real research. You need a small, clearly defined dataset (20–50 examples) that matches your question, a consistent way to judge each example, and a simple record of what you did. This chapter turns “I found some examples” into “I can defend where these examples came from and how they were handled.”
In a beginner mini study, your dataset is usually a simple table: one row per example and a few columns describing it (text, label, rating, count, source, date). You will make several small, practical decisions: what type of data you need, how to sample without cherry-picking, how to label consistently, how to clean without losing track of changes, and how to write a short data sheet so someone else can understand (and trust) your results.
Keep your target in mind: later chapters will compare a baseline method to an AI-assisted method. If your dataset is messy, biased toward easy cases, or undocumented, you can still get numbers—but they will be hard to interpret. The goal here is not perfection; it is transparency and repeatability at a small scale.
Practice note for Milestone 1 (choose your data type: text, labels, ratings, or counts): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (collect or create 20–50 examples the right way): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (make a simple labeling guide and test it on 5 items): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (clean your dataset and record changes): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (create a data sheet you can share with your report): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
In a mini study, “data” means the concrete evidence you will apply a method to and then evaluate. That evidence can take different forms, and choosing the right type is your first milestone. The type must match your research question and what “success” means. If your question is about classification (e.g., “Can an AI label customer messages by topic?”), you need text and labels. If your question is about quality (e.g., “Does AI make summaries more readable?”), you may need ratings (1–5) from you or a small group. If your question is about frequency (e.g., “Does AI extract more key facts?”), you may use counts (how many facts were found).
Most beginner datasets can be stored in a spreadsheet with columns like: ID, Input (text/image description), Gold Label (your best-known answer), AI Output (later), Baseline Output (later), Notes, Source, and Date Collected. Notice that “gold label” does not mean perfect truth; it means your defined reference according to a rubric. If you cannot define what “correct” means, you do not have a label yet—you have an opinion. Turning opinions into a rubric is a major part of making a dataset trustworthy.
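If you ever want the same structure outside a spreadsheet, the sketch below writes one such row to a CSV file; the file name, column names, and example row are assumptions chosen to match the naming conventions used later in this chapter.

```python
import csv

# Suggested column layout for a mini-study dataset; rename columns to fit your own study.
COLUMNS = ["ID", "Input", "Gold_Label", "Baseline_Output", "AI_Output",
           "Notes", "Source", "Date_Collected"]

rows = [
    {"ID": 1,
     "Input": "My invoice was charged twice this month.",
     "Gold_Label": "billing",
     "Baseline_Output": "",            # filled in later, during the runs
     "AI_Output": "",                  # filled in later, during the runs
     "Notes": "",
     "Source": "synthetic example written by author",
     "Date_Collected": "2024-05-01"},
]

with open("dataset_raw_v1.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```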
Engineering judgment matters here: keep each example independent (one message, one review, one claim) and similar in scope. Mixing very short and very long items can blur whether the method is good or the task is just easier. Also, avoid changing the task halfway: if your label set is “billing / technical / other,” do not later decide you want “shipping” too unless you are willing to relabel everything and document that change.
Once you know your data type, your second milestone is collecting or creating 20–50 examples the right way. The biggest beginner mistake is sampling only what is easy to find, easy to label, or matches what you hope the AI will do well on. That creates a “too-clean” dataset and inflates your results. Sampling is the practical skill of choosing examples so they represent what you actually care about.
Start by writing your population in one sentence: “All customer support messages received by our club during the last month” or “All short news paragraphs about local events.” Then define your sampling window (dates, sources, categories). If you have access to a larger pool, use a simple method like: take every 10th message, or use a random number generator to pick 30 items from a list. If you do not have a pool, create one systematically: for instance, gather 10 items from each of three sources rather than 30 from one source. The point is not statistical perfection; it is resisting cherry-picking.
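A random-number column in a spreadsheet is enough for this. If you prefer a script, here is a minimal sketch of both rules (a random pick and every 10th item), assuming your candidate pool lives in a plain text file with one item per line; the file name is hypothetical.

```python
import random

# Minimal sketch: sample items from a larger pool so you are not cherry-picking.
with open("message_pool.txt", encoding="utf-8") as f:       # hypothetical pool file
    pool = [line.strip() for line in f if line.strip()]

random.seed(42)                       # fixed seed so the sample can be reproduced
sample = random.sample(pool, k=30)    # random rule; errors if the pool has fewer than 30 items
every_tenth = pool[::10]              # systematic rule: every 10th item, starting from the first

for item in sample:
    print(item)
```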
Practical outcome: by the end of this milestone you should have a dataset draft with 20–50 rows, each row traceable to where it came from. If you can’t explain why those exact items were chosen, pause and improve your sampling rule before you proceed. This is the fastest way to protect your study from accidental bias.
Your third milestone is building a simple labeling guide (a rubric) and testing it on five items. Labels are the “answers” you will compare methods against, and beginners often underestimate how hard it is to label consistently—even when the task feels obvious. A rubric turns “I know it when I see it” into rules that someone else could follow.
A good labeling guide has four parts: (1) the label list (e.g., Positive/Neutral/Negative), (2) definitions for each label, (3) decision rules for edge cases, and (4) examples (2–3 per label). Keep it short—one page is enough. If you are using ratings, define what each score means (e.g., 1 = confusing, 3 = okay, 5 = clear and complete) and what to do when you are uncertain.
Now test the guide on five items before labeling everything. Label those five items, wait a short time, and label them again (or have a friend label them) to see if the rubric produces stable decisions. If you change your mind often, the rubric needs clearer rules. Common fixes include adding a “Mixed/Unclear” label, specifying what to do when two topics appear, or defining a threshold (“If more than half the message is about billing, label Billing”).
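A quick way to check stability is plain percent agreement between the two labeling passes (yours and the repeat, or yours and a friend's). A minimal sketch with placeholder labels:

```python
# Minimal sketch: percent agreement between two labeling passes on the same pilot items.
# The labels below are placeholders for your own five pilot items.
pass_1 = ["Billing", "Technical", "Other", "Billing", "Technical"]
pass_2 = ["Billing", "Technical", "Billing", "Billing", "Technical"]

agreements = sum(1 for a, b in zip(pass_1, pass_2) if a == b)
agreement_pct = 100 * agreements / len(pass_1)

print(f"Agreement: {agreements}/{len(pass_1)} items ({agreement_pct:.0f}%)")
# Low agreement means the rubric needs clearer rules before you label the full dataset.
```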
Practical outcome: you should end this milestone with a labeling guide file and a small “pilot” set of five labeled items. Document what you changed after the pilot. That small step prevents you from discovering, 40 rows later, that your label meanings were drifting day by day.
Your fourth milestone is cleaning your dataset and recording changes. Cleaning does not mean forcing the data to look nice; it means making sure each row is usable and comparable. Beginners sometimes “clean away” the very difficulty they want to study. For example, removing typos might make the AI look better (or worse) depending on the task. Decide what cleaning is allowed based on your question, and keep the raw version.
Run a few simple quality checks you can do in a spreadsheet:
- Duplicates: sort or search for identical inputs and remove repeats, noting how many you removed.
- Broken rows: look for empty cells, truncated text, or inputs that landed in the wrong column.
- Label spelling: make sure each label is written exactly one way (for example, "Billing," not a mix of "billing" and "Bill").
- Scope: confirm every row still matches your sampling rule and your one-sentence population.
Also check for “label noise”: scan a few examples per label and ask, “Do these really belong together?” If one label contains many borderline cases, your rubric may be too broad. Another practical check is to compute basic counts per label (how many items in each category). If one label has 28 items and another has 2, your evaluation may be unstable; you may need to rebalance by collecting a few more examples in underrepresented categories (while documenting how you did it).
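A pivot table or COUNTIF gives you these counts in a spreadsheet. The equivalent in Python, assuming a cleaned CSV with a Gold_Label column (both names are just the conventions used in this chapter), might look like this:

```python
import csv
from collections import Counter

# Minimal sketch: count how many items fall under each label.
with open("dataset_clean_v2.csv", encoding="utf-8") as f:    # assumed file name
    labels = [row["Gold_Label"] for row in csv.DictReader(f)]

for label, count in Counter(labels).most_common():
    print(f"{label}: {count}")
# A very uneven split (say 28 vs. 2) is a sign to collect a few more examples
# in the underrepresented category and document how you did it.
```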
Practical outcome: produce a cleaned dataset file and a change log describing what was altered (e.g., “standardized label spelling,” “removed 2 duplicates,” “fixed 3 broken rows”). You should be able to explain every change as either correcting an error or enforcing a rule you wrote down.
Trustworthy datasets are not only accurate; they are also safe and ethical. Privacy issues can appear even in tiny studies. Personal data includes names, emails, phone numbers, addresses, usernames, student IDs, faces, voices, and sometimes combinations of details that identify someone. If your examples come from real people (messages, reviews, screenshots), assume personal data might be present unless you have verified otherwise.
Apply three beginner-friendly rules. First, collect the minimum: do you need the person’s name to answer your question? Usually not. Replace identifiers with placeholders (e.g., “NAME,” “EMAIL”) or remove them entirely. Second, get consent when required: if you are using private messages, classroom work, or anything not clearly public, get permission from the owner or use a public, permitted dataset instead. Third, store safely: keep files in a private folder, avoid posting raw data in public documents, and be careful with cloud sharing links.
If you plan to use AI tools during your study, treat tool prompts as data sharing. Do not paste sensitive text into a system you do not control unless you have permission and you understand the tool’s data handling policy. A safer pattern is to anonymize first, then use the anonymized text for any AI-assisted steps.
Practical outcome: add a “privacy status” note to your dataset (e.g., “Anonymized,” “Public data,” “Consent obtained”) and make sure your report can describe privacy protections without revealing identities.
Your fifth milestone is creating a simple data sheet you can share with your report. Documentation is what makes a tiny dataset trustworthy to a reader who wasn’t there when you collected it. Think of this as the dataset’s “nutrition label”: it explains what is inside, where it came from, and how it changed.
At minimum, document: dataset name, purpose (one sentence), source (URLs, folders, or “created by author”), collection dates, inclusion/exclusion rules (what you left out and why), label definitions, number of items, and known limitations (e.g., “mostly English,” “mostly short messages,” “few neutral cases”). Add a short “intended use” line: what decisions it should and should not support.
Use simple version control even if you never touch Git. Save files with clear names like: dataset_raw_v1.csv, dataset_clean_v2.csv, and labels_guide_v2.docx. Keep a small change log in a text file or spreadsheet tab with columns: Date, File, Change, Reason, Person. This prevents silent edits that make results impossible to reproduce.
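If you would rather not maintain the change log by hand, a tiny script can append one row per change; the file name, columns, and example entry below are assumptions that mirror the suggestions above.

```python
import csv
import os
from datetime import date

LOG_FILE = "change_log.csv"
FIELDS = ["Date", "File", "Change", "Reason", "Person"]

def log_change(file_name, change, reason, person):
    """Append one row to the change log, creating the file with a header if needed."""
    write_header = not os.path.exists(LOG_FILE)
    with open(LOG_FILE, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({"Date": date.today().isoformat(), "File": file_name,
                         "Change": change, "Reason": reason, "Person": person})

log_change("dataset_clean_v2.csv", "removed 2 duplicates", "same message pasted twice", "author")
```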
Practical outcome: when you write your mini study report, you will be able to point to a single data sheet that explains the dataset in plain language. Readers can understand your evidence, and you can re-run your own study later without guessing which file was the “final” one.
1. What makes a beginner mini study dataset credible according to Chapter 3?
2. Which approach best supports avoiding cherry-picking when building your tiny dataset?
3. Why does Chapter 3 recommend making a simple labeling guide and testing it on 5 items?
4. What is the main reason to record changes while cleaning your dataset?
5. Which dataset format best matches the chapter’s recommended structure for a beginner mini study?
This chapter is where your mini study becomes “real.” You will run a baseline method and an AI method on the same small set of examples, using a repeatable procedure, and you will capture results in a way that another person (or future-you) can audit. The goal is not to build a perfect system. The goal is to produce a defensible comparison that answers your research question with minimal tools: a spreadsheet, consistent settings, and careful logging.
If you feel tempted to “just try a few prompts until it looks good,” pause. That impulse is exactly why studies become hard to trust. Instead, you’ll work like a cautious experimenter: lock your inputs, run both methods under controlled conditions, record everything you did, and only rerun when you have a documented reason.
By the end of this chapter, you should have: (1) a baseline method you can execute consistently (non-AI or simple AI), (2) an AI method with fixed settings, (3) outputs for both methods on the same examples, (4) a results table that can be checked line-by-line, and (5) a short list of obvious errors you corrected with minimal reruns. That is enough to support a beginner-friendly mini report later.
We’ll now focus on the practical “how” of running the experiment without writing code. Think of this as turning your idea into a small, disciplined production line: inputs go in, procedures run, outputs come out, and the record of what happened is as important as the outputs themselves.
Practice note for Milestone 1 (set up your baseline method, non-AI or simple AI): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (set up your AI method with consistent settings): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (run both methods on the same examples): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (log everything in a results table you can audit): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (spot obvious errors and rerun only what's necessary): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Consistency is the hidden backbone of a fair comparison. When beginners compare a baseline to an AI approach, the most common mistake is changing multiple things at once: different examples, different instructions, different time spent, or even different evaluation standards. If anything besides the method changes, you can’t be sure what caused the difference.
Start by “freezing” three things: your dataset (the exact examples), your evaluation rule (what counts as correct or good), and your run procedure (the steps you’ll follow each time). Then you can change only one thing: the method (baseline vs. AI). This is the practical meaning of controlling variables in a no-code study.
Milestone 1 (baseline) fits here: decide on a baseline you can apply with equal effort across all rows. A baseline might be a simple rule (“if the text contains ‘refund’ label as Complaint”), a keyword list, a template, or even a non-AI manual approach that is time-boxed (e.g., “Spend 30 seconds per item using a fixed checklist”). The baseline should be clearly weaker than a good AI method but reliable, repeatable, and not secretly tuned per example.
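Writing the baseline rule down precisely is what makes it repeatable. A minimal sketch of a keyword-rule baseline is shown below; the keywords and labels are illustrative, not a recommendation for your task.

```python
# Minimal sketch: a keyword-rule baseline that labels every message the same way each time.
# The keyword list and labels are illustrative; write down whatever rule your baseline uses.
KEYWORD_RULES = [
    ("refund", "Complaint"),
    ("invoice", "Billing"),
    ("password", "Technical"),
]

def baseline_label(text):
    lowered = text.lower()
    for keyword, label in KEYWORD_RULES:
        if keyword in lowered:
            return label
    return "Other"   # default when no keyword matches

print(baseline_label("I need a refund for my last order."))   # -> Complaint
print(baseline_label("The app crashes on startup."))          # -> Other
```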
Milestone 2 (AI method) also belongs here: use consistent settings. If you use a chat interface, note the model name/version, temperature (or creativity setting), and any system instructions. Do not switch models mid-run. If you must change a setting, treat it as a new experimental condition, not a silent tweak.
A prompt is not “one clever message.” In a study, a prompt is a procedure: stable instructions plus a consistent way to insert the example. If you can’t repeat it exactly, you can’t fairly compare results or rerun failures.
Create a prompt template with three parts: (1) role and task, (2) required output format, and (3) the example. Keep it short and explicit. For example, if you are doing classification, force a single label. If you need extraction, force a fixed set of fields. The more freedom you allow, the harder evaluation becomes.
Milestone 2 becomes actionable when you write your “Run Card”: a short checklist you follow for every example. A run card might include: open the same chat thread (or start a fresh thread—choose one and stick to it), paste the template, paste the example, submit, copy the output exactly, and paste it into the spreadsheet without editing. Editing model output is a silent form of cheating, even when you mean well.
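One way to guarantee the prompt stays identical on every row is to store it as a fixed template with a single slot for the example. The wording and label set below are assumptions for illustration; the important part is that only the example changes.

```python
# Minimal sketch: a fixed prompt template with one slot for the example text.
PROMPT_TEMPLATE = """You are labeling short customer messages.
Task: assign exactly one label from this list: Billing, Technical, Other.
Output format: reply with the label only, nothing else.

Message:
{example}"""

def build_prompt(example_text):
    return PROMPT_TEMPLATE.format(example=example_text)

print(build_prompt("My invoice was charged twice this month."))
```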
Milestone 3 (run both methods on the same examples) is where repeatability matters most. Alternate runs in a predictable pattern so you don’t accidentally give one method “better” examples. One simple approach is to run all baseline outputs first (fast, mechanical), then run AI outputs in the same order. Another approach is row-by-row: baseline then AI for row 1, baseline then AI for row 2, and so on. Pick one approach, document it, and follow it.
Common prompting mistakes include: changing wording when you feel uncertain, adding extra hints for hard rows, or letting the model output multiple answers and choosing the best. If you want to allow retries, define the retry rule before running (e.g., “If output is not in the allowed labels, rerun once with the same prompt.”) and apply it to both methods if relevant.
When you already know which outputs came from the AI, it is easy to judge them more generously (or more harshly). This is not a moral failure; it is human. A simple version of blinding can reduce that bias without complex tools.
Here are practical no-code blinding options, from easiest to stronger:
- Hide the column that records which method produced each output while you score, and reveal it only afterward.
- Shuffle the outputs into a random order (keeping the method key in a separate sheet) so position does not hint at the method.
- Ask a friend to replace the method names with neutral codes (A and B) and hold the key until all scoring is done.
Blinding matters most when evaluation involves judgment: readability, helpfulness, whether an answer “counts,” or whether an extraction is “close enough.” Even with objective labels, blinding helps you avoid “explaining away” errors for your preferred method.
To keep scoring consistent, define your rubric in one or two sentences. Example: “Correct if the label exactly matches the gold label.” Or: “Correct if the extracted email is present and valid; ignore extra text.” If you need partial credit, define it explicitly (e.g., 0, 0.5, 1) and use it for both methods.
Engineering judgment tip: if you notice yourself debating a score for more than a few seconds, your rubric is too vague. Stop, clarify the rule, document the rule change, and then apply it retroactively to everything already scored. Do not apply a new rule only to future rows; that creates inconsistent evaluation.
Real runs fail. You might get timeouts, empty responses, refusal messages, formatting that ignores your constraints, or outputs that are clearly unrelated. Failure handling is part of the experiment design because it affects results and fairness.
First, define what counts as a "failure" for your task. Typical categories include:
- Timeouts or empty responses (no usable output came back).
- Refusals (the tool declines to do the task).
- Invalid format (the output ignores your constraints, such as a paragraph instead of a single label).
- Off-task output (the response is clearly unrelated to the input).
Then define a minimal rerun policy (Milestone 5): rerun only when the failure is operational, not when the output is merely “bad.” For example, you can rerun a timeout once. You can rerun invalid format once using the same prompt. But if the model returns a valid label that happens to be wrong, that is not a reason to rerun—otherwise you are selecting for favorable outcomes.
In your spreadsheet, log failures explicitly rather than leaving blanks. Use codes such as: TIMEOUT, INVALID_FORMAT, REFUSAL. This matters because missing data can quietly distort your summary metrics. For instance, if you only compute accuracy on rows where the AI answered, you are overestimating performance unless you also report coverage (how often it answered).
Practical approach: maintain two columns per method—Raw Output and Parsed Output. Raw Output is exactly what you received (copy/paste). Parsed Output is what you used for scoring (e.g., the single label). If parsing fails, Parsed Output becomes a failure code. This keeps your audit trail intact and makes reruns targeted: you can filter to TIMEOUT rows and rerun only those.
Finally, avoid “silent fixes.” If you must intervene (e.g., remove extra punctuation to match a label), document a rule (“strip whitespace and punctuation”) and apply it consistently to all outputs before scoring.
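That kind of documented cleanup rule can itself be written as a small parsing step that turns each Raw Output into a Parsed Output or an explicit failure code. The sketch below assumes the label set and failure codes discussed in this section; adjust both to your own task.

```python
# Minimal sketch: turn a raw model output into a parsed label or an explicit failure code.
ALLOWED_LABELS = {"Billing", "Technical", "Other"}

def parse_output(raw_output):
    if raw_output is None or raw_output.strip() == "":
        return "TIMEOUT"                                   # no usable response received
    cleaned = raw_output.strip().strip(".,!?\"'").title()  # one documented cleanup rule
    if cleaned in ALLOWED_LABELS:
        return cleaned
    if "cannot help" in raw_output.lower() or "can't help" in raw_output.lower():
        return "REFUSAL"                                   # tool declined to answer
    return "INVALID_FORMAT"                                # answered, but not with an allowed label

print(parse_output("billing."))                        # -> Billing
print(parse_output(""))                                # -> TIMEOUT
print(parse_output("Maybe Billing or Technical?"))     # -> INVALID_FORMAT
```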
Your spreadsheet is your experiment engine. A clean results table lets you compute metrics, check errors, and prove that you did not cherry-pick. Build it as if someone skeptical will review it row-by-row—because that is how you learn good research habits.
Create one main sheet named Results_v1 with one row per example. Use stable IDs so you can reorder without losing traceability. A practical minimal schema:
- ID (a stable identifier for the example)
- Input (the exact text both methods receive)
- Gold_Label (your reference answer, if the task has one)
- Baseline_Raw and Baseline_Parsed (what the baseline produced, and the cleaned value you scored)
- Baseline_Score
- AI_Raw and AI_Parsed (the AI output exactly as received, and the cleaned value you scored)
- AI_Score
- Notes (failure codes, rerun flags, anything unusual)
This table is Milestone 4 in concrete form. It also helps you compute simple evaluation measures without coding. For accuracy-style tasks, you can calculate: Accuracy = AVERAGE(AI_Score) and Baseline Accuracy = AVERAGE(Baseline_Score). If you have failures, also compute coverage: Coverage = count of non-failure parsed outputs / total rows. For extraction tasks, you can define a per-row score such as “all required fields correct = 1 else 0,” which stays beginner-friendly.
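If you export the results sheet as a CSV, the same averages take only a few lines of Python. The file and column names below follow the schema suggested above and are only a convention; failure rows are assumed to carry a score of 0 so they count against accuracy.

```python
import csv

# Minimal sketch: compute coverage and per-method accuracy from an exported results table.
FAILURE_CODES = {"TIMEOUT", "INVALID_FORMAT", "REFUSAL"}

with open("results_v1.csv", encoding="utf-8") as f:        # assumed export of the Results_v1 sheet
    rows = list(csv.DictReader(f))

answered = [r for r in rows if r["AI_Parsed"] not in FAILURE_CODES]
coverage = len(answered) / len(rows)

ai_accuracy = sum(float(r["AI_Score"]) for r in rows) / len(rows)
baseline_accuracy = sum(float(r["Baseline_Score"]) for r in rows) / len(rows)

print(f"Coverage: {coverage:.0%}")
print(f"AI accuracy: {ai_accuracy:.0%}   Baseline accuracy: {baseline_accuracy:.0%}")
```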
Engineering judgment tip: keep the table narrow enough that you will actually use it. Beginners often create dozens of columns and then stop updating them. Start minimal, then add one column only when you can explain how it supports your research question or auditability.
Common table mistakes: overwriting raw outputs, mixing versions of the prompt/model in the same column, or correcting labels in-place without recording the original. Treat your spreadsheet like a lab instrument: it should preserve what happened, not what you wish had happened.
Your lab notebook is where your study becomes credible. In a beginner mini study, the notebook can be a simple document (or a spreadsheet tab) named Lab_Notes_v1. The key is not elegance; it is completeness. You are recording decisions and exceptions so that your future self can reproduce the run and so readers can trust your results.
Record entries with timestamps (even just the date) and use a consistent structure:
- Date (and time, if useful)
- What you did (which rows, which method, which prompt version)
- Decisions made, with the reason for each
- Exceptions and failures (what happened and which failure code you logged)
- Follow-ups (anything you plan to rerun or check next)
This is also where you document your milestone progress: when you finalized the baseline method, when you locked the AI settings, when you started the official run, and when you performed limited reruns under the failure policy. If you discover an error in your own process (e.g., you pasted the wrong input into a row), note it, fix it, and flag that row as “operator error corrected.” Hiding such issues is worse than having them, because hidden issues can invalidate conclusions.
Strong practical habit: separate “exploration” from “official run.” If you tried five prompts to learn what works, that’s normal—write it down as exploration. Then pick one prompt as the official procedure, copy it into the notebook, and run the full dataset once. If you later change the prompt, label it as a new version (Prompt_v2) and treat it as a new condition. This prevents accidental p-hacking-by-prompting.
When you later write your mini report, your notebook will supply the methods section: exactly what you did, with enough detail that a beginner classmate could reproduce it. That is the real outcome of this chapter: not just numbers, but a trustworthy path from question to results.
1. What is the main goal of running the experiment in Chapter 4?
2. Why does the chapter warn against “trying a few prompts until it looks good”?
3. Which practice best ensures a fair comparison between the baseline method and the AI method?
4. What does it mean for the results table to be “auditable”?
5. When should you rerun parts of the experiment according to the chapter?
By now, you have a clear question, a small dataset, and two approaches to compare (a baseline and an AI-assisted method). This chapter is where many beginner studies either become convincing or fall apart. The difference is rarely fancy math—it’s choosing measures that match your goal, applying them consistently, and resisting the temptation to “interpret” results before you’ve actually evaluated them.
Evaluation is not about proving your idea is right. It is about checking, with evidence, whether your AI approach improves something you care about. The work in this chapter aligns with five practical milestones: (1) pick 1–2 evaluation measures that match your goal, (2) calculate results using simple counts and averages, (3) make one chart that tells the story at a glance, (4) write a conclusion that follows the evidence, and (5) list limitations and alternative explanations so your reader understands what the results do and do not mean.
Keep your mindset steady: your goal is a small, honest study with a clear comparison. A “good” result can be “no meaningful difference,” if you can show it cleanly and explain why. That is still learning.
Practice note for Milestone 1 (pick 1–2 evaluation measures that match your goal): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2 (calculate results with simple counts and averages): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3 (make one clear chart that tells the main story): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4 (write your conclusion using evidence, not vibes): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5 (list limitations and alternative explanations): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Evaluation means judging performance against your goal using a rule you can apply repeatedly. Beginners often jump straight to “Is it accurate?” but accuracy is only one kind of goal. If your project is “summaries that help classmates study,” the most important outcome might be usefulness (clarity, completeness, readability), not exact word-for-word correctness.
Start by naming the decision you want to make. For example: “Should I use the AI approach instead of the baseline for this task?” Then choose a measure that captures the benefit. This is Milestone 1: pick 1–2 measures that match your goal. If you pick too many metrics, you can accidentally justify any conclusion. Keep it tight.
A practical way to decide is to ask: what would success look like for a real user? If your goal is factual correctness (e.g., classifying emails as spam/not spam), use an accuracy-style measure. If your goal is speed or convenience (e.g., drafting polite replies), measure time saved or user rating. If your goal is both, choose one primary measure (your “headline”) and one secondary measure (your “safety check”), such as “percent correct” plus “time per item.”
Engineering judgment matters here: you are not trying to measure everything; you are choosing what matters most for the decision you care about.
You can evaluate many mini studies with three beginner-friendly measures: percent correct, average rating, and time saved. The key is to define them in a way that you can compute with simple counts and averages (Milestone 2). Keep a small spreadsheet with one row per example and columns for baseline output, AI output, and evaluation notes.
Percent correct works when each item has a clear “right” answer (or a label you treat as ground truth). Compute it as: (number correct / total items) × 100. If your dataset has 20 items and the AI gets 16 correct, that’s 80%.
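If your results live in a spreadsheet, a single formula (count the correct rows and divide by the total) is enough. For readers comfortable with a little Python, the same calculation is only a few lines; the correctness flags below are hypothetical placeholders for your own scoring.

```python
# Per-item correctness flags from your own scoring (hypothetical values shown).
ai_correct = [True, True, False, True, True, True, False, True, True, True,
              True, True, True, False, True, True, True, False, True, True]

n_correct = sum(ai_correct)          # True counts as 1
n_total = len(ai_correct)
percent_correct = 100 * n_correct / n_total

# Report the counts behind the metric, not just the percentage.
print(f"{n_correct}/{n_total} correct ({percent_correct:.0f}%)")
```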
Average rating works when outputs are subjective but still comparable. Create a simple rubric such as 1–5 where 1 = not useful, 3 = okay, 5 = very useful. Rate baseline and AI for each item using the same rubric. Then compute the average. To reduce bias, write your rubric before you look at results, and include short rating notes like “missing key point” or “too long.”
Time saved is often the most convincing measure for productivity tasks. Decide what “time” means (minutes to complete the task, minutes to revise, total minutes end-to-end). Then measure consistently. If you cannot time precisely, use rough but consistent timing (e.g., phone stopwatch). Report average time per item and the difference between methods.
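Average rating and time saved both reduce to simple averages, so a spreadsheet handles them fine. For readers who prefer a script, here is a minimal optional sketch; the ratings and timings are hypothetical placeholders for your own per-item values.

```python
# Hypothetical 1-5 usefulness ratings and minutes per item for the same six items.
baseline_rating  = [3, 2, 4, 3, 3, 2]
ai_rating        = [4, 3, 4, 2, 4, 3]
baseline_minutes = [6.0, 8.5, 5.0, 7.0, 6.5, 9.0]
ai_minutes       = [3.5, 4.0, 3.0, 5.5, 4.0, 4.5]

def mean(values):
    return sum(values) / len(values)

print(f"Average rating: baseline {mean(baseline_rating):.1f}, AI {mean(ai_rating):.1f}")
print(f"Average minutes per item: baseline {mean(baseline_minutes):.1f}, AI {mean(ai_minutes):.1f}")
print(f"Average time saved per item: {mean(baseline_minutes) - mean(ai_minutes):.1f} minutes")
```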
These measures are simple, but they are legitimate when your study is small and your rules are clear.
A fair comparison is the heart of your mini study: baseline (A) versus AI approach (B). “Fair” means both methods face the same items under the same rules, and you score them using the same evaluation process. This is where many projects accidentally cheat without meaning to.
Use the same dataset split. If you created 30 examples, do not evaluate baseline on one set and AI on another. Put both outputs side-by-side for each item. If you are using prompts, lock the prompt template before evaluating. If you revise prompts mid-way, you have changed the method; record it as a new version and don’t mix results.
Define your baseline clearly. A baseline might be: “no AI, student writes from scratch,” or “a simple rule-based method,” or “copy/paste first sentence.” Your AI approach must be described just as clearly (model name, prompt, settings, constraints). Then apply both methods item-by-item. This prevents moving goalposts later.
For subjective ratings, fairness also means consistent judging. If possible, rate outputs blind: hide which column is baseline vs AI by labeling them Output 1 and Output 2 while you score. If you cannot blind yourself, at least write your rubric first and stick to it.
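One low-effort way to blind yourself is to let a short script decide, at random, which method appears as Output 1 versus Output 2 for each item, and to keep the answer key in a separate file you open only after scoring. This is an optional sketch, assuming a hypothetical results.csv with item_id, baseline_output, and ai_output columns.

```python
import csv
import random

# Hypothetical input file with columns: item_id, baseline_output, ai_output.
with open("results.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

blinded, key = [], []
for row in rows:
    flipped = random.random() < 0.5
    first, second = ("ai_output", "baseline_output") if flipped else ("baseline_output", "ai_output")
    blinded.append({"item_id": row["item_id"],
                    "output_1": row[first], "output_2": row[second]})
    key.append({"item_id": row["item_id"],
                "output_1_is": "AI" if flipped else "baseline"})

# Rate the blinded file first; open the key only after all ratings are done.
for name, data, fields in [("blinded.csv", blinded, ["item_id", "output_1", "output_2"]),
                           ("key.csv", key, ["item_id", "output_1_is"])]:
    with open(name, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        writer.writeheader()
        writer.writerows(data)
```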
This section connects directly to Milestones 2 and 3: clean calculations are only meaningful when the comparison is fair, and your eventual chart should reflect that fair pairing.
Small studies are noisy. That does not make them useless—it just means you must describe uncertainty in plain terms. If you test only 10 items, one or two “weird” items can swing your percent correct by 10–20 points. Your job is to notice that and communicate it honestly.
A practical approach: report both the overall metric and the counts behind it. Instead of only “80% correct,” write “16/20 correct (80%).” This lets readers see how much data you actually had. For ratings, include the number of items and consider showing the range (min and max) alongside the average.
Also watch for item difficulty. If half your examples are easy, improvements can look bigger than they are. When possible, include a mix of difficulties and note any clusters (e.g., “AI struggled on ambiguous questions”). You can also compare performance across simple categories: short vs long items, or easy vs hard, as long as you define the categories before analyzing.
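If you tagged each item with a category before analysis, the per-category breakdown is again just counts and averages. A small optional sketch, with hypothetical records standing in for your own spreadsheet rows:

```python
from collections import defaultdict

# Hypothetical per-item records: category assigned before analysis, plus a correctness flag.
items = [
    {"category": "short", "ai_correct": True},
    {"category": "short", "ai_correct": True},
    {"category": "long",  "ai_correct": False},
    {"category": "long",  "ai_correct": True},
    {"category": "long",  "ai_correct": False},
]

totals, correct = defaultdict(int), defaultdict(int)
for item in items:
    totals[item["category"]] += 1
    correct[item["category"]] += int(item["ai_correct"])

for category in totals:
    pct = 100 * correct[category] / totals[category]
    print(f"{category}: {correct[category]}/{totals[category]} correct ({pct:.0f}%)")
```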
Milestone 3 (one clear chart) helps here: a simple bar chart of baseline vs AI averages can be paired with counts, or you can use a paired dot plot where each item is a dot connected by a line from baseline to AI. The visual pattern often reveals uncertainty: if some items improve and others get worse, your “average gain” may be fragile.
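If you want to try the paired dot plot and are comfortable with a little Python, the following matplotlib sketch shows the idea; the ratings are hypothetical placeholders, and a hand-drawn or spreadsheet chart works just as well.

```python
import matplotlib.pyplot as plt

# Hypothetical per-item usefulness ratings (1-5) for eight items; replace with your own.
baseline = [3, 2, 4, 3, 3, 2, 4, 3]
ai       = [4, 3, 4, 2, 4, 3, 5, 4]

fig, ax = plt.subplots(figsize=(4, 4))
for b, a in zip(baseline, ai):
    ax.plot([0, 1], [b, a], color="grey", linewidth=1)  # one connecting line per item
    ax.scatter([0, 1], [b, a], color="black", s=15)     # dots for the two ratings

ax.set_xticks([0, 1])
ax.set_xticklabels(["Baseline", "AI"])
ax.set_ylabel("Usefulness rating (1-5)")
ax.set_title("Per-item change: baseline vs AI")
plt.tight_layout()
plt.savefig("paired_dot_plot.png")
```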
Uncertainty is not a failure; it is information about how confident you should be.
Most evaluation mistakes come from three traps: cherry-picking, moving goalposts, and leakage. Learning to spot them is part of becoming a careful researcher, even in a tiny beginner study.
Cherry-picking happens when you only show the best examples or only report the metric that makes your approach look good. The fix is simple: evaluate the full dataset you defined up front, and keep your primary metric consistent. If you want to include “best and worst” qualitative examples, include both, and explain why you selected them (e.g., representative cases, not only wins).
Moving goalposts happens when you redefine success after seeing results. For instance, you planned to measure percent correct, but after the AI performs poorly you switch to “helpfulness” without saying so. The fix: write your success definition and rubric before evaluation (Milestone 1) and stick to it. If you do change it, label the change as an exploratory follow-up, not the original test.
Leakage is when information from the answers sneaks into the inputs, making performance look better than it really is. In beginner projects, leakage often appears as: including the correct label in the prompt, using examples that the AI already saw in the conversation, or copying the answer key into notes the AI can access. Prevent it by separating “inputs the method gets” from “answers used only for scoring.” Keep your dataset in a fixed table, and do not feed ground-truth fields into the model.
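A simple structural safeguard is to build prompts only from the input columns and let the answer column exist only in the scoring step. The sketch below assumes a hypothetical dataset.csv with input_text and reference_label columns; the point is that reference_label never appears in the prompt.

```python
import csv

# Hypothetical dataset: reference_label is used only for scoring, never in prompts.
with open("dataset.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

prompts = []
for row in rows:
    # Only the input field goes to the model; the label stays out of the prompt.
    prompts.append(
        "Label the sentiment of this review as Positive, Neutral, or Negative:\n"
        + row["input_text"]
    )

# Later, scoring compares each recorded model output to row["reference_label"].
```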
These traps can invalidate a study even when the math is correct. Avoiding them is part of evaluation quality.
Once you have metrics and a chart, your final job is interpretation: writing a conclusion that follows evidence, not vibes (Milestone 4). A responsible conclusion answers your research question, references your evaluation measures, and stays within what your small study can support.
A strong beginner conclusion has three parts: (1) the main result in one sentence, (2) the evidence, and (3) the boundary of the claim. Example structure: “On this dataset of 20 items, the AI approach improved average usefulness ratings from 3.1 to 3.8 while keeping percent correct roughly similar (baseline 15/20, AI 16/20). This suggests the AI method may help for drafting, but the sample is small and results varied by item type.”
Milestone 3 (one clear chart) should match your conclusion. If your chart shows only a tiny difference, your conclusion should not claim a big breakthrough. If you found trade-offs (faster but less accurate), say so plainly and tie it to your success definition.
Finally, list limitations and alternative explanations (Milestone 5). Limitations might include small sample size, subjective ratings by a single rater, narrow topic coverage, or a prompt that may not generalize. Alternative explanations might include: the AI benefited from clearer instructions than the baseline, the baseline was unusually weak, or the dataset items were easier than real-world inputs. Writing these does not weaken your work; it makes it trustworthy.
With a careful conclusion, a simple chart, and honest limitations, your mini study becomes a real piece of evidence—not just an opinion with numbers.
1. What is the main purpose of evaluation in this chapter’s approach?
2. Which evaluation strategy best matches the chapter’s guidance for beginner studies?
3. According to the chapter, what kind of calculations are typically enough for this stage of evaluation?
4. What is the recommended way to present your results visually?
5. Which conclusion best follows the chapter’s guidance about interpreting results?
You did the hard part: you turned curiosity into a small, testable study. Now you have to make it understandable. A mini study is only useful if someone else can quickly grasp (1) what you tried, (2) what you compared, (3) what you found, and (4) how confident they should be. In this chapter you’ll complete five practical milestones: drafting a one-page mini paper, creating a repeatable methods box, adding one figure and one table, preparing a two-minute explanation for non-technical listeners, and planning the next steps.
Think of writing and presenting as part of the research itself. Your report forces you to define terms, justify choices, and surface hidden assumptions. If you can’t explain your setup in plain language, you likely can’t evaluate it fairly. The goal is not to sound “academic,” but to be precise, honest, and easy to follow.
As you work, keep one principle in mind: readers should never have to guess what you did. Every key decision (dataset source, labeling rule, baseline, evaluation metric, and what counted as success) should appear somewhere in your paper, visuals, or methods box. When you make your study repeatable, you also make it credible.
Practice note for Milestone 1: Draft a one-page mini paper (title to references): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 2: Create a methods box anyone could repeat: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 3: Add one figure and one table with clear captions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 4: Prepare a 2-minute explanation for a non-technical audience: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Milestone 5: Plan next steps: improve data, design, or evaluation: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Milestone 1 is a one-page mini paper. One page is a feature, not a limitation: it forces prioritization. Use a simple, familiar structure so readers instantly know where to look. A practical template is: Title, Abstract, Introduction (1 short paragraph), Methods, Results, Discussion/Limitations, References.
Title: include the task, data source type, and comparison. Example: “Baseline vs. AI-Assisted Sentiment Labeling on 50 Public Product Reviews.” A good title prevents misunderstandings before they start.
Abstract (3–5 sentences): state the question, what you compared (baseline vs AI approach), what data you used (tiny dataset, where it came from), what metric you used, and the headline result. Avoid background history; make it a mini “executive summary.”
Methods: this is where most beginners under-write. Include: dataset size, inclusion/exclusion rules, how labels were created, what baseline did, what the AI approach did, and how you evaluated. If you did “no code” evaluation, say so and describe the manual steps.
Results: report the numbers and point readers to your one figure and one table (Milestone 3). Do not hide weak performance—clarity beats marketing. Provide at least one comparison that supports your claim (e.g., accuracy improved from 62% to 74% on your test set, or errors decreased on a specific category).
Discussion: interpret cautiously. Why might the AI approach be better (or worse)? What errors were common? What would you change next? This section is also where you note limitations: small sample size, potential label noise, narrow domain, and any prompt sensitivity.
Finish with 2–5 references: dataset source links, model/tool documentation, and any rubric/metric definitions you relied on. A tiny reference list is still a signal of care.
Clear writing is engineering judgment applied to language. Your reader can’t see your screen, your prompts, or your spreadsheet unless you describe them. Start by defining your terms the first time they appear. If you say “baseline,” specify it (e.g., “baseline = always predict the most frequent class” or “baseline = simple keyword rule”). If you say “AI approach,” name the tool and configuration (“ChatGPT, prompted to label each review as Positive/Neutral/Negative”).
Overclaiming is the most common credibility failure in beginner studies. Your dataset is tiny by design, so your conclusions must be appropriately scoped. Good claims are narrow and testable: “On this dataset of 50 reviews from X source, the AI-assisted labeling matched the human reference labels more often than the keyword baseline.” Bad claims are broad and vague: “This AI model is accurate” or “AI understands sentiment.”
Use uncertainty language without being evasive. Phrases like “suggests,” “in this small sample,” and “we observed” are not weaknesses; they are accurate. Also, avoid causal claims when you only ran a comparison. You can say “performed better,” not “caused improvement in real-world outcomes.”
Finally, separate observation from interpretation. “The AI approach mislabeled 6/12 sarcastic items” is an observation. “The model struggles with sarcasm due to lack of context” is an interpretation—label it as a hypothesis for next steps.
Milestone 3 is to add one figure and one table. The purpose is not decoration—it’s compression. A good visual lets a reader understand your key result in five seconds. Choose visuals that match your study type and remain readable at small sizes.
Your figure: pick one message and show it clearly. Common beginner-friendly options include: a bar chart comparing baseline vs AI metric (accuracy/F1), a small confusion matrix heatmap (if you have three classes), or an error breakdown chart (e.g., errors by category: sarcasm, long text, mixed sentiment). If you can’t chart, a clean diagram works: “Pipeline: Data → Baseline → AI Approach → Evaluation.”
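For a small three-class task like the sentiment example, a confusion matrix heatmap takes only a few lines of matplotlib. This optional sketch uses hypothetical (reference, AI) label pairs that you would replace with your own results.

```python
from collections import Counter
import matplotlib.pyplot as plt

classes = ["Positive", "Neutral", "Negative"]
# Hypothetical (reference_label, ai_label) pairs; replace with your own results.
pairs = [("Positive", "Positive"), ("Positive", "Neutral"), ("Neutral", "Neutral"),
         ("Neutral", "Positive"), ("Negative", "Negative"), ("Negative", "Negative")]

counts = Counter(pairs)
matrix = [[counts[(ref, pred)] for pred in classes] for ref in classes]

fig, ax = plt.subplots()
ax.imshow(matrix, cmap="Blues")
ax.set_xticks(range(len(classes))); ax.set_xticklabels(classes)
ax.set_yticks(range(len(classes))); ax.set_yticklabels(classes)
ax.set_xlabel("AI label"); ax.set_ylabel("Reference label")
for i in range(len(classes)):
    for j in range(len(classes)):
        ax.text(j, i, matrix[i][j], ha="center", va="center")
ax.set_title("Confusion matrix (AI approach)")
plt.tight_layout()
plt.savefig("confusion_matrix.png")
```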
Your table: tables are best for exact numbers and reproducibility details. A useful table is a “Results Table” with rows for each method and columns for metrics (accuracy, precision, recall, number of items). Another strong table is an “Example Table” with 3–5 representative items: the input text, reference label, baseline output, AI output, and a note about why it was difficult.
Milestone 2 (methods box) connects tightly with visuals: your figure shows the “what,” your methods box explains the “how.” When readers can trace a number in the chart back to a specific evaluation step, you’ve earned trust.
Even small studies require ethical habits. Transparency is not optional; it is part of correct reporting. Readers need to know what data you used, whether anyone could be harmed, and where AI may have introduced risks like privacy leakage or biased outputs.
Disclose your data source and permissions. State whether the dataset was public, synthetic, or collected by you. If you used public text, include the link and the date accessed. If you created data, explain the creation process. If there is any possibility the data contains personal information, state how you handled it (e.g., removed names, avoided private messages, used only content you are allowed to reuse).
Disclose AI tool details. Name the model/tool, version (if available), and key settings that could change results (temperature, top-p, system instructions, or any guardrails). If you used a hosted system, note that outputs may change over time.
Disclose human involvement. Who labeled the “reference” answers? Was it you alone? Did you double-check? If labels are subjective, say so and define the rubric. If you used AI to help label data, disclose that too—otherwise readers will assume a human gold standard.
Ethical writing also means being honest about limitations and failure cases. Reporting errors is not embarrassing; it’s how others learn when the approach is safe to use and when it is not.
Milestone 2 is your “methods box anyone could repeat,” and Milestone 5 is planning improvements—both depend on reproducibility. If someone cannot rerun your mini study, they cannot verify it or build on it. Your goal is “repeatable enough,” not industrial-grade engineering.
Create a bordered box (or clearly labeled subsection) titled “Methods Box (Reproducible Summary).” Keep it short but complete. Include the essentials: dataset source and size, inclusion/exclusion rules, how reference labels were created, the baseline procedure, the AI tool and settings (including the exact prompt), and the evaluation rule you applied.
Practical tip: include a tiny “Run Log” table with date, tool version (if available), and any changes you made between runs. Many “mysterious” differences come from silent changes in prompts, sampling, or data filtering.
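If you would rather keep the run log as a file next to your data, a short optional script can write it out; the entries below are hypothetical examples of the kinds of changes worth recording.

```python
import csv

# Hypothetical run log entries; record every change that could affect results.
run_log = [
    {"date": "2025-05-01", "prompt_version": "Prompt_v1",
     "tool_version": "hosted chat model (version unknown)",
     "change": "official run on all 50 items"},
    {"date": "2025-05-03", "prompt_version": "Prompt_v2",
     "tool_version": "hosted chat model (version unknown)",
     "change": "reworded instruction for mixed sentiment; treated as a new condition"},
]

with open("run_log.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["date", "prompt_version", "tool_version", "change"])
    writer.writeheader()
    writer.writerows(run_log)
```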
Common reproducibility mistakes include: changing the dataset mid-way, evaluating on the same examples you used to refine prompts, or leaving out exclusion criteria (which makes your sample look better than it is). If you tuned your prompt after seeing errors, say so and label the result as “iterated” rather than “one-shot.”
Your mini study becomes valuable when it travels well: a hiring manager, classmate, or stakeholder can understand it quickly and trust its boundaries. Milestone 4 is preparing a two-minute explanation for a non-technical audience. Use this structure: (1) the problem in everyday terms, (2) what you compared, (3) what you measured, (4) what you found, (5) what you would do next. Avoid model names unless asked; focus on the decision you’re enabling.
Example talk track: “I tested whether an AI assistant can label review sentiment better than a simple keyword rule on 50 public reviews. I measured agreement with a human reference label using accuracy. The AI approach improved accuracy by 12 points but still failed on sarcasm. Next I’d expand the dataset and add clearer labeling rules for mixed sentiment.” This is understandable, specific, and appropriately cautious.
To make it a portfolio piece, package three artifacts: the one-page mini paper (Milestone 1), the methods box (Milestone 2), and a single slide with your figure and table (Milestone 3). Add a short README explaining how to reproduce the run and what “success” meant.
Milestone 5 is planning next steps. Choose one improvement path based on your biggest bottleneck: better data (more items, more variety, cleaner labels), a stronger design (a fairer baseline, blind rating, locked prompts), or a sharper evaluation (a clearer rubric, a second rater, or an additional metric).
If you want to turn the mini study into a proposal, add a paragraph on impact (“who benefits and how”), resources (time, tools, data access), and risk controls (privacy, bias checks, safe usage constraints). A small, well-documented study is often more persuasive than a large, vague one—because it shows you can think clearly, measure honestly, and communicate results so others can act on them.
1. According to Chapter 6, a mini study is only useful if someone else can quickly grasp which set of four things?
2. Why does the chapter frame writing and presenting as part of the research itself?
3. What is the primary communication goal of the mini study report in Chapter 6?
4. Which principle should guide your writing so readers don’t have to guess what you did?
5. How does making your study repeatable affect its credibility, according to Chapter 6?