
IT Support to AI Ops: Monitor LLM Logs, Incidents & Costs

Career Transitions Into AI — Beginner

Go from help desk to AI ops by mastering LLM monitoring, incidents, and cost.

Beginner ai-operations · llm-observability · incident-management · log-analysis

Transition from IT Support to AI Operations

This course is a practical, book-style guide for people with IT support experience who want to move into AI operations (AI Ops) and production reliability work for LLM-powered applications. You’ll learn the day-to-day operational skills that teams need once a chatbot, agent, or LLM feature ships: logging, monitoring, alerting, incident response, and cost control.

Unlike generic “build an LLM app” training, this course focuses on what happens after launch—when users report failures, latency spikes, model providers rate-limit you, retrieval breaks, and costs creep beyond expectations. You’ll learn to treat LLM apps like real services: measurable SLIs/SLOs, clear runbooks, and budgets that leadership can trust.

What you’ll build (as deliverables)

By the end, you’ll have a complete blueprint for operating an LLM application in production. You’ll be able to create:

  • A logging schema for LLM requests (with correlation IDs and safe redaction)
  • Dashboards and alert rules for reliability, token usage, and quality proxies
  • Incident triage checklists and mitigation playbooks for common LLM failure modes
  • A cost budget model with guardrails, anomaly detection, and attribution
  • A lightweight governance plan for prompt/model changes and approvals

How the 6 chapters progress

Chapter 1 reframes your existing IT support instincts—triage, communication, ownership—and maps them directly to AI ops. You’ll learn what’s different in LLM systems and what success looks like in production (SLOs, error budgets, and cost as a first-class metric).

Chapter 2 gets concrete with logging: what to capture, how to avoid leaking sensitive data, and how to make logs useful for rapid troubleshooting. You’ll learn the minimum set of fields that make “I can’t reproduce it” turn into “here is the exact request path and failure point.”

Chapter 3 expands from logs to full observability. You’ll define metrics for latency and errors, but also the LLM-specific metrics that drive real incidents: token spikes, rate limits, and pipeline bottlenecks across retrieval, tools, and model calls. You’ll also learn alerting patterns that reduce noise while catching user-impacting issues early.

Chapter 4 turns signals into action. You’ll learn incident response tailored to LLM apps, including how to recognize provider outages vs prompt regressions vs retrieval failures, and how to apply mitigations like fallbacks, throttling, caching, and model switching. You’ll also practice postmortem structure that produces real improvements.

Chapter 5 brings FinOps discipline into AI operations. You’ll learn how to forecast and manage token costs, implement guardrails, detect anomalies, and attribute spend to teams and features. This chapter is key for career transitions, because many organizations now evaluate AI reliability and AI cost control together.

Chapter 6 ties everything together into production operations and career readiness: change management for prompts and model releases, access control and audit trails, KPI reporting, and a practical 30/60/90-day plan to move into an AI ops role. You’ll leave with portfolio-ready artifacts you can show in interviews.

Who this is for

  • Help desk, service desk, and IT support professionals ready to move into AI operations
  • Junior SREs or sysadmins asked to support an LLM feature in production
  • Career switchers who want a concrete, operations-first path into AI teams

Get started

If you want to operate LLM apps reliably—while keeping incidents calm and costs predictable—this is your playbook. Register free to start learning, or browse all courses to compare learning paths.

What You Will Learn

  • Map IT support workflows to AI operations responsibilities for LLM apps
  • Instrument LLM apps with structured logs, traces, and metrics for troubleshooting
  • Build dashboards and alerts for latency, errors, token usage, and quality signals
  • Run incident triage, escalation, and postmortems tailored to LLM failure modes
  • Detect prompt, retrieval, and model drift issues using operational signals
  • Set and manage token/cost budgets with guardrails, forecasts, and chargeback tags
  • Create actionable runbooks and SLIs/SLOs for AI services in production
  • Communicate incidents and cost risks clearly to stakeholders

Requirements

  • Basic IT support or help desk experience (or equivalent troubleshooting mindset)
  • Comfort using a terminal and reading JSON logs
  • Basic understanding of HTTP APIs (requests, status codes, latency)
  • Optional: familiarity with cloud monitoring tools (any vendor) is helpful but not required

Chapter 1: From IT Support to AI Operations (What Changes, What Transfers)

  • Identify transferable IT support skills and gaps for AI ops
  • Define the LLM app stack and its operational ownership boundaries
  • Set initial SLAs/SLIs for an LLM-powered feature
  • Create a starter on-call checklist and escalation map
  • Draft an AI ops readiness scorecard for a team

Chapter 2: Logging for LLM Apps (Signals You’ll Actually Need)

  • Design a log schema for prompts, tools, and retrieval without leaking secrets
  • Implement correlation IDs across requests and model calls
  • Build a basic query playbook for common user-reported issues
  • Establish log retention and redaction rules for compliance

Chapter 3: Observability: Metrics, Traces, Dashboards, and Alerts

  • Choose core metrics for reliability, quality proxies, and spend
  • Create a dashboard that answers the top 10 on-call questions
  • Set alert thresholds that reduce noise and catch real incidents
  • Validate alerting with synthetic checks and canaries
  • Document SLO error budgets tied to user impact

Chapter 4: Incident Response for LLM Systems (Triage to Postmortem)

  • Run LLM incident triage using a structured decision tree
  • Classify incidents by failure mode and pick the right mitigation
  • Write a clear incident update and stakeholder timeline
  • Complete a blameless postmortem with measurable follow-ups
  • Turn one incident into a hardened runbook and new monitors

Chapter 5: Cost Budgets and FinOps for LLM Apps (Keep Spend Predictable)

  • Model token-based costs and set monthly budgets per environment
  • Implement cost guardrails: limits, quotas, and feature flags
  • Detect cost anomalies and attribute spend to teams and features
  • Create a cost optimization backlog with ROI estimates
  • Publish a weekly cost report that drives action

Chapter 6: Operating in Production (Playbooks, Governance, and Career Moves)

  • Assemble a production-ready AI ops playbook and KPI set
  • Set governance for changes: prompt/model releases and approvals
  • Create a 30/60/90-day transition plan from IT support to AI ops
  • Build a portfolio artifact: dashboards, runbooks, and postmortem sample
  • Prepare for interviews with AI ops scenarios and metrics stories

Sofia Chen

AI Operations Engineer (LLM Observability & FinOps)

Sofia Chen is an AI Operations Engineer who builds monitoring and incident response programs for production LLM applications. She has supported platform teams across cloud, SRE, and FinOps initiatives, translating IT support skills into reliable AI services. Her teaching focuses on practical runbooks, measurable SLAs/SLOs, and cost-aware operations.

Chapter 1: From IT Support to AI Operations (What Changes, What Transfers)

Moving from IT support into AI operations (AI ops) is less of a leap than it looks. The job is still about restoring service, protecting users, and reducing repeat incidents. What changes is the “system” you operate: an LLM-powered feature is probabilistic, cost-sensitive, and safety-sensitive. Traditional IT incidents usually have deterministic root causes (a service is down, a dependency is slow, a certificate expired). LLM incidents can look like “it works, but it’s wrong,” “it works, but it’s risky,” or “it works, but it’s too expensive.”

This chapter maps familiar IT workflows to the responsibilities you’ll own in AI ops: instrumentation, dashboards, incident triage, escalation, and postmortems—tailored to LLM failure modes. You’ll also build the habit of defining service boundaries and ownership early, so your on-call rotation doesn’t become “everyone owns everything.” Finally, you’ll leave with practical artifacts: initial SLAs/SLIs, a starter on-call checklist and escalation map, and an AI ops readiness scorecard you can bring to your team.

As you read, keep one key mindset: in AI ops, you don’t only ask “Is the service up?” You ask “Is it producing safe, useful outputs within latency and cost limits?” The rest of the course will show how to monitor that with structured logs, traces, and metrics—starting here with what changes and what transfers.

Practice note for each milestone above (identifying transferable skills, defining the LLM app stack and its ownership boundaries, setting initial SLAs/SLIs, creating a starter on-call checklist and escalation map, and drafting a readiness scorecard): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 1.1: AI ops roles vs help desk, sysadmin, and SRE

AI ops sits at the intersection of help desk habits (ticket hygiene, user empathy, clear communications), sysadmin instincts (dependency mapping, configuration control), and SRE practices (SLIs/SLOs, error budgets, automation). If you’ve worked a queue, you already know the backbone of operations: categorize issues, reproduce quickly, communicate status, and close the loop with prevention.

What’s different is the “shape” of incidents. In help desk, the user says “I can’t log in.” In AI ops, the user says “the assistant gave me a confident but incorrect answer,” or “it exposed internal information,” or “it started timing out after we added a new retrieval source.” These are not just bugs; they are behavior and policy failures. AI ops therefore tends to include responsibilities that resemble product operations and risk management: setting guardrails, monitoring quality signals, and enforcing cost budgets.

Transferable skills you should lean on:

  • Intake and triage discipline: ask for time range, impact, reproduction steps, and environment; in AI ops, also ask for prompt, conversation ID, model version, and retrieval context.
  • Change awareness: correlate incidents with deployments, config toggles, model/provider changes, prompt updates, and data refreshes.
  • Communication: clear updates, ETA honesty, and stakeholder-specific summaries (support, engineering, security, leadership).

Typical gaps to close:

  • Observability depth: you’ll need structured event logging, distributed traces across API/tool calls, and metric design for tokens, cost, and safety.
  • Probabilistic troubleshooting: reproduce using stored prompts and deterministic sampling settings where possible, but accept that some failures are distribution shifts rather than single defects.
  • Boundary-setting: define what your team owns versus the model provider, the vector database team, or the platform team. Without this, escalation turns into blame ping-pong.

A practical outcome for your transition plan: write your “role translation” in one paragraph. Example: “I will run LLM feature reliability by owning observability, on-call triage, and incident coordination; engineering owns code fixes; security owns policy and data classification; vendor management owns provider escalation.” That sentence becomes the seed for runbooks and RACI later in this chapter.

Section 1.2: Anatomy of an LLM application (UI, API, tools, retrieval, model)

You can’t operate what you can’t draw. A useful mental model for LLM apps is a stack with clear boundaries and failure points. Start with five layers: UI, API/service, tools, retrieval, and the model. Most incidents can be localized to one layer plus a dependency (network, auth, rate limits).

UI: web/mobile chat, IDE plugin, or embedded widget. Common ops signals: client-side errors, conversation resets, dropped streaming connections, and latency that users perceive as “hangs.” You often don’t “own” the UI in AI ops, but you need enough telemetry (request IDs, client timing) to separate UI failures from backend issues.

API/service layer: the orchestration service that receives prompts, applies policy, calls retrieval/tools, and sends completions. This is where you should enforce structured logging and correlation IDs. A practical baseline is to log: request_id, user_id (or hashed), model, prompt_template_version, tool_plan, retrieval_top_k, token counts, provider status, and final outcome (success/blocked/fallback).
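The baseline fields above can be assembled as one structured record per request. A minimal Python sketch, with illustrative field values; the helper name and the choice to hash the user ID are assumptions, not a prescribed schema:

```python
import hashlib
import json
import time
import uuid

def build_llm_log_record(user_id, model, prompt_template_version,
                         tool_plan, retrieval_top_k,
                         prompt_tokens, completion_tokens,
                         provider_status, outcome):
    """Assemble one structured log event for an LLM request.

    Field names follow the baseline above; the user ID is hashed so
    raw identifiers never reach the log store.
    """
    return {
        "timestamp": time.time(),
        "request_id": str(uuid.uuid4()),
        "user_id_hash": hashlib.sha256(user_id.encode()).hexdigest()[:16],
        "model": model,
        "prompt_template_version": prompt_template_version,
        "tool_plan": tool_plan,
        "retrieval_top_k": retrieval_top_k,
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "provider_status": provider_status,
        "outcome": outcome,  # "success" | "blocked" | "fallback"
    }

record = build_llm_log_record(
    user_id="u-1234", model="example-model",
    prompt_template_version="v12", tool_plan=["lookup_ticket"],
    retrieval_top_k=5, prompt_tokens=812, completion_tokens=143,
    provider_status=200, outcome="success",
)
print(json.dumps(record))
```

Emitting one JSON object per request like this makes every field queryable later, which is what turns “I can’t reproduce it” into a filtered search.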

Tools: function calls (billing lookup, ticket creation, database query) or workflow engines. Tool failures create “weird” user-visible behavior: the model may proceed without the tool result, retry excessively, or produce a misleading answer. Operationally, you treat tools like microservices: per-tool error rate, latency, timeout counts, and permission denials, all tied back to the original request.

Retrieval: vector store, keyword search, reranking, and document pipelines. Retrieval failures manifest as outdated answers, missing citations, or sudden drops in relevance after a data refresh. Ownership boundaries matter: who owns embedding generation, indexing cadence, document ACL enforcement, and vector DB capacity? If you can’t name an owner, you’ve found a future incident.

Model/provider: hosted API or self-hosted model. Provider changes (model version updates, safety policy shifts, rate limits) can cause silent behavior changes. AI ops should track: provider error codes, throttling, latency distribution, and which model version served each request.

Practical exercise: draw your stack on one page and annotate what you can control versus what you can only observe. That boundary becomes your escalation map. Many teams skip this and end up with on-call responders who can’t answer the basic question: “Is this our bug, our data, or the provider?”

Section 1.3: Operational risks unique to LLMs (hallucinations, tool misuse, data leakage)

LLM operations introduces risks that don’t fit classic “up/down” monitoring. Three categories deserve explicit attention from day one: hallucinations (incorrect but plausible outputs), tool misuse (unsafe actions via function calls), and data leakage (exposing sensitive data in prompts, retrieval, or outputs). These are operational problems because they can be detected, triaged, mitigated, and prevented with the same rigor as outages—if you instrument the right signals.

Hallucinations: The service may be available and fast, but user trust collapses if answers are wrong. Operational signals include: spike in user “thumbs down,” increased follow-up questions like “are you sure?”, drop in citation rate, or mismatch between retrieved sources and final answer. A common mistake is treating hallucination as “just model behavior” and not building feedback and sampling pipelines. In practice, AI ops should ensure that low-rated conversations are captured (with privacy controls) and that a triage workflow exists: classify whether the root cause is retrieval gap, prompt regression, tool error, or model change.

Tool misuse: When the model can call tools, you have a new failure mode: correct language paired with incorrect actions. Examples: creating duplicate tickets, querying the wrong account, or attempting privileged operations. You need guardrails such as allowlists, schema validation, and “confirmation required” steps for high-impact tools. Operationally, log every tool call with parameters (redacted as needed), authorization result, and a reason field indicating the model’s stated intent. An incident may be triggered not by errors, but by abnormal patterns (tool call volume spike, unusual parameter shapes, repeated retries).
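The guardrails named above (allowlists, schema validation, confirmation for high-impact tools) can be combined into one authorization gate. A minimal sketch; the tool names, parameter sets, and confirmation flags are hypothetical:

```python
ALLOWED_TOOLS = {
    # tool name -> (required params, requires human confirmation)
    "lookup_ticket": ({"ticket_id"}, False),
    "create_ticket": ({"summary", "priority"}, False),
    "close_account": ({"account_id"}, True),  # high impact: confirm first
}

def authorize_tool_call(tool_name, params, confirmed=False):
    """Return (allowed, reason) for a model-proposed tool call.

    Deny by default: unknown tools, missing required parameters, and
    unconfirmed high-impact calls are all rejected with a logged reason.
    """
    if tool_name not in ALLOWED_TOOLS:
        return False, f"tool '{tool_name}' not on allowlist"
    required, needs_confirmation = ALLOWED_TOOLS[tool_name]
    missing = required - set(params)
    if missing:
        return False, f"missing required params: {sorted(missing)}"
    if needs_confirmation and not confirmed:
        return False, "confirmation required for high-impact tool"
    return True, "ok"
```

The returned reason string doubles as the “reason field” worth logging with every tool call, so abnormal patterns (denials, retries) are visible in aggregate.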

Data leakage: Leakage can occur through prompts (users paste secrets), retrieval (documents without proper ACLs), or outputs (model repeats sensitive context). This is where AI ops intersects security operations. A practical starting point is to classify data and implement redaction: mask secrets in logs, prevent certain fields from being added to prompts, and tag requests that touched sensitive sources. Alerts should exist for policy violations: detected secrets, access-denied retrieval attempts, or outputs flagged by safety filters.
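A first redaction pass over text headed for logs might look like the sketch below. The patterns are illustrative and deliberately incomplete; production systems should use a vetted secret scanner driven by your data classification rules:

```python
import re

# Illustrative patterns only: email addresses, card-like digit runs, and
# key/secret/token assignments. Real deployments need a maintained scanner.
REDACTION_PATTERNS = [
    (re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"), "<EMAIL>"),
    (re.compile(r"\b(?:\d[ -]?){13,16}\b"), "<CARD_NUMBER>"),
    (re.compile(r"(?i)\b(api[_-]?key|secret|token)\b\s*[:=]\s*\S+"), r"\1=<REDACTED>"),
]

def redact(text):
    """Mask likely secrets/PII before the text reaches a log store."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```

Running this at the logging boundary (rather than in each caller) keeps the rule set in one place and makes it auditable.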

Engineering judgment: don’t aim for “perfect safety” immediately; aim for detectability and containment. For example, implement safe defaults (deny by default for tools, retrieval scoped by user identity), then add monitoring and playbooks. Teams often over-invest in one-time prompt tweaks and under-invest in ongoing detection—leading to repeated incidents when data or model behavior shifts.

Section 1.4: SLIs/SLOs for LLM apps (latency, success, safety, cost)

Traditional SLAs emphasize uptime and response time. LLM features require a broader set of service level indicators (SLIs) because “bad but fast” is still a failure. Define SLIs first, then set realistic SLO targets that align with user expectations and your budget. Early on, pick a small number you can actually measure reliably.

Core SLIs to establish for an LLM-powered feature:

  • Latency: end-to-end (user click to first token) and completion time. Track p50/p95/p99, and separate provider latency from your orchestration latency.
  • Success rate: percent of requests that produce a completed response (not error, not timeout). Also track “fallback success” if you use alternative models or cached answers.
  • Safety/compliance rate: percent of responses that pass policy checks (PII, prohibited content), plus rate of blocks/deflections. A spike in blocks can be a regression just as much as a spike in unsafe outputs.
  • Cost and token usage: tokens per request, tokens per successful response, and cost per 1k requests. Include separate metrics for prompt, completion, and retrieval context tokens.

Set initial SLOs in tiers. Example for a customer-facing assistant: p95 time-to-first-token < 1.5s, overall success rate > 99%, safety pass rate > 99.9% with clearly defined policy scope, and average cost per conversation < $0.03. Your numbers will differ; the key is to define them explicitly and attach an owner.
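The example tier targets can be checked mechanically against a batch of measurements. A sketch assuming the illustrative thresholds above (p95 time-to-first-token < 1.5s, success rate > 99%, cost per conversation < $0.03); the percentile method is a simple nearest-rank approximation:

```python
def percentile(values, pct):
    """Nearest-rank percentile approximation over a sample."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(round(pct / 100 * (len(ordered) - 1))))
    return ordered[index]

def evaluate_slos(ttft_seconds, successes, total, cost_per_conv):
    """Check one reporting window against the example SLO targets."""
    return {
        "ttft_p95_ok": percentile(ttft_seconds, 95) < 1.5,
        "success_rate_ok": successes / total > 0.99,
        "cost_ok": cost_per_conv < 0.03,
    }
```

A check like this, run per reporting window, is also the raw material for error-budget burn tracking later in the course.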

Common mistakes:

  • Choosing unmeasurable quality SLIs: “answers must be correct” is not an SLI until you define a proxy (ratings, audits, citation coverage, task completion).
  • Mixing product goals with reliability targets: keep SLIs operational and observable; keep “increase conversions” separate.
  • Ignoring budgets: LLM cost is a reliability dimension. If you exceed budget, you will be forced into emergency throttling that feels like an incident.

Practical outcome: document your first SLI set and a draft SLA statement in plain language. Example: “During business hours, the assistant will respond successfully within 3 seconds for 95% of requests; if degraded, it will fall back to a smaller model while maintaining safety filters.” This gives on-call responders permission to use mitigations that preserve safety and cost.

Section 1.5: On-call foundations: severity levels, paging, comms templates

On-call for LLM apps works best when you treat it as a repeatable process, not heroics. Start with severity levels that reflect user impact and risk. A hallucination that leaks data can be a higher severity than a 2-minute partial outage, depending on your domain.

A practical severity model:

  • SEV-1: security/safety breach (data leakage), widespread outage, or high-impact tool misuse (unauthorized actions). Immediate paging, incident commander assigned, security engaged.
  • SEV-2: major degradation (p95 latency doubled, high error rate, retrieval returning empty for many users), or sustained cost spike that threatens budget.
  • SEV-3: limited impact issues (single tenant, certain prompt path, isolated tool failures) with workaround.
  • SEV-4: minor defects, questions, and monitoring alerts with no user impact.

Paging rules should be tied to SLO burn or clear thresholds: error rate, timeouts, provider throttling, safety violations, and cost anomalies. Avoid “alert fatigue” by ensuring each page requires action. For example, page on sustained p95 latency breach over 10 minutes, not a single spike.
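The “sustained breach, not a single spike” rule can be implemented as a rolling-window check. A minimal sketch; the window length and threshold are illustrative:

```python
from collections import deque

class SustainedBreachAlert:
    """Page only when a metric breaches its threshold for a full window.

    With one sample per minute and window_size=10, this models "page on
    a sustained p95 latency breach over 10 minutes, not a single spike."
    """

    def __init__(self, threshold, window_size):
        self.threshold = threshold
        self.samples = deque(maxlen=window_size)

    def observe(self, value):
        """Record one sample; return True when the window is full and
        every sample in it breaches the threshold (time to page)."""
        self.samples.append(value)
        full = len(self.samples) == self.samples.maxlen
        return full and all(v > self.threshold for v in self.samples)
```

Because the window must be completely full of breaching samples, an isolated spike followed by recovery never fires a page.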

Use communication templates to reduce cognitive load. At minimum, maintain:

  • Status update template: “What’s happening, who’s impacted, when it started, current mitigation, next update time.”
  • Executive summary template: one paragraph focusing on business impact and risk (including data exposure assessment).
  • User-facing note template: simple language, avoids internal blame, offers workaround or expectation.

Starter on-call checklist for LLM incidents:

  • Confirm scope: endpoints, tenants, geos, model version, recent deployments/prompt changes/data refreshes.
  • Check provider status, rate limits, and error codes; compare to internal error metrics.
  • Inspect a few representative request traces: prompt size, retrieval results, tool call sequence, token counts, safety filter outcomes.
  • Apply safe mitigations: enable fallback model, reduce context window, disable risky tools, tighten retrieval scope, or temporarily increase caching.
  • Capture incident artifacts: request IDs, timestamps, dashboard snapshots, and a short timeline for postmortem.

Engineering judgment: default to mitigations that reduce risk first (disable tool actions, enforce stricter policies), then optimize for latency and cost. In LLM ops, “keeping it running” is not success if it’s unsafe.

Section 1.6: Minimal viable runbooks and ownership/RACI

Runbooks are where AI ops becomes scalable. Your goal in week one is not a perfect encyclopedia; it’s a minimal viable runbook that helps a new on-call responder make safe decisions in 10 minutes. Pair that with a clear ownership model (RACI) so incidents don’t stall.

Minimal runbooks to create for an LLM feature:

  • Provider/API failure: how to verify provider outage vs internal issue, when to switch models/regions, how to handle rate limits, and how to contact vendor support.
  • Retrieval degradation: how to detect empty/low-relevance retrieval, validate index freshness, roll back a data pipeline, and enforce ACL checks.
  • Tooling incident: how to disable specific tools, rotate credentials, validate permissions, and replay requests safely.
  • Safety incident: steps to preserve evidence, engage security/legal, increase filtering, and scope exposure (which users, which documents, what timeframe).
  • Cost spike: identify top cost drivers (model, tenant, prompt version), apply token caps, reduce context, and enable caching/cheaper fallback.

Define ownership boundaries explicitly using a lightweight RACI:

  • Responsible: on-call AI ops engineer runs triage, comms, and mitigation.
  • Accountable: LLM feature owner (engineering lead or product/platform owner) approves risky mitigations and owns follow-up work.
  • Consulted: security (data leakage), data/platform team (retrieval/index), vendor manager (provider escalation).
  • Informed: support team, customer success, leadership, and affected product teams.

Draft an AI ops readiness scorecard to assess whether a team is prepared to run an LLM feature in production. Keep it simple and actionable. Example categories (score 0–2 each): observability coverage (logs/traces/metrics), SLI/SLO defined, runbooks exist, on-call rotation staffed, escalation contacts verified, safety controls implemented, cost budgets and chargeback tags configured, and postmortem process in place. A team with a low score should delay launch or reduce scope (disable tools, limit retrieval sources) until basic controls exist.
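The scorecard can be tallied mechanically. A sketch assuming the categories listed above, each scored 0–2, and a hypothetical 60% cutoff for the launch recommendation:

```python
CATEGORIES = [
    "observability_coverage", "sli_slo_defined", "runbooks_exist",
    "on_call_staffed", "escalation_verified", "safety_controls",
    "cost_budgets", "postmortem_process",
]

def readiness(scores):
    """Total a 0-2-per-category scorecard and suggest a launch decision.

    The 60% threshold is an illustrative cutoff, not a standard.
    """
    if set(scores) != set(CATEGORIES):
        raise ValueError("score every category exactly once")
    total, maximum = sum(scores.values()), 2 * len(CATEGORIES)
    ready = total >= 0.6 * maximum
    return {"total": total, "max": maximum,
            "recommendation": "proceed" if ready else "delay or reduce scope"}
```

Requiring every category to be scored keeps the assessment honest: a team can’t quietly skip the dimension it hasn’t built.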

Common mistake: assigning ownership by component name (“platform owns the model”) rather than by operational outcome (“who is on the hook when answers are wrong or unsafe?”). In AI ops, user impact crosses layers; your RACI should reflect that reality while still giving responders a clear path to action.

Chapter milestones
  • Identify transferable IT support skills and gaps for AI ops
  • Define the LLM app stack and its operational ownership boundaries
  • Set initial SLAs/SLIs for an LLM-powered feature
  • Create a starter on-call checklist and escalation map
  • Draft an AI ops readiness scorecard for a team
Chapter quiz

1. Which statement best captures what changes when moving from traditional IT support to AI operations for an LLM-powered feature?

Correct answer: The system is probabilistic, cost-sensitive, and safety-sensitive, so incidents may be “wrong,” “risky,” or “too expensive” even if the service is up
The chapter emphasizes that the core mission remains, but LLM features introduce probabilistic behavior plus cost and safety constraints.

2. Compared to traditional IT incidents, which type of incident is more characteristic of LLM operations?

Correct answer: The feature responds, but the output is unsafe or incorrect
LLM incidents can present as outputs that are wrong or risky even when the system is technically responding.

3. Why does the chapter stress defining service boundaries and ownership early in AI ops?

Correct answer: To avoid on-call devolving into “everyone owns everything” and to make escalation clearer
Clear boundaries and ownership prevent confusion during incidents and support effective escalation.

4. What mindset does the chapter recommend adopting for AI ops monitoring and incident response?

Correct answer: Ask whether outputs are safe and useful within latency and cost limits, not just whether the service is up
AI ops expands the “is it up?” question to include safety, usefulness, latency, and cost.

5. Which set of artifacts does the chapter say you should leave with to operationalize an LLM-powered feature?

Correct answer: Initial SLAs/SLIs, a starter on-call checklist and escalation map, and an AI ops readiness scorecard
The chapter lists these concrete artifacts as practical outputs to bring back to your team.

Chapter 2: Logging for LLM Apps (Signals You’ll Actually Need)

In IT support, logs are your witness statements: they tell you what happened, when, and to whom. In LLM operations, logs are still your primary evidence—but the “incident surface” is broader. A single user request can trigger multiple model calls, retrieval lookups, tool executions, and safety filters, all while burning tokens and interacting with sensitive data. This chapter focuses on logging signals that are genuinely useful in production: signals that help you troubleshoot quickly, protect privacy, and control cost.

The goal is not “log everything.” The goal is to log the smallest set of structured events that lets you (1) reconstruct a timeline, (2) diagnose common failure modes (timeouts, tool errors, retrieval misses, prompt regressions), and (3) quantify impact (latency, error rate, token usage, and downstream quality indicators). As you read, map each practice to familiar IT support responsibilities: ticket triage, escalation with evidence, and postmortems with actionable findings.

You will design a log schema that covers prompts, tools, and retrieval without leaking secrets; implement correlation IDs across requests and model calls; build a simple query playbook for common user-reported issues; and define retention/redaction rules that satisfy compliance while keeping enough data to operate the system.

Practice note for Design a log schema for prompts, tools, and retrieval without leaking secrets: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement correlation IDs across requests and model calls: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Build a basic query playbook for common user-reported issues: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Establish log retention and redaction rules for compliance: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 2.1: Structured logging essentials (JSON fields, levels, events)

LLM apps fail in ways that plain text logs can’t explain. Free-form messages like “model call failed” force humans to guess what mattered: which model, which prompt version, what tokens, what tool call, what user context. Structured logs—typically JSON—turn those guesses into fields you can query, aggregate, and alert on.

Start with an “event-first” mindset. Instead of writing long messages, emit discrete events with consistent names, such as request.received, llm.call.started, llm.call.completed, rag.retrieve.completed, tool.call.failed, response.sent. Each event should be a single row/document in your log store, with fields that remain stable over time.

  • Core fields: timestamp, level, service, env, event, message (short), and correlation IDs (covered in 2.2).
  • Operational fields: latency_ms, status/status_code, error_type, retry_count, queue_time_ms.
  • Cost fields: model, provider, prompt_tokens, completion_tokens, total_tokens, estimated_cost_usd, and budget_tag/cost_center.
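The event-first pattern above can be sketched as a tiny emitter. This is a minimal illustration, not a production logger: the service name, model name, and cost figures are hypothetical, and correlation IDs (covered in 2.2) are stubbed with a generated UUID.

```python
import json
import time
import uuid

def log_event(event, level="INFO", **fields):
    """Emit one structured event as a single JSON line.
    Field names (latency_ms, total_tokens, ...) follow the schema above."""
    record = {
        "timestamp": time.time(),
        "level": level,
        "service": "chat-api",   # assumption: illustrative service name
        "env": "prod",
        "event": event,
        # Placeholder; 2.2 covers propagating real correlation IDs.
        "request_id": fields.pop("request_id", str(uuid.uuid4())),
        **fields,
    }
    print(json.dumps(record))
    return record

# Example: a completed model call with operational and cost fields.
evt = log_event(
    "llm.call.completed",
    latency_ms=842,
    model="gpt-4o-mini",          # assumption: illustrative model name
    prompt_tokens=512,
    completion_tokens=128,
    total_tokens=640,
    estimated_cost_usd=0.0011,
    budget_tag="support-bot",
)
```

Because each event is one JSON document with stable field names, the same query that counts `llm.call.completed` today still works after you add new fields next sprint.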

Use log levels deliberately. Overusing ERROR creates alert fatigue; overusing DEBUG explodes volume and cost. A practical guideline: INFO for state transitions you rely on to rebuild a timeline; WARN for degraded behavior (fallback model, partial retrieval, near-timeout); ERROR for failures that break a user flow; DEBUG for short-term investigations behind a feature flag or sampling rule.

Common mistakes: embedding entire prompts in a message string (unqueryable and risky), changing field names every sprint (“promptVersion” vs “prompt_version”), and logging raw exceptions without context. Treat your schema as an interface contract: version it, document it, and add fields without breaking existing queries.

Section 2.2: Correlation: request_id, session_id, trace_id, conversation_id

Correlation IDs are how you turn “a user complained” into a precise sequence of events. In classic IT support, you might track a ticket number across systems. In LLM apps, you need several identifiers because one user interaction can span multiple layers (frontend, API gateway, orchestrator, model provider, retrieval service, tool runtimes).

  • request_id: unique per inbound HTTP request. Every log line emitted while handling the request must include it. Generate at the edge if missing, and pass downstream.
  • trace_id: end-to-end distributed tracing identifier (OpenTelemetry style). A single request usually maps to one trace; spans capture sub-operations like retrieval and model calls.
  • session_id: represents an authenticated user session (or anonymous browser session). Useful for detecting repeated failures for a cohort without needing PII.
  • conversation_id: persistent ID for an LLM chat thread. This is crucial: multiple requests (turns) belong to one conversation, and many user issues (“it keeps forgetting”) are conversation-scoped rather than request-scoped.

Implement correlation in two steps. First, standardize propagation: put IDs in inbound headers, store them in request context, and ensure every service copies them to outbound calls (retrieval, tools, model provider). Second, standardize logging: your logger should automatically inject request_id, trace_id, and conversation_id into every event so engineers don’t forget.
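A common way to get the second step for free in Python services is `contextvars`: set the IDs once when the request arrives, and let the logger inject them into every event. A minimal sketch, assuming the header names `x-request-id` and `x-conversation-id` (your gateway may use different ones):

```python
import contextvars
import json
import uuid

# Context variables hold the correlation IDs for the current request.
request_id_var = contextvars.ContextVar("request_id", default=None)
conversation_id_var = contextvars.ContextVar("conversation_id", default=None)

def start_request(headers):
    """Adopt inbound IDs if present; generate at the edge if missing."""
    request_id_var.set(headers.get("x-request-id") or str(uuid.uuid4()))
    conversation_id_var.set(headers.get("x-conversation-id"))

def log_event(event, **fields):
    """Every event automatically carries the correlation IDs,
    so individual engineers can't forget to include them."""
    record = {
        "event": event,
        "request_id": request_id_var.get(),
        "conversation_id": conversation_id_var.get(),
        **fields,
    }
    print(json.dumps(record))
    return record

start_request({"x-request-id": "req-123", "x-conversation-id": "conv-9"})
evt = log_event("llm.call.started", model="gpt-4o-mini")
```

The same context values should also be copied onto outbound headers for retrieval, tool, and provider calls, which handles propagation.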

Engineering judgment matters here: avoid overloading a single ID to represent everything. If you only log request_id, you can’t join turns across a conversation. If you only log conversation_id, you can’t isolate one failing request among many. The practical outcome is faster triage: you can pull all events for a trace to see exactly where latency spiked, then pivot to the conversation to see if the issue is systemic across turns.

Section 2.3: What to log in LLM flows (prompt versions, tool calls, RAG metadata)

LLM applications are pipelines. Logging needs to reflect pipeline stages so you can answer: “Did retrieval return the wrong context?”, “Did the tool fail?”, “Did a prompt change cause a regression?”, and “Did a provider throttling event trigger retries?” The best practice is to log metadata that characterizes each stage, not the entire content payload.

Prompting: Log a prompt_version (or template ID + git commit), system_prompt_id, policy_profile (e.g., safety ruleset), and prompt_render_ms. If you do A/B tests, log experiment_id and variant. Instead of raw prompt text, store a prompt_hash so you can group by prompt shape without exposing content. This directly supports the lesson: design a log schema for prompts without leaking secrets.
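The prompt_hash idea is a one-liner. The sketch below shows a metadata-only prompt event; the template ID, experiment ID, and prompt text are all hypothetical:

```python
import hashlib

def prompt_hash(rendered_prompt):
    """Group events by prompt shape without storing the content itself."""
    return hashlib.sha256(rendered_prompt.encode("utf-8")).hexdigest()[:16]

# Metadata-only prompt event: version, experiment, and hash, never raw text.
event = {
    "event": "llm.call.started",
    "prompt_version": "support-answer@v12",   # assumption: illustrative ID
    "experiment_id": "exp-42",
    "variant": "B",
    "prompt_render_ms": 3,
    "prompt_hash": prompt_hash("You are a support assistant. Context: ..."),
}
```

Identical rendered prompts always hash to the same value, so you can group regressions by prompt shape while the event itself contains no prompt content.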

Model calls: For each llm.call.* event, log model, temperature, max_tokens, stop_reason (length, stop sequence, tool call), token counts, provider response codes, and retry/backoff metadata. These fields let you distinguish “model is slow” from “we are retrying due to rate limits.”

Tool calls (function calling): Emit events for tool.call.started and tool.call.completed with tool_name, tool_version, latency_ms, result_size_bytes, and error_type. Log arguments carefully: keep a redacted args_summary or a hash, and store structured fields for safe items (e.g., record_count, query_type).

RAG (retrieval): Log retriever, index_name, top_k, filters (redacted), doc_ids (or hashed IDs), scores (buckets or min/max), and context_tokens. Include retrieval_latency_ms and whether you used a fallback. These metadata fields are what you need to detect retrieval misses, stale indexes, and “good answer but wrong sources” incidents without storing the underlying documents in logs.

Section 2.4: PII/secrets handling: redaction, hashing, allowlists/denylists

LLM logs are high-risk because they tend to attract sensitive inputs: customer questions, account identifiers, pasted emails, access tokens, even private documents pulled via retrieval. Treat log safety as a first-class engineering feature, not a documentation footnote. The objective is to keep logs useful for operations while preventing accidental data exposure and meeting compliance requirements.

Use a layered approach:

  • Do not log by default: raw user messages, raw retrieved passages, file uploads, and tool arguments that may contain secrets. Store references, hashes, and safe summaries.
  • Redaction: apply pattern-based redaction for obvious secrets (API keys, JWTs, OAuth tokens) and PII patterns (emails, phone numbers). Redact before the event hits the logger, not after ingestion.
  • Allowlists over denylists: for tool arguments and metadata, explicitly allow safe fields. Denylists miss new secret formats; allowlists stay conservative.
  • Hashing: when you need grouping (e.g., “same email keeps failing”), store sha256(normalized_value + salt). This allows correlation without revealing the original value. Rotate salts per environment to reduce risk.
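The layered approach can be sketched in a few lines. The secret patterns, salt, and allowlisted field names below are illustrative assumptions; real deployments need broader pattern sets and per-environment salt management.

```python
import hashlib
import re

SAFE_TOOL_FIELDS = {"record_count", "query_type", "latency_ms"}  # allowlist
SECRET_PATTERNS = [
    re.compile(r"sk-[A-Za-z0-9]{20,}"),        # API-key-like strings
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),    # email addresses
]
SALT = "env-specific-salt"  # assumption: rotate per environment

def redact(text):
    """Pattern-based redaction, applied BEFORE the event hits the logger."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text

def stable_hash(value):
    """Correlate repeats (e.g. 'same email keeps failing') without storing them."""
    normalized = value.strip().lower()
    return hashlib.sha256((normalized + SALT).encode()).hexdigest()[:16]

def safe_tool_args(args):
    """Allowlist: only explicitly safe fields survive into logs."""
    return {k: v for k, v in args.items() if k in SAFE_TOOL_FIELDS}

msg = redact("user sk-abcdefghijklmnopqrstuv wrote from alice@example.com")
```

Note the allowlist shape: a new tool argument is invisible in logs until someone deliberately marks it safe, which is exactly the conservative failure mode you want.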

Common mistakes: “we’ll redact later in the log pipeline” (too late—data already replicated), logging exception objects that include request bodies, and copying prompts into tracing spans. Also watch for vendor SDKs that automatically log payloads at debug level; lock those settings down in production.

Make compliance practical: define a small set of data classes (Public, Internal, Confidential, Restricted) and map each log field to one class. Then enforce rules: Restricted fields never leave the request boundary; Confidential fields must be hashed or redacted; Internal fields can be logged with retention limits. The outcome is that incident responders can still answer “what broke?” without needing to see “what the user typed.”

Section 2.5: Log storage and retention patterns (hot/warm/cold, sampling)

Logging is not free. LLM apps can generate high-volume events (multiple spans per request) and token/cost telemetry you’ll want to keep long enough for budgeting and drift detection. You need storage patterns that balance speed, cost, and compliance.

A practical model is hot/warm/cold retention:

  • Hot (hours to days): full-fidelity logs and traces for active debugging. Fast queries, indexed fields, and minimal latency. Keep enough to cover typical incident detection and response windows.
  • Warm (weeks): reduced fidelity—drop verbose debug events, keep core lifecycle events and aggregated metrics. Useful for trend analysis, recurring incidents, and prompt/tool regression investigations.
  • Cold (months): cheapest storage for compliance and audits, typically with heavy sampling or summarized rollups. Access is slower and should be rare.

Sampling is your main volume control lever. Use head-based sampling for traces (sample a percentage of requests), but apply tail-based rules to keep high-value events: always retain errors, timeouts, high-latency outliers (p95/p99), and requests that exceed token thresholds. For logs, consider per-event sampling (e.g., keep 100% of tool.call.failed, 10% of llm.call.completed when successful). Ensure sampling decisions are logged (e.g., sampled=true) so analysts don’t misinterpret missing data.
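A keep/drop decision combining those rules might look like the sketch below; the event names, latency threshold, and per-event rates are illustrative assumptions.

```python
import random

RATES = {"llm.call.completed": 0.10, "tool.call.failed": 1.0}

def keep_event(event, sample_rates):
    """Tail-style retention: always keep high-value events, sample the rest."""
    if event.get("error_type") or event.get("status_code", 200) >= 400:
        return True                        # always keep failures
    if event.get("latency_ms", 0) > 4000:
        return True                        # keep slow outliers (the p95/p99 tail)
    rate = sample_rates.get(event["event"], 1.0)
    return random.random() < rate          # probabilistic keep for the rest

kept_error = keep_event({"event": "llm.call.completed", "error_type": "timeout"}, RATES)
kept_slow = keep_event({"event": "llm.call.completed", "latency_ms": 9000}, RATES)
```

Whatever the decision, stamp the retained events with something like `sampled=true` and the effective rate, so an analyst counting events later can scale the numbers correctly.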

Retention rules must align with redaction rules. If you truly must store some content (for example, a small excerpt for a regulated workflow), isolate it in a separate store with stricter access controls, shorter retention, and audit trails. The practical outcome: you can run cost-effective operations without discovering mid-incident that your logs are either too expensive to keep or too sparse to be useful.

Section 2.6: Debug queries: reproducing failures from logs and timelines

Once you have structured events and correlation IDs, the next operational skill is building a query playbook. In IT support terms, this is your “known-good diagnostic checklist” for the top user complaints. Your goal is to reconstruct a timeline, identify the failing stage, and capture evidence for escalation (to the app team, the model provider, or the vector database owner).

Build a standard timeline query: given a request_id or trace_id, fetch all events sorted by timestamp and display key fields (event, latency_ms, model, tool_name, status, error_type, total_tokens). This immediately answers “where did time go?” and “what failed first?” Then build pivots:

  • “The bot is slow”: filter llm.call.completed and group by model and provider_region; check p95 latency and retry counts. Verify whether retrieval latency or tool latency is the real bottleneck.
  • “It says it can’t access my data”: search for rag.retrieve.* and tool.call.* events; look for empty results (doc_count=0), permission errors, or filter mismatches. Compare across conversation_id to see if it’s persistent.
  • “Wrong answer after yesterday’s change”: compare error/quality proxy metrics by prompt_version and experiment_id. Use prompt_hash groupings to see which prompt shapes correlate with increased tool failures or higher token burn.
  • “Costs spiked”: aggregate total_tokens and estimated_cost_usd by budget_tag, model, and endpoint; identify top conversations and whether retrieval context size increased.
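The standard timeline query reduces to a filter, a sort, and a projection. A minimal in-memory sketch (a real playbook would run the equivalent query in your log store):

```python
def timeline(events, trace_id):
    """Reconstruct one trace: filter, order by timestamp, project key fields."""
    rows = [e for e in events if e.get("trace_id") == trace_id]
    rows.sort(key=lambda e: e["timestamp"])
    return [
        (e["timestamp"], e["event"], e.get("latency_ms"), e.get("error_type"))
        for e in rows
    ]

# Illustrative events; field values are hypothetical.
events = [
    {"trace_id": "t1", "timestamp": 2, "event": "llm.call.completed", "latency_ms": 3100},
    {"trace_id": "t1", "timestamp": 1, "event": "rag.retrieve.completed", "latency_ms": 140},
    {"trace_id": "t2", "timestamp": 1, "event": "request.received"},
]
rows = timeline(events, "t1")
# Reading the rows in order: retrieval was fast (140 ms); the model
# call dominated the latency (3100 ms), so that's where to dig.
```

The projection answers "where did time go?" and "what failed first?" at a glance, which is the whole point of the playbook.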

To reproduce failures safely, avoid re-running with raw user content. Instead, use the logged metadata: prompt/template version, tool name/version, retrieval parameters, and a synthetic test input that matches the shape (same language, same tool path). Capture the timeline in the incident record and include the smallest set of identifiers (trace_id, timestamps, model/provider codes) needed for escalation.

Common mistakes: debugging only at the “final answer” stage, ignoring retries (which hide provider errors but amplify latency and cost), and failing to tag logs with deployment version. Your practical outcome is consistent triage: for most tickets, you can identify the failing stage in minutes, not hours, and you can prove it with structured evidence rather than anecdotes.

Chapter milestones
  • Design a log schema for prompts, tools, and retrieval without leaking secrets
  • Implement correlation IDs across requests and model calls
  • Build a basic query playbook for common user-reported issues
  • Establish log retention and redaction rules for compliance
Chapter quiz

1. Which logging approach best matches the chapter’s goal for LLM apps in production?

Show answer
Correct answer: Log the smallest set of structured events needed to reconstruct timelines, diagnose failures, and quantify impact
The chapter emphasizes logging a minimal but sufficient set of structured signals, not “log everything” or “log nothing.”

2. Why does the chapter say the “incident surface” is broader for LLM apps than traditional systems?

Show answer
Correct answer: A single user request can span multiple model calls, retrieval lookups, tool executions, and safety filters
One request can trigger many downstream components, increasing places where failures and costs can arise.

3. Which combination of signals best supports measuring operational impact as described in the chapter?

Show answer
Correct answer: Latency, error rate, token usage, and downstream quality indicators
The chapter calls out impact-focused metrics like latency, error rate, token usage, and quality indicators.

4. What is the primary purpose of implementing correlation IDs across requests and model calls?

Show answer
Correct answer: To link related events so you can reconstruct an end-to-end timeline for a single user request
Correlation IDs connect events across components, enabling fast tracing and troubleshooting for one request.

5. How should retention and redaction rules be set according to the chapter’s priorities?

Show answer
Correct answer: Satisfy compliance while keeping enough data to operate and troubleshoot the system
The chapter stresses balancing compliance requirements with retaining enough evidence to run and support the system.

Chapter 3: Observability: Metrics, Traces, Dashboards, and Alerts

In IT support, you learn to ask consistent questions under pressure: “What changed?”, “Who is impacted?”, “Is this widespread or isolated?”, and “How do we restore service fast without making it worse?” AI Ops for LLM applications uses the same mindset, but the failure modes expand. You still have outages and latency spikes, yet you also have “soft failures” like degraded answer quality, runaway token spend, or retrieval drift that looks like user error until you instrument it.

This chapter gives you an observability toolkit for LLM systems: metrics, traces, dashboards, and alerts that let an on-call engineer answer the top questions quickly, validate alarms with synthetic checks, and document SLOs with error budgets tied to user impact. The emphasis is practical engineering judgment: choosing a small set of core metrics, structuring dashboards around decisions, and reducing alert noise while still catching real incidents.

Think of observability as a contract between builders and operators. Builders commit to emitting signals that map to user experience and cost. Operators commit to using those signals to triage, escalate, and improve the system with postmortems. In LLM apps, that contract must cover four domains at once: reliability (does it work), performance (is it fast), quality/safety (is it acceptable), and spend (is it affordable).

  • Metrics quantify trends and thresholds (latency, error rate, tokens).
  • Traces explain why a request was slow or wrong (which hop, which tool, which retrieval query).
  • Dashboards enable fast situational awareness (what’s broken, where, and for whom).
  • Alerts trigger action with minimal noise (page only when humans must intervene).

As you read the sections, keep one operational goal in mind: if someone wakes you up at 2 a.m., you should be able to identify the blast radius, confirm the symptom with a synthetic check, isolate the failing component in a trace, and decide whether to mitigate (rollback, degrade gracefully, switch model) while staying within your SLO error budget and cost guardrails.

Practice note for “Choose core metrics for reliability, quality proxies, and spend”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for “Create a dashboard that answers the top 10 on-call questions”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for “Set alert thresholds that reduce noise and catch real incidents”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for “Validate alerting with synthetic checks and canaries”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for “Document SLO error budgets tied to user impact”: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: The golden signals for LLM apps (latency, errors, saturation, cost)

Start with “golden signals,” a concept borrowed from SRE, and adapt it to LLM apps. Traditional systems focus on latency, errors, and saturation. For LLM apps, add cost as a first-class signal because cost can fail your service just as surely as downtime (rate limits, budget caps, or sudden bill spikes forcing shutdowns).

Latency should be measured end-to-end (user request to final token) and per stage (retrieval, model call, tool calls, post-processing). Track percentiles (p50/p95/p99) rather than averages. A common mistake is alerting on mean latency; LLM latency distributions are often long-tailed due to tool timeouts, vector DB slowness, or long completions.

Errors must be categorized. A single “500 rate” hides critical differences: provider timeouts, tool invocation failures, retrieval empty results, safety filter blocks, and malformed outputs. Treat “errors” as “requests that fail user intent,” and tag them accordingly. In practice, you’ll maintain a taxonomy: transport errors (HTTP), application errors (exceptions), and semantic errors (invalid JSON, missing fields, policy refusal when user expected a normal answer).

Saturation in LLM apps includes CPU/memory like any service, but also concurrency limits, queue depth, provider rate limits, and vector DB capacity. A frequent operational pitfall is ignoring upstream quota; your app can be healthy while the LLM provider throttles you. Instrument “throttled requests,” “retry count,” and “queue wait time” to see saturation early.

Cost should be expressed in two forms: unit cost (cost per request, per conversation, per successful resolution) and rate cost (cost per minute/hour/day). Operators need both. Unit cost detects prompt bloat and inefficient chains; rate cost detects traffic spikes or abuse. These golden signals become your core reliability, quality-proxy, and spend metrics—small enough to stay focused, rich enough to diagnose.
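The two cost views fall out of the same event stream. A small sketch, using hypothetical per-request cost figures:

```python
def unit_and_rate_cost(events, window_minutes):
    """Unit cost catches prompt bloat; rate cost catches traffic spikes."""
    total = sum(e["estimated_cost_usd"] for e in events)
    per_request = total / len(events)          # unit cost: $ per request
    per_hour = total / window_minutes * 60     # rate cost: $ per hour
    return per_request, per_hour

# Three requests observed over a 30-minute window (illustrative costs).
events = [{"estimated_cost_usd": c} for c in (0.002, 0.003, 0.007)]
per_req, per_hr = unit_and_rate_cost(events, window_minutes=30)
# per_req is about $0.004/request; per_hr is about $0.024/hour
```

Slice the same computation by budget_tag, model, or endpoint and you have the chargeback view described in 3.3.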

Workflow tip: when a new feature ships (new tool, longer context, added RAG step), explicitly ask: which golden signal will move, and what is the acceptable range? That question prevents surprise regressions and sets you up to write SLOs later.

Section 3.2: Tracing LLM pipelines (API, retriever, vector DB, model, tools)

Metrics tell you that something is wrong; traces tell you where and often why. An LLM request is rarely a single call. It’s a pipeline: API gateway → orchestration layer → retriever → vector DB → reranker → model → tool calls (search, SQL, ticketing) → final response formatting. Without tracing, on-call teams end up guessing and rolling back blindly.

Implement distributed tracing with a consistent trace_id propagated across every hop. Each stage should create spans with standard attributes: component name, operation, start/end time, status, and key request tags (tenant, environment, route, model name, tool name). Add LLM-specific tags: prompt template version, retriever index version, top_k, and whether streaming was enabled. If you support multi-turn conversations, include a conversation_id and a turn index.
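To show the shape of span data without committing to a vendor API, here is a hand-rolled span context manager. Real deployments would use OpenTelemetry instead; the tag names and values below are illustrative.

```python
import contextlib
import time
import uuid

TRACE = {"trace_id": uuid.uuid4().hex, "spans": []}

@contextlib.contextmanager
def span(name, **tags):
    """Minimal span: name, timing, status, and LLM-specific tags."""
    record = {
        "span_id": uuid.uuid4().hex[:8],
        "name": name,
        "start": time.time(),
        "tags": tags,
        "status": "ok",
    }
    try:
        yield record
    except Exception as exc:
        record["status"] = "error"
        record["tags"]["error_type"] = type(exc).__name__
        raise
    finally:
        record["end"] = time.time()
        TRACE["spans"].append(record)

# Two pipeline stages wrapped in spans (work bodies elided).
with span("rag.retrieve", index_name="kb-v3", top_k=8):
    pass  # retrieval call would go here
with span("llm.call", model="gpt-4o-mini", prompt_version="v12"):
    pass  # model call would go here
```

Even this toy version captures the triage essentials: per-stage timing, error status, and the tags (model, prompt version, index version) you pivot on during an incident.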

Practical judgment: do not store raw prompts and responses in traces by default. They can contain sensitive data and explode storage costs. Instead, log hashes, lengths, token counts, and redacted snippets, and enable “full payload capture” only for a sampled subset or for explicit debugging sessions with strict access controls.

Common tracing mistakes include: (1) a single monolithic span around the entire request (no stage detail), (2) missing correlation between logs and traces (no shared IDs), and (3) ignoring retries. Retries can make a request “look slow” while hiding the root cause (throttling or intermittent tool failures). Record retry count and backoff time as span attributes.

Operational outcome: with good traces, incident triage becomes systematic. When p95 latency alerts fire, you immediately open a trace waterfall for slow requests and answer the on-call questions: Is slowness in retrieval, the model provider, or a tool? Is it isolated to one tenant or route? Did it start after a deploy of a new index or prompt template? Tracing turns LLM pipelines into debuggable systems rather than black boxes.

Section 3.3: Token and throughput metrics (prompt/completion tokens, TPM/RPM)

LLM operations add a new resource dimension: tokens. Tokens are both a performance input (longer prompts often mean slower responses) and a direct cost driver. Instrument token metrics with the same rigor you’d apply to CPU or database queries.

At minimum, capture per request: prompt_tokens, completion_tokens, total_tokens, model name, and whether the response was streamed. Aggregate these into percentiles and time-series by endpoint, tenant, and prompt template version. This lets you spot prompt bloat when a template change silently increases average prompt tokens by 40%.

Throughput matters at two layers: your service and the upstream provider. Track RPM (requests per minute) and TPM (tokens per minute) per model and per tenant. Many provider limits are expressed as RPM/TPM; you want to see when you’re approaching the ceiling before users see throttling. Include metrics for 429 rate, retry counts, and “time spent waiting for quota.”
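A sliding-window TPM tracker makes quota pressure visible before the provider starts returning 429s. A minimal sketch; the 100,000 TPM limit is an assumed provider quota:

```python
from collections import deque
import time

class TokenRateTracker:
    """Sliding 60-second window of token usage per model or tenant."""

    def __init__(self, tpm_limit):
        self.tpm_limit = tpm_limit
        self.window = deque()          # (timestamp_seconds, tokens)

    def record(self, tokens, now=None):
        self.window.append((time.time() if now is None else now, tokens))

    def tpm(self, now=None):
        now = time.time() if now is None else now
        while self.window and self.window[0][0] < now - 60:
            self.window.popleft()      # drop entries older than one minute
        return sum(tokens for _, tokens in self.window)

    def utilization(self, now=None):
        return self.tpm(now) / self.tpm_limit

tracker = TokenRateTracker(tpm_limit=100_000)  # assumption: provider quota
tracker.record(30_000, now=0.0)
tracker.record(45_000, now=10.0)
# At t=10s the window holds 75,000 tokens: 75% of quota, which is
# when guardrails (shed optional context, fail over) should kick in.
```

Emitting `utilization` as a gauge per model and tenant gives you the "approaching the ceiling" signal the paragraph above calls for.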

Practical cost controls tie metrics to budgets. Create budget tags (team, feature, customer, environment) and ensure every request carries them in metrics/logs. Then you can build a “chargeback” view: cost by tenant per day, cost by feature per release, cost by incident (yes, incidents often have a cost signature like retry storms).

Common mistakes: (1) tracking only total tokens, not prompt vs completion—prompt bloat and runaway generation require different fixes; (2) ignoring caching—if you add response caching or embedding caching, instrument cache hit rate and saved tokens; (3) setting hard caps without graceful degradation. Instead, define guardrails: when token budgets are near limits, reduce context window, lower top_k, switch to a cheaper model, or disable optional tools while keeping the core experience alive.

Practical outcome: token and throughput observability turns “mysterious bill spikes” into actionable engineering work—optimize prompts, tighten retrieval, cap generation, and align capacity with demand.

Section 3.4: Quality and safety operational signals (refusals, citations, policy hits)

LLM incidents are not always outages. A model can be “up” while the product is effectively broken because answers are wrong, unsafe, or unhelpful. Since “quality” is hard to measure directly, operators rely on quality proxies—signals correlated with degraded outcomes.

Start with measurable events: refusal rate (how often the model refuses), policy hit rate (safety classifier blocks or warnings), and citation coverage for RAG systems (percentage of answers containing citations, number of citations, and “citation-to-claim” heuristics if available). Track retrieval empty rate (no documents found) and low similarity rate (top score below threshold). These often predict hallucinations and user dissatisfaction.

Also instrument output validity: JSON schema validation failures, tool-call parsing failures, and “self-contradiction” heuristics if your system uses structured formats. If your app uses function calling, track tool selection frequency and tool error rate. A tool error can manifest as a polite but useless answer; you want to see it as a distinct operational event.

Engineering judgment is crucial: do not page engineers for every safety refusal. Some refusals are correct and expected. Instead, alert on unexpected shifts: refusal rate doubling after a model version change, policy hits spiking for a single tenant (possible prompt injection), or citations dropping after a retriever deployment. Use dashboards to compare baseline windows (last 7 days) to current behavior.

Common mistakes include treating user feedback as the only quality signal. Feedback is lagging and biased. Combine it with leading indicators (retrieval health, schema validity, refusal patterns). Practical outcome: you can detect prompt drift, retrieval drift, and model behavior changes using operational signals, and you can triage “quality incidents” with the same discipline as reliability incidents.

Section 3.5: Dashboards: incident view vs product view vs exec view

Dashboards are decision tools. The fastest way to build a useful one is to design for the “top 10 on-call questions,” then create separate views for different audiences. Mixing everything into one dashboard produces noise and slows response.

Incident view answers: Is the service down? Who is impacted? Where is the bottleneck? What changed? Put your golden signals at the top: error rate, p95/p99 latency, saturation indicators (queue depth, throttles), and cost rate if relevant (retry storms can spike tokens). Include quick breakdowns by endpoint, tenant, region, and model. Add a deploy marker timeline so on-call can correlate incidents with releases.

  • Top row: availability and p95 latency (last 15m, last 1h).
  • Second row: error taxonomy (timeouts, 429s, tool failures, retrieval empty).
  • Third row: saturation (concurrency, queue wait, provider quota utilization).
  • Fourth row: token rate and cost rate (TPM, $/hour) for anomaly spotting.

Product view focuses on user experience and quality proxies over longer windows: completion success rate, refusal rate (expected vs unexpected), citation coverage, feedback trends, and “time to first token” for perceived responsiveness. Product teams use this to prioritize improvements and understand regressions after prompt or retriever changes.

Exec view is about outcomes and risk: SLO attainment, error budget remaining, cost versus budget, and major incident counts. Keep it sparse. Executives should not need to interpret span names or tool error codes.

Common mistake: building dashboards from what’s easy to measure rather than what’s needed to decide. Your practical outcome should be this: during an incident, an on-call engineer can open the incident view and, within two minutes, choose a mitigation path (rollback, fail over model, disable a tool, reduce context) with confidence.

Section 3.6: Alert strategy: burn rates, anomaly detection, and paging rules

Alerts should create action, not anxiety. In IT support terms: page only when a human must intervene now; otherwise route to a ticket, a Slack channel, or a daily report. LLM systems amplify alert noise because many metrics are naturally spiky (tokens, latency tails, provider throttles). You need clear paging rules and validation.

Use SLO-based burn rate alerts for reliability. Define an SLO tied to user impact (for example, “99.5% of requests succeed without user-visible error in 30 days,” or “p95 latency under 4 seconds for chat responses”). Then alert on fast burn (you’re consuming error budget too quickly in the last 5–30 minutes) and slow burn (a sustained issue over hours). Burn rate alerting reduces noise compared to static thresholds and focuses on what threatens your monthly reliability goals.
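The burn-rate arithmetic is simple enough to sketch directly. The SLO target and request counts below are illustrative:

```python
def burn_rate(bad_requests, total_requests, slo_target=0.995):
    """Ratio of observed error rate to the SLO's allowed error rate.
    1.0 means the error budget is being consumed exactly on schedule
    for the SLO window; much higher over a short window means page."""
    error_budget = 1.0 - slo_target            # e.g. 0.5% of requests may fail
    observed_error_rate = bad_requests / total_requests
    return observed_error_rate / error_budget

# Fast-burn check over the last 5 minutes: 30 of 1,000 requests failed (3%).
rate = burn_rate(bad_requests=30, total_requests=1000)
# The budget is burning six times faster than the SLO allows.
```

In practice you run this over two windows at once (a short window for fast burn, a long window for slow burn) with different thresholds for each.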

Use anomaly detection selectively for cost and quality proxies: sudden increases in TPM, cost/hour, refusal rate, retrieval empty rate, or citation drop. Anomaly detection is powerful, but it can page you for expected events (marketing launch, batch job). The operational fix is to scope anomalies by tags (only production, exclude load tests), add seasonality where supported, and pair anomaly alerts with a confirmation query (e.g., “Is RPM also up?”).

Define paging rules as a simple table: condition, severity, owner, and first mitigation. Example: “429 rate > 2% for 10m” pages the on-call because users are blocked; “cost/day forecast exceeds budget by 15%” creates a ticket to FinOps and the service owner because mitigation might be prompt optimization or quota changes, not an emergency rollback.

Validate alerts with synthetic checks and canaries. Run scripted requests that test critical paths (RAG retrieval, tool call, JSON output) and record their own metrics/traces. Canaries help you detect failures before users report them, and they help you confirm that an alert corresponds to a real user-visible issue. A common mistake is testing only “model responds” rather than “model responds with correct structure and citations.”
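A canary validator along these lines checks for correct structure and citations, not just liveness. The response shape and field names (`answer`, `citations`, `latency_ms`) are illustrative placeholders, not a real API.

```python
# Sketch of a synthetic-check validator: it tests "responds with correct
# structure and citations," not merely "model responds." The response dict
# shape here is an assumed example, not any provider's actual schema.

def validate_canary(response: dict, max_latency_ms: int = 4000) -> list:
    """Return a list of failures; an empty list means the canary passed."""
    failures = []
    if response.get("status") != 200:
        failures.append(f"bad status: {response.get('status')}")
    body = response.get("body", {})
    if not isinstance(body.get("answer"), str) or not body["answer"].strip():
        failures.append("missing or empty answer")
    citations = body.get("citations")
    if not isinstance(citations, list) or len(citations) == 0:
        failures.append("no citations returned")
    if response.get("latency_ms", 0) > max_latency_ms:
        failures.append(f"latency {response['latency_ms']}ms over budget")
    return failures

# Illustrative canary results:
ok = {"status": 200, "latency_ms": 900,
      "body": {"answer": "Refunds take 5 days.", "citations": ["kb/refunds"]}}
bad = {"status": 200, "latency_ms": 900,
       "body": {"answer": "Refunds take 5 days."}}  # structure ok, citations gone
```

Running this on a schedule and recording the failure list as a metric gives you the "is the alert real?" confirmation step described above.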

Finally, document your SLOs and error budgets in a runbook: what the SLO measures, what’s excluded, where the dashboards are, and what mitigations are allowed. Practical outcome: your alerting system becomes a reliable teammate—quiet most of the time, loud only when user impact and error budget risk justify waking someone up.

Chapter milestones
  • Choose core metrics for reliability, quality proxies, and spend
  • Create a dashboard that answers the top 10 on-call questions
  • Set alert thresholds that reduce noise and catch real incidents
  • Validate alerting with synthetic checks and canaries
  • Document SLOs and error budgets tied to user impact
Chapter quiz

1. Which scenario best represents a “soft failure” in an LLM application that observability should help detect?

Show answer
Correct answer: Degraded answer quality that looks like user error until instrumented
The chapter highlights soft failures such as degraded quality, runaway spend, or retrieval drift that may not look like a traditional outage without the right signals.

2. According to the chapter, what is the most practical way to structure dashboards for on-call use?

Show answer
Correct answer: Around decisions and the top on-call questions they need to answer quickly
Dashboards are emphasized as tools for fast situational awareness, structured around the top questions and decisions during incidents.

3. How do metrics and traces differ in the observability toolkit described in the chapter?

Show answer
Correct answer: Metrics quantify trends and thresholds, while traces explain where and why a request slowed or failed
Metrics are for trends/thresholds (latency, error rate, tokens), while traces show the causal path (which hop/tool/retrieval query).

4. What alerting approach best matches the chapter’s goal of reducing noise while catching real incidents?

Show answer
Correct answer: Trigger action with minimal noise and page only when humans must intervene
The chapter frames alerts as triggers for action with minimal noise, paging only when intervention is required.

5. In the chapter’s 2 a.m. on-call workflow, what should an engineer do to confirm the symptom before diving deep into traces?

Show answer
Correct answer: Run a synthetic check to validate the alarm and confirm the symptom
The chapter explicitly calls out validating alarms with synthetic checks (and canaries) before isolating the failing component via traces.

Chapter 4: Incident Response for LLM Systems (Triage to Postmortem)

Traditional IT support teaches you the core muscle memory of incident response: detect, triage, mitigate, communicate, and learn. LLM systems add new failure modes and new “damage patterns” (wrong answers, policy violations, runaway spend) that don’t always look like a typical outage. The goal of this chapter is to map familiar support workflows to AI Ops realities so you can move from “something is broken” to “we understand the failure mode, have a safe mitigation, and can prevent repeats.”

Good incident response starts with a structured decision tree. When alerts fire (latency, 5xx, token spikes, quality drops), you should be able to quickly ask: Is this a provider issue? A retrieval failure? A prompt or policy regression? Or an application change that amplified cost and errors? Your job is to turn ambiguous symptoms into a classification, then choose the mitigation that minimizes user harm and preserves safety. Throughout, prioritize reversible changes (feature flags, rollback, model switch) and leave clear breadcrumbs: logs, traces, dashboards, and a stakeholder timeline that makes the incident legible to everyone involved.

Finally, blameless postmortems are how you build trust and improve reliability. LLM incidents often involve multiple contributing factors: prompt tweaks, data changes in a vector store, provider throttling, and new traffic patterns. Treat these as system design inputs, not personal failures. Every incident should produce measurable follow-ups: monitors, runbooks, budgets/guardrails, and drills that make the next response faster and calmer.

Practice note for each of this chapter's milestones:
  • Run LLM incident triage using a structured decision tree
  • Classify incidents by failure mode and pick the right mitigation
  • Write a clear incident update and stakeholder timeline
  • Complete a blameless postmortem with measurable follow-ups
  • Turn one incident into a hardened runbook and new monitors
For each, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Incident taxonomy for LLM apps (provider outage, RAG failure, prompt regression)
Section 4.2: Triage workflow: confirm impact, scope, and rollback options
Section 4.3: Mitigations: rate limits, fallbacks, circuit breakers, caching, model switch
Section 4.4: Communications: status pages, internal updates, customer messaging
Section 4.5: Postmortems: 5 whys, contributing factors, and action item ownership
Section 4.6: Runbooks and drills: game days, tabletop exercises, and learning loops

Section 4.1: Incident taxonomy for LLM apps (provider outage, RAG failure, prompt regression)

Incident classification is the fastest way to choose the right mitigation. In LLM applications, symptoms can look similar (users complain “the bot is bad”), but the underlying causes differ. Start with three common buckets and teach your on-call team to recognize their signatures in logs, traces, and metrics.

Provider outage or degradation shows up as increased latency, timeouts, elevated 429/5xx errors, or regional failures. Your traces will often show slow upstream spans (the model call) while your app stays healthy. Token usage may drop (requests failing) or spike (retries). The key is that your application code may be unchanged, yet performance suddenly shifts.

RAG (retrieval-augmented generation) failure often appears as “confidently wrong” answers, missing citations, irrelevant passages, or sudden drops in retrieval metrics (top-k scores, empty results, low recall). Operationally, you might see normal model latency but increased user re-prompts, longer conversations, and lower thumbs-up rates. Traces can reveal slow or failing vector DB calls, empty retrieval sets, or incorrect filters that exclude relevant documents.

Prompt regression occurs when a prompt, system instruction, tool schema, or safety policy changes and shifts behavior. The system remains “up,” but quality or compliance changes. Signals include increased moderation hits, higher refusal rates, broken tool calls (JSON schema errors), or a spike in token usage because the model rambles. A common mistake is treating this as “subjective quality” and delaying response; treat it as an incident when it impacts users, safety, or cost.

Other categories you should keep in your taxonomy: rate limiting/throttling due to traffic bursts, cost incidents (budget burn), data drift (new content distribution), and integration failures (tool endpoints returning 500s). The practical outcome is a shared vocabulary that turns vague reports into actionable labels, enabling faster triage and cleaner postmortems.
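A first-pass classifier over these buckets might look like the following sketch. The signal names and thresholds are illustrative and should come from your own dashboards; a real rule set would be tuned and reviewed after each incident.

```python
# Sketch: first-pass incident classifier mapping telemetry signatures to the
# taxonomy buckets above. Signal names and thresholds are assumed examples;
# the point is the shared vocabulary, not these specific numbers.

def classify_incident(signals: dict) -> str:
    """Return a coarse failure-mode label from a dict of observed signals."""
    # Provider degradation: elevated 429/5xx upstream while the app is healthy
    if (signals.get("provider_429_rate", 0) > 0.02
            or signals.get("provider_5xx_rate", 0) > 0.02):
        return "provider-outage-or-throttle"
    # RAG failure: empty retrievals or collapsing citation coverage
    if (signals.get("empty_retrieval_rate", 0) > 0.05
            or signals.get("citation_coverage", 1.0) < 0.5):
        return "rag-failure"
    # Prompt regression: broken tool schemas or a refusal spike
    if (signals.get("json_schema_error_rate", 0) > 0.01
            or signals.get("refusal_rate", 0) > 0.10):
        return "prompt-regression"
    # Cost incident: cost per request well above baseline
    if signals.get("cost_per_request_ratio", 1.0) > 2.0:
        return "cost-incident"
    return "unclassified"  # escalate to manual triage
```

Even a crude classifier like this forces the team to name the signals that distinguish each bucket, which is most of the triage value.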

Section 4.2: Triage workflow: confirm impact, scope, and rollback options

Triage for LLM systems should feel familiar to IT support, but your decision tree must include “quality and safety impact,” not just uptime. Begin with three confirmations: impact, scope, and reversibility.

1) Confirm impact: Identify what users experience and how you know. Use a small set of “golden signals” adapted for LLMs: request success rate, p95/p99 latency, token usage per request, and a quality proxy (thumbs-down rate, escalation-to-human rate, citation coverage, or policy violation rate). Validate with a few real examples from logs (redacted) or a canary replay. Common mistake: relying on a single metric like error rate while a quality regression silently harms users.

2) Confirm scope: Is it all traffic or a segment? Slice by model, region, tenant, endpoint, and feature flag. LLM incidents often affect one route (e.g., “/summarize” only), one retrieval index, or one customer with a different prompt template. Your tracing should let you follow a request through: gateway → prompt builder → retrieval → model call → post-processor. If you cannot determine scope quickly, that is itself a reliability gap to fix later.

3) Identify rollback and safe-mode options: Before you attempt a “fix,” list the reversible actions you can take in minutes: revert prompt version, roll back retrieval filters, switch to a smaller model, disable a tool, or enable cached responses. Decide who can approve each action (on-call, incident commander, product owner) and what guardrails apply (safety must not degrade).

Document the triage steps as a structured checklist so the responder doesn’t improvise under stress. The outcome of triage is not “root cause found” but “classified incident, scoped blast radius, and a mitigation plan with a rollback path.”
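The three confirmations can be encoded as a checklist data structure the on-call walks top to bottom. The step names follow this section; the evidence fields are placeholders for links into your own dashboards and tooling.

```python
# Sketch: triage as a structured checklist so the responder doesn't improvise
# under stress. Questions mirror the section above; evidence strings are
# illustrative placeholders for dashboard/trace links.

TRIAGE_CHECKLIST = [
    {"step": "confirm_impact",
     "questions": ["What do users experience, and how do we know?",
                   "Which golden signal moved: success rate, p95/p99 latency, "
                   "tokens per request, or a quality proxy?"],
     "evidence": "redacted log samples or a canary replay"},
    {"step": "confirm_scope",
     "questions": ["All traffic or a segment?",
                   "Which slice: model, region, tenant, endpoint, feature flag?"],
     "evidence": "trace: gateway -> prompt builder -> retrieval -> model -> post-processor"},
    {"step": "identify_rollback",
     "questions": ["Which reversible actions are available in minutes?",
                   "Who approves each action, and which safety guardrails apply?"],
     "evidence": "prompt version history, feature-flag console"},
]

def next_step(completed):
    """Return the first unchecked step, or None when triage is done."""
    for item in TRIAGE_CHECKLIST:
        if item["step"] not in completed:
            return item["step"]
    return None
```

Keeping the checklist as data (rather than prose in a wiki) means an incident tool can render it, track completion, and timestamp each step for the postmortem.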

Section 4.3: Mitigations: rate limits, fallbacks, circuit breakers, caching, model switch

Once you’ve classified the incident, choose mitigations that reduce user harm quickly. In LLM systems, the safest mitigations are often traffic-shaping and graceful degradation rather than complex hotfixes.

Rate limits and load shedding protect downstream providers and your budget. If a retry storm begins (timeouts leading to retries), cap retries and introduce jitter. Apply per-tenant and per-endpoint limits to stop one customer or feature from consuming the entire quota. A practical guardrail is a “tokens per minute” limit with alerting when you approach it.
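A capped-retry helper with full jitter might look like this sketch, where `call_model` is a stand-in for your real provider client rather than any actual SDK call.

```python
# Sketch: capped retries with exponential backoff and full jitter, so one
# timeout cannot become a retry storm. call_model is an assumed stand-in for
# your provider client; only TimeoutError is treated as retryable here.

import random
import time

def call_with_backoff(call_model, max_retries: int = 3,
                      base_delay: float = 0.5, max_delay: float = 8.0):
    """Retry transient failures; cap attempts so retries can't multiply spend."""
    for attempt in range(max_retries + 1):
        try:
            return call_model()
        except TimeoutError:
            if attempt == max_retries:
                raise  # give up and surface the failure instead of piling on
            # Full jitter: sleep a random amount up to the capped exponential.
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            time.sleep(delay)
```

Note the cap does double duty: it bounds latency for the user and bounds token spend, since every retry repeats the full prompt cost.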

Fallbacks keep the user experience intact when one component fails. Examples: if RAG retrieval returns empty, switch to a curated FAQ response; if tool execution fails, return an answer without tool augmentation plus a disclaimer; if the premium model is degraded, fall back to a smaller model for non-critical endpoints. Ensure fallbacks are explicitly tested; an untested fallback often becomes the second outage.

Circuit breakers stop cascading failures. If the vector DB latency exceeds a threshold or error rate spikes, open the circuit and skip retrieval for a short window rather than letting threads pile up. Pair this with clear user messaging (e.g., “Using general knowledge mode; citations may be unavailable”).
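A minimal retrieval circuit breaker can be sketched as follows; the error threshold and cooldown are illustrative values you would tune against real vector DB behavior.

```python
# Sketch: a minimal circuit breaker for the retrieval hop. After too many
# consecutive errors, skip retrieval for a cooldown window instead of letting
# requests queue behind a slow vector DB. Thresholds are assumed examples.

import time

class RetrievalBreaker:
    def __init__(self, error_threshold: int = 5, cooldown_s: float = 30.0):
        self.error_threshold = error_threshold
        self.cooldown_s = cooldown_s
        self.errors = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def record_error(self):
        self.errors += 1
        if self.errors >= self.error_threshold:
            self.opened_at = time.monotonic()  # open: stop calling retrieval

    def record_success(self):
        self.errors = 0  # healthy call resets the consecutive-error count

    def allow_retrieval(self) -> bool:
        if self.opened_at is None:
            return True
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.opened_at = None  # half-open: try retrieval again
            self.errors = 0
            return True
        return False  # open: caller should use the general-knowledge fallback
```

When `allow_retrieval()` returns False, the app skips the retrieval span entirely and attaches the user-facing "general knowledge mode" notice described above.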

Caching can stabilize both latency and cost. Cache embeddings for repeated documents, cache retrieval results for identical queries within a time window, and consider response caching for deterministic prompts. Be careful: caching can mask regressions and can leak data if cache keys aren’t tenant-scoped. Always include tenant/user boundaries and TTLs.
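A tenant-scoped cache with TTLs might look like this sketch; an in-memory dict stands in for a real cache such as Redis, and the key layout is one reasonable choice, not a standard.

```python
# Sketch: tenant-scoped response cache with TTLs. Keys include the tenant,
# the prompt version, and a hash of the normalized query, so one tenant can
# never read another tenant's cached answer and a prompt rollout invalidates
# stale entries. The in-memory dict is a stand-in for a real cache backend.

import hashlib
import time

class TenantCache:
    def __init__(self, ttl_s: float = 300.0):
        self.ttl_s = ttl_s
        self._store = {}  # key -> (stored_at, response)

    def _key(self, tenant_id: str, prompt_version: str, query: str) -> str:
        digest = hashlib.sha256(query.strip().lower().encode()).hexdigest()
        return f"{tenant_id}:{prompt_version}:{digest}"  # tenant boundary in the key

    def get(self, tenant_id, prompt_version, query):
        entry = self._store.get(self._key(tenant_id, prompt_version, query))
        if entry is None or time.monotonic() - entry[0] > self.ttl_s:
            return None  # miss or expired
        return entry[1]

    def put(self, tenant_id, prompt_version, query, response):
        key = self._key(tenant_id, prompt_version, query)
        self._store[key] = (time.monotonic(), response)
```

Because the prompt version is part of the key, a prompt change naturally stops serving answers produced under the old template, which guards against caching masking a regression fix.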

Model switch and prompt rollback are the fastest “undo” buttons. Maintain versioned prompts and a known-good baseline. For model switching, pre-validate that output formats and tool schemas are compatible; otherwise you’ll trade one incident for another (schema errors, broken JSON). The practical outcome is an incident playbook where each mitigation is linked to specific telemetry that tells you whether it helped.
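A versioned prompt registry with a known-good baseline makes rollback a one-call operation. The registry shape, version IDs, and template text below are illustrative, not a real system.

```python
# Sketch: versioned prompts with a known-good baseline so "undo" is one call.
# Names, version IDs, and template text are assumed examples.

PROMPT_REGISTRY = {
    "support-chat": {
        "active": "v7",
        "baseline": "v5",  # last version that passed regression tests
        "versions": {
            "v5": "You are a support assistant. Cite a source for every claim.",
            "v7": "You are a support assistant. Cite sources. Keep answers short.",
        },
    },
}

def rollback_prompt(name: str) -> str:
    """Point the active prompt back at the known-good baseline; return it."""
    entry = PROMPT_REGISTRY[name]
    entry["active"] = entry["baseline"]
    return entry["versions"][entry["active"]]
```

The same pattern works for model configs: store a baseline model ID whose output format and tool schemas are pre-validated, so switching never trades one incident for a schema-error incident.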

Section 4.4: Communications: status pages, internal updates, customer messaging

Clear communication is part of mitigation. For LLM incidents, confusion spreads quickly because “bad answers” can feel random. Your updates must translate technical uncertainty into user-facing clarity while preserving trust and safety.

Internal updates should be frequent, structured, and timestamped. Use a consistent template: what we see (metrics and examples), user impact (who/what is affected), suspected failure mode (provider, RAG, prompt, tool), current mitigations, next steps, and when the next update will arrive. This creates a stakeholder timeline you can later reuse in the postmortem. A common mistake is burying the lead—start with impact and scope, not speculation.

Status pages should reflect user outcomes, not only infrastructure health. If latency is fine but answers are unsafe or incorrect due to retrieval drift, you may still need to post an incident because the product is effectively degraded. Describe symptoms and workarounds (e.g., “citations may be missing; use human escalation for critical decisions”).

Customer messaging must be specific and non-technical: what happened, what it means for them, and what you’re doing. Avoid implying the model is “thinking wrong”; instead describe the system behavior (“retrieval service returned incomplete data”). For regulated contexts, include whether any data exposure occurred. When the incident involves cost or throttling, explain limits and expected recovery times.

Assign a single communications owner (often the incident commander or a delegate) so messages don’t conflict. The practical outcome is reduced escalation noise, faster decision-making, and a record that supports accountability without blame.

Section 4.5: Postmortems: 5 whys, contributing factors, and action item ownership

LLM postmortems should be blameless, evidence-based, and tuned to sociotechnical causes. The goal is learning and prevention, not a courtroom. Start with a crisp incident narrative: detection time, impact start, mitigation steps, recovery time, and confirmation of full resolution.

Use 5 Whys carefully. The “why” chain for LLM incidents often branches. For example: “Why did users get incorrect answers?” because retrieval returned irrelevant documents; why irrelevant? because an index rebuild changed chunking; why changed? because a deployment used a new tokenizer; why not caught? missing offline evaluation and no monitor for citation coverage. Stop when you reach an actionable system change, not a person’s decision.

Capture contributing factors, not just a single root cause: alert thresholds too loose, lack of canary tests, on-call runbook missing a model-switch step, provider rate limit undocumented, or ambiguous ownership between app and data teams. LLM apps frequently fail at boundaries (prompt builder vs retrieval vs tool execution), so include interface contracts and schema validation in your analysis.

Action items must have owners and measures. “Improve monitoring” is not an action item. A good item looks like: “Add alert when empty-retrieval rate > 2% over 10 minutes for tenant-scoped traffic; owner: Search Platform; due: 2026-04-15.” Tie actions to prevention mechanisms: automated prompt regression tests, cost budgets with alerts, circuit breakers, or better redaction in logs for faster diagnosis.

End with what went well and what to improve in the response process (handoffs, escalation, comms). The practical outcome is fewer repeats and a steadily maturing AI Ops culture.

Section 4.6: Runbooks and drills: game days, tabletop exercises, and learning loops

A runbook turns one hard-earned incident into repeatable competence. For LLM systems, runbooks should be decision-tree driven and telemetry-linked: “If 429s spike and provider latency rises, do A; verify with B; rollback with C.” Include screenshots or exact query links for dashboards so responders don’t waste time hunting.

Runbook essentials include: how to identify failure mode (provider vs RAG vs prompt), where to find correlated logs/traces, how to flip feature flags, model-switch procedures, and safe-mode behavior. Add a “data hygiene” section: how to handle redaction, PII, and prompt contents in incident artifacts. A common mistake is writing a runbook that reads like a design doc; keep it procedural and time-oriented.

Game days validate the runbook in realistic conditions. Inject failures such as: vector DB latency increase, embedding job producing empty vectors, prompt template change causing JSON parse errors, or a sudden token-cost surge from a looping tool call. Measure time-to-detect, time-to-mitigate, and how confidently the team classified the incident. Use these drills to refine monitors and alert routing.

Tabletop exercises are lighter-weight and excellent for cross-functional alignment. Walk through a scenario with engineering, support, product, and legal: who approves a model fallback, when to post to status, and how to message a quality regression. Capture gaps as backlog items.

Close the loop by feeding learnings back into instrumentation, alerts, budgets, and deployment practices. The practical outcome is an organization that responds to LLM incidents with the same rigor as classic production incidents—plus the extra care needed for quality, safety, and cost.

Chapter milestones
  • Run LLM incident triage using a structured decision tree
  • Classify incidents by failure mode and pick the right mitigation
  • Write a clear incident update and stakeholder timeline
  • Complete a blameless postmortem with measurable follow-ups
  • Turn one incident into a hardened runbook and new monitors
Chapter quiz

1. Why does Chapter 4 emphasize using a structured decision tree during LLM incident triage?

Show answer
Correct answer: To turn ambiguous symptoms into a failure-mode classification and choose a safe mitigation quickly
A decision tree helps map alerts and symptoms (latency, 5xx, token spikes, quality drops) to a likely failure mode so you can select an appropriate mitigation that minimizes harm and preserves safety.

2. Which set of incident symptoms is specifically highlighted as common alert signals in LLM systems?

Show answer
Correct answer: Latency, 5xx errors, token spikes, and quality drops
The chapter lists latency, 5xx, token spikes, and quality drops as example alerts that should trigger structured triage.

3. During triage, which question aligns with the chapter’s guidance for narrowing down the failure mode?

Show answer
Correct answer: Is this a provider issue, retrieval failure, prompt/policy regression, or an application change amplifying cost and errors?
The chapter recommends classifying incidents by likely source (provider, retrieval, prompt/policy, or application change) to guide mitigation.

4. What mitigation approach does the chapter recommend prioritizing during an active incident?

Show answer
Correct answer: Reversible changes such as feature flags, rollback, or a model switch
Reversible mitigations reduce risk while restoring stability, and they support faster recovery if an action has unintended effects.

5. What should a blameless postmortem produce according to the chapter?

Show answer
Correct answer: Measurable follow-ups such as monitors, runbooks, budgets/guardrails, and drills
The chapter frames postmortems as system-learning tools that generate concrete improvements to prevent repeats and speed future response.

Chapter 5: Cost Budgets and FinOps for LLM Apps (Keep Spend Predictable)

In IT support, you learned that “availability” is not just uptime—it is also the ability to keep operating within constraints: capacity, licensing, and budgets. LLM applications add a new constraint that behaves like a utility meter: tokens and tool usage. The engineering challenge is not merely to reduce spend; it is to make spend predictable, attributable, and controllable so teams can ship features confidently without financial surprises.

This chapter translates FinOps thinking into concrete AI Ops workflows. You will model token-based costs, set monthly budgets per environment (dev/test/prod), and implement guardrails such as quotas, limits, and feature flags. You will also learn to attribute spend to teams and features using tags and per-request cost estimates, detect anomalies, and maintain a cost optimization backlog with simple ROI estimates. Finally, you’ll publish a weekly cost report that doesn’t just inform—it drives action.

The core mindset shift: treat cost as an operational signal alongside latency and errors. If you can alert on a 500 rate spike, you can alert on a cost-per-request spike. If you can run a postmortem for downtime, you can run a postmortem for budget burn. The goal is a closed loop: instrument → budget → guardrail → detect → optimize → report.

Practice note for each of this chapter's milestones:
  • Model token-based costs and set monthly budgets per environment
  • Implement cost guardrails: limits, quotas, and feature flags
  • Detect cost anomalies and attribute spend to teams and features
  • Create a cost optimization backlog with ROI estimates
  • Publish a weekly cost report that drives action
For each, document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Cost drivers: tokens, context length, tool calls, embeddings, vector search

LLM costs rarely come from “the model” alone. In production, spend is the sum of several metered activities, and your first job is to identify which meters matter for your app. The primary meter is tokens: input tokens (prompt + conversation history + retrieved context) and output tokens (the model’s response). Context length is the lever that silently inflates input tokens—especially in chat apps that append every prior turn and multiple documents.

Next are tool calls. Many modern agents call tools (search, database queries, ticketing APIs) and then feed results back into the model. Tool calls can drive cost in two ways: direct vendor/API charges and indirect token charges when verbose tool outputs are inserted into the prompt. A common mistake is returning entire JSON payloads or full HTML pages; you pay for every token you paste into context.

  • Embeddings: Creating embeddings for documents and user queries is usually priced per token embedded. Embedding costs can be steady (batch indexing) or spiky (re-embedding after a content change). Track both “indexing embeddings” and “query embeddings.”
  • Vector search: The vector DB may charge per read/write units, storage, or query throughput. Even if it’s “cheap,” vector search can drive downstream LLM tokens by returning too many or too-long chunks.
  • Retries and fallbacks: Timeouts, rate limits, or parsing errors can cause retries. Each retry repeats token spend unless you design idempotent caching.

Practically, model your cost per request as: (input_tokens × input_rate) + (output_tokens × output_rate) + (tool/API charges) + (embedding charges) + (vector query charges). In AI Ops, you want this broken down by request path (chat, summarization, classification), because the remediation differs: trimming context reduces tokens; RAG tuning reduces retrieved text; tool output summarization reduces tool-to-context bloat.
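That formula translates directly into code. The rates below are placeholders for illustration, not current provider pricing; substitute your own contract rates.

```python
# Sketch: per-request cost model following the formula above. All rates are
# placeholder values, not real provider pricing.

RATES = {
    "input_per_1k": 0.003,   # USD per 1,000 input tokens (placeholder)
    "output_per_1k": 0.015,  # USD per 1,000 output tokens (placeholder)
    "embed_per_1k": 0.0001,  # USD per 1,000 embedded tokens (placeholder)
    "vector_query": 0.0002,  # USD per vector search query (placeholder)
}

def request_cost(input_tokens: int, output_tokens: int,
                 tool_charges: float = 0.0, embed_tokens: int = 0,
                 vector_queries: int = 0) -> float:
    """Estimated USD cost of one request, per the formula in the text."""
    return round(
        input_tokens / 1000 * RATES["input_per_1k"]
        + output_tokens / 1000 * RATES["output_per_1k"]
        + tool_charges
        + embed_tokens / 1000 * RATES["embed_per_1k"]
        + vector_queries * RATES["vector_query"],
        6,
    )
```

For example, a chat request with 4,000 input tokens, 500 output tokens, 20 embedded query tokens, and 2 vector queries costs about two cents at these placeholder rates; computing this per request path is what makes the remediations above comparable.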

Section 5.2: Budgeting: forecasting, seasonality, and scenario planning

Budgeting for LLM apps works best when you treat it like capacity planning with uncertainty. Start by defining environments and their intent. Dev is for iteration (higher variance, lower volume), staging is for validation (moderate volume, strict controls), and prod is for customers (highest volume, highest risk). Set monthly budgets per environment, not a single global number, so experimentation doesn’t accidentally consume customer-facing runway.

Build a simple forecast from operational metrics you already collect: requests per day, average input tokens, average output tokens, and tool-call rate. Multiply by current pricing and include a buffer for retries and peak days. Then add seasonality and business events: product launches, marketing campaigns, end-of-month reporting spikes, or support ticket surges. Even a basic weekly seasonality factor (weekday vs weekend) improves predictability.

  • Baseline scenario: current traffic and current cost per request.
  • Growth scenario: traffic increases by X% while cost per request stays flat.
  • Risk scenario: traffic increases and cost per request increases (e.g., longer context, more tool calls, new features).

Common budgeting mistakes include using averages without percentiles (p95 context length often drives burn), ignoring “hidden” costs (embeddings, vector DB), and failing to separate one-time indexing events from steady-state usage. Engineering judgment matters: if your roadmap includes an agentic feature that calls tools multiple times, budget it explicitly as a new cost center with its own forecast.

Operational outcome: you can say, “If we keep p95 input tokens under 6k and maintain current volume, prod will stay within $N/month; if not, we will hit the guardrails by week three.” That statement is actionable and testable.
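The three scenarios can be computed from the same simple model; the traffic and cost-per-request numbers below are illustrative only.

```python
# Sketch: baseline / growth / risk scenarios from metrics you already collect.
# All inputs (20k requests/day, $0.02/request, +50% growth, +40% cost drift,
# 10% retry/peak buffer) are assumed example numbers.

def monthly_forecast(requests_per_day: float, cost_per_request: float,
                     days: int = 30, retry_buffer: float = 1.10) -> float:
    """Monthly USD spend with a buffer for retries and peak days."""
    return requests_per_day * days * cost_per_request * retry_buffer

baseline = monthly_forecast(20_000, 0.02)          # current traffic and cost
growth = monthly_forecast(20_000 * 1.5, 0.02)      # +50% traffic, flat cost/request
risk = monthly_forecast(20_000 * 1.5, 0.02 * 1.4)  # traffic AND cost/request rise
```

Presenting all three numbers side by side is what makes the budget conversation concrete: leadership sees not just the expected spend but the spread between "traffic grows" and "traffic grows while context bloats."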

Section 5.3: Guardrails: max_tokens, timeouts, caching, batching, and truncation

Guardrails are the difference between “we hope costs stay low” and “costs cannot exceed our tolerance.” In IT support terms, guardrails are like disk quotas, rate limits, and change controls. You implement them in layers: request-level limits, system-level quotas, and feature-level kill switches.

Start with max_tokens (or equivalent) on every generation. Without it, long-winded outputs can double spend with no user value. Pair it with timeouts so stuck tool calls or slow model responses don’t trigger repeated retries. Then enforce truncation: cap conversation history, cap retrieved document tokens, and summarize older turns into a compact memory. Truncation is not just chopping text; do it intentionally—keep the user’s goal, constraints, and key facts, not verbatim logs.

  • Caching: Cache deterministic or near-deterministic responses (policy answers, boilerplate explanations) and cache tool results (e.g., “ticket status”) with a short TTL. A frequent mistake is caching only the final LLM response while still paying for repeated tool calls and embeddings.
  • Batching: For embeddings and offline jobs, batch requests to reduce overhead and smooth spikes. For online inference, batching depends on latency targets; use it where you can (e.g., background summarizations).
  • Feature flags and quotas: Gate expensive features (agent mode, deep research, large context) behind flags, and apply per-user or per-tenant quotas. This enables progressive rollout and quick containment during incidents.

Practical workflow: define “budget protection modes.” Example: Normal (full features), Conserve (shorter context, smaller model fallback), Emergency (disable tool calls, return minimal answer). Tie mode switching to both cost burn rate and operational errors, and document it like a runbook so on-call staff can act decisively.
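The budget protection modes can be sketched as configuration plus a selection rule driven by burn rate. The mode settings and thresholds here are illustrative and belong in your runbook, not hard-coded.

```python
# Sketch: budget protection modes selected from the current burn rate
# (spend so far divided by the pro-rated budget). Mode settings and the
# 1.2x / 1.5x thresholds are assumed examples to be set in a runbook.

MODES = {
    "normal":    {"max_context_tokens": 8000, "model": "premium", "tools": True},
    "conserve":  {"max_context_tokens": 4000, "model": "small",   "tools": True},
    "emergency": {"max_context_tokens": 2000, "model": "small",   "tools": False},
}

def select_mode(spend_to_date: float, monthly_budget: float,
                day_of_month: int) -> str:
    """Pick a protection mode from how fast the month's budget is burning."""
    prorated = monthly_budget * day_of_month / 30
    burn = spend_to_date / prorated if prorated else 0.0
    if burn >= 1.5:
        return "emergency"  # disable tool calls, return minimal answers
    if burn >= 1.2:
        return "conserve"   # shorter context, smaller model fallback
    return "normal"
```

In practice the output of `select_mode` would feed the same feature-flag system used for incident mitigation, so on-call staff flip modes with a mechanism they already trust.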

Section 5.4: Attribution: tagging, per-request cost estimation, chargeback/showback

If you can’t attribute spend, you can’t manage it. Attribution turns “LLM bill is high” into “Feature X in Team Y is driving 62% of token growth due to longer retrieved context.” Implement attribution the same way you implement structured logging: consistently, automatically, and early in the request path.

Add tagging to every request: environment, app/service name, team/owner, feature flag state, tenant/customer, endpoint/route, model name, and version. If you use a gateway, enforce tags there to avoid “unknown” spend. In logs and traces, include token counts, tool-call counts, embedding tokens, and retrieval stats (top_k, chunk sizes). This enables per-request cost estimation even before the invoice arrives.

Per-request cost estimation is straightforward: multiply token counts by your model’s published rates and add known unit costs (vector query units, embedding tokens). Store the computed estimate in a metric (e.g., estimated_cost_usd) and aggregate by tag. This supports both showback (visibility without billing) and chargeback (internal billing) depending on org maturity.

  • Showback: weekly report by team and feature, highlight top spenders and trends.
  • Chargeback: allocate costs to budgets; require justification for sustained overages.
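Per-request estimation and tag-based aggregation can be sketched as follows; the rates are placeholders, and the tag names mirror the ones suggested above:

```python
# Sketch: per-request cost estimation aggregated by tag. Rates are
# illustrative placeholders; use your provider's published pricing.
from collections import defaultdict

RATES_PER_1K = {"input": 0.003, "output": 0.015, "embedding": 0.0001}

def estimate_cost_usd(input_tokens, output_tokens, embedding_tokens=0):
    """Compute estimated_cost_usd for one request from its token counts."""
    return (input_tokens / 1000) * RATES_PER_1K["input"] \
         + (output_tokens / 1000) * RATES_PER_1K["output"] \
         + (embedding_tokens / 1000) * RATES_PER_1K["embedding"]

def aggregate_by(requests, tag):
    """Roll up estimated spend by a tag such as 'team' or 'feature'."""
    totals = defaultdict(float)
    for r in requests:
        totals[r["tags"][tag]] += estimate_cost_usd(
            r["input_tokens"], r["output_tokens"], r.get("embedding_tokens", 0))
    return dict(totals)

requests = [
    {"tags": {"team": "support", "feature": "chat"},
     "input_tokens": 4000, "output_tokens": 500},
    {"tags": {"team": "search", "feature": "rag"},
     "input_tokens": 6000, "output_tokens": 300, "embedding_tokens": 2000},
]
by_team = aggregate_by(requests, "team")
```

The same aggregation over a "feature" or "tenant" tag produces the showback and chargeback views directly.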

Common mistakes: tagging only at the service level (missing feature-level detail), not versioning prompts (you cannot attribute cost spikes to prompt changes), and failing to distinguish “user initiated” versus “system initiated” calls (background jobs can be silent budget killers). A strong practice is to attach a “prompt_template_id” and “retriever_config_id” so cost increases can be traced to configuration, not guesses.

Section 5.5: Cost anomaly detection and automated throttling policies

Cost incidents often look like normal traffic—until you compute cost per request, cost per user, or cost per successful outcome. Build anomaly detection on ratios and burn rates, not just totals. Examples that catch issues early: sudden increase in input tokens per request, increase in tool calls per conversation, spike in retrieval chunks returned, or higher retry rate. These are often caused by a prompt change, a retrieval misconfiguration, or an upstream content shift that makes documents longer or noisier.

Implement alerts like you would for latency SLOs: define thresholds, baseline windows, and paging policies. A useful pattern is “budget burn rate” alerts: if you’ve spent 60% of the monthly budget by day 10, page the on-call and the feature owner. Pair this with anomaly alerts on p95 tokens, because p95 blowups can drain budgets even when averages look fine.
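The burn-rate pattern above can be sketched as a simple check, assuming a linear budget pace and hypothetical paging thresholds:

```python
# Sketch: a budget burn-rate check against a linear monthly pace.
# The 1.2x and 1.8x thresholds are hypothetical; tune them to your budget.

def burn_rate_alert(spent_usd, budget_usd, day_of_month, days_in_month=30):
    """Return an alert level when spend runs ahead of the linear budget pace."""
    expected = budget_usd * (day_of_month / days_in_month)
    if expected == 0:
        return "ok"
    ratio = spent_usd / expected
    if ratio >= 1.8:
        return "page"   # page on-call and the feature owner
    if ratio >= 1.2:
        return "warn"   # annotate dashboards, notify the team channel
    return "ok"

# 65% of a $10k budget spent by day 10: nearly double the expected pace.
level = burn_rate_alert(spent_usd=6500, budget_usd=10000, day_of_month=10)
```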

  • Automated throttling: when burn rate exceeds threshold, reduce max_tokens, lower top_k, or switch to a smaller model for non-critical traffic.
  • Progressive degradation: keep core flows running while disabling expensive optional paths (deep reasoning mode, multi-step agent loops).
  • Blocklists and circuit breakers: temporarily block abusive tenants or malfunctioning clients; stop infinite tool-call loops.

Engineering judgment: automation should be reversible and observable. Every throttle action must emit an event (who/what triggered it, which policy applied) and be visible on dashboards. A common mistake is “silent throttling” that lowers quality without informing support; this creates user confusion and extra tickets. Treat cost throttling as an incident response tool: communicate in status pages, annotate dashboards, and run a postmortem when guardrails trigger unexpectedly.

Section 5.6: Optimization patterns: smaller models, prompt compression, RAG tuning

Once spend is measurable and controlled, optimization becomes a backlog, not a panic. Create a cost optimization backlog where each item includes: expected savings, expected quality risk, engineering effort, and a validation plan. Estimate ROI simply: monthly_savings / engineer_weeks (or similar) to prioritize. Avoid optimizing what you can’t attribute; always start from the top cost drivers revealed by tags and per-request estimates.

High-leverage patterns start with model selection. Use smaller models for classification, routing, extraction, and draft generation; reserve larger models for complex reasoning or high-stakes responses. Implement “model cascading”: try a cheaper model first, escalate only when confidence is low (based on heuristics or evaluator signals). Another strong pattern is prompt compression: remove verbose instructions, reuse system prompts via references if your platform supports it, and replace long examples with shorter, representative ones. Summarize tool outputs and retrieved text before passing to the main model.
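Model cascading can be sketched with a confidence gate; the stand-in model functions and the length-based confidence heuristic here are purely illustrative, since real routers use evaluator signals:

```python
# Sketch: model cascading with a confidence gate. The model stand-ins and
# the confidence heuristic are toy examples, not a real routing policy.

def cascade(query, cheap_model, strong_model, confidence_threshold=0.75):
    """Try the cheaper model first; escalate only when confidence is low."""
    answer, confidence = cheap_model(query)
    if confidence >= confidence_threshold:
        return answer, "cheap"
    answer, _ = strong_model(query)
    return answer, "strong"

# Toy stand-ins for real model calls: (answer, confidence) pairs.
def cheap(q):
    return ("short answer", 0.9 if len(q) < 40 else 0.4)

def strong(q):
    return ("detailed answer", 0.95)

answer, route = cascade("reset my password", cheap, strong)
```

Logging the chosen route per request lets you measure what fraction of traffic the cheap model actually absorbs.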

RAG tuning often yields the best token savings without harming quality. Reduce chunk size bloat, tune top_k, add a reranker to select fewer but better passages, and enforce a maximum retrieved token budget. If documents are repetitive, deduplicate at ingestion. If user queries are broad, add query rewriting to improve retrieval precision so you don’t compensate by retrieving more.

  • Cache embeddings and retrieval: repeated questions should not re-embed and re-retrieve every time.
  • Stop sequences and structured outputs: constrain generation to avoid rambling.
  • Evaluate before and after: validate cost changes alongside quality metrics, not anecdotes.
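The maximum retrieved-token budget mentioned above can be enforced with a short loop; the whitespace token count is a crude stand-in for a real tokenizer:

```python
# Sketch: enforce a retrieved-token budget after reranking. Token counting
# here is a whitespace approximation; use your model's tokenizer in practice.

def enforce_retrieval_budget(passages, max_tokens=1500):
    """Keep the highest-ranked passages that fit within the token budget."""
    kept, used = [], 0
    for passage in passages:  # assumed already sorted best-first by a reranker
        tokens = len(passage.split())
        if used + tokens > max_tokens:
            break
        kept.append(passage)
        used += tokens
    return kept, used

# Three passages of roughly 600, 500, and 700 "tokens".
passages = ["alpha " * 600, "beta " * 500, "gamma " * 700]
kept, used = enforce_retrieval_budget(passages, max_tokens=1500)
```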

Close the loop with a weekly cost report that drives action: top changes in spend, top drivers by feature/team, guardrail triggers, and the status of optimization backlog items. Include “next week commitments” (owners and dates). This is where FinOps becomes operational: visibility turns into decisions, and decisions turn into predictable spend.

Chapter milestones
  • Model token-based costs and set monthly budgets per environment
  • Implement cost guardrails: limits, quotas, and feature flags
  • Detect cost anomalies and attribute spend to teams and features
  • Create a cost optimization backlog with ROI estimates
  • Publish a weekly cost report that drives action
Chapter quiz

1. In this chapter, what is the primary engineering goal for LLM spend in production?

Correct answer: Make spend predictable, attributable, and controllable so teams can ship without surprises
The chapter emphasizes predictability, attribution, and control—not just reducing spend.

2. How does the chapter recommend treating cost within AI Ops workflows?

Correct answer: As an operational signal alongside latency and errors, with alerting and postmortems
The mindset shift is to operationalize cost like other reliability signals.

3. Which set best represents the chapter’s suggested cost guardrails?

Correct answer: Quotas, limits, and feature flags
Guardrails are described as limits, quotas, and feature flags to control spend.

4. What is the purpose of attributing spend to teams and features?

Correct answer: To make cost actionable by identifying who/what is driving spend using tags and per-request estimates
Attribution enables ownership and targeted action, supported by tagging and per-request cost estimates.

5. What makes the weekly cost report effective according to the chapter?

Correct answer: It drives action by feeding a closed loop: instrument → budget → guardrail → detect → optimize → report
The chapter describes a closed-loop process where reporting leads to concrete follow-up actions.

Chapter 6: Operating in Production (Playbooks, Governance, and Career Moves)

In IT support, “production-ready” usually means the service starts, stays up, and has a clear path to restore service when it doesn’t. In AI Ops for LLM applications, those basics still apply—but the failure modes multiply. A successful release can still harm users through hallucinations, policy violations, runaway costs, or subtle quality regressions that only appear in certain queries. This chapter turns your support instincts into operational discipline for LLM systems: playbooks and KPIs you can defend, governance that controls change without freezing delivery, and a career plan that converts your experience into credible AI operations stories.

Your goal is to make the system observable and governable. Observable means: when a ticket arrives (“answers are wrong,” “it’s slow,” “cost doubled”), you can tie symptoms to traces, structured logs, retrieval results, model versions, and token usage. Governable means: prompts, retrieval settings, and model versions move through a controlled release process with approvals, rollback plans, and audit trails. When you combine these, you reduce chaos—incidents become repeatable workflows rather than heroic debugging sessions.

We’ll build toward a production-ready AI ops playbook and KPI set, define change control for prompts/models, and finish with practical portfolio artifacts (dashboards, runbooks, and a postmortem sample). Finally, you’ll translate all of it into a 30/60/90-day transition plan and interview-ready narratives.

Practice note for the chapter milestones (assembling the playbook and KPI set, setting governance for prompt/model releases and approvals, the 30/60/90-day transition plan, the portfolio artifact, and interview preparation): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 6.1: Production checklists: readiness, dependencies, and rollback plans

Production checklists are not paperwork; they are how you turn tribal knowledge into repeatable safety. For LLM apps, a readiness checklist should cover (1) runtime dependencies, (2) observability, (3) safety controls, and (4) rollback. Start with dependencies: model provider availability/SLA, embedding service, vector database, feature flags, secrets management, and any PII redaction service. In IT support terms, this is your dependency map and escalation tree—write it down and keep it current.

Next, define what “healthy” means before you deploy. Require structured logs that include request_id/trace_id, user and tenant identifiers (hashed if needed), prompt template version, retrieval config version, model name/version, token counts, latency breakdown (gateway, retrieval, model), and policy outcomes (blocked, redacted, allowed). A common mistake is only logging the final answer. In production, you need the steps: what was retrieved, what prompt was used, and why the model responded the way it did.
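The logging fields above can be assembled into one structured record per request; the field names are suggestions, not a standard schema:

```python
# Sketch: a structured log record for one LLM request, following the fields
# described above. Field names are suggestions, not a standard schema.
import hashlib
import json

def build_log_record(request_id, user_id, prompt_version, retrieval_version,
                     model, token_counts, latency_ms, policy_outcome):
    return {
        "request_id": request_id,
        "user_hash": hashlib.sha256(user_id.encode()).hexdigest()[:16],
        "prompt_template_version": prompt_version,
        "retrieval_config_version": retrieval_version,
        "model": model,
        "tokens": token_counts,            # {"input": ..., "output": ...}
        "latency_ms": latency_ms,          # {"gateway": ..., "retrieval": ..., "model": ...}
        "policy_outcome": policy_outcome,  # "allowed" | "redacted" | "blocked"
    }

record = build_log_record(
    "req-123", "user-42", "v7", "rc-3", "model-small",
    {"input": 4000, "output": 500},
    {"gateway": 12, "retrieval": 85, "model": 940},
    "allowed",
)
line = json.dumps(record)  # one JSON object per line, ready for a log pipeline
```

Hashing the user identifier keeps per-user aggregation possible without storing the raw ID in logs.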

Rollback is where LLM operations differ sharply from classic app releases. You may need to roll back a prompt template, a retrieval parameter (top_k, filters), or a model choice. Create a rollback plan that is testable: use feature flags to revert prompt versions, keep previous retrieval indexes or snapshots available, and maintain an allowlist of "known good" model versions. Document the rollback trigger conditions (e.g., P95 latency +30%, policy violations +0.5% absolute, cost/run +20%).

  • Readiness checklist minimums: dashboards exist, alerts tested, on-call rota confirmed, runbook updated, rate limits configured, budget guardrails enabled, and a rollback path verified in staging.
  • Pre-prod testing: run a canary with real traffic slices, replay anonymized queries, and validate both correctness and cost. Include “nasty” prompts (jailbreaks, long context, ambiguous questions).
  • Operational handoff: define who owns first response, who can approve rollback, and where incident comms happen.
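The rollback trigger conditions from this section can be checked mechanically; the thresholds below mirror the examples given above and should be adjusted to your own tolerances:

```python
# Sketch: evaluate rollback triggers against a baseline. Thresholds mirror
# the examples in this section (P95 +30%, violations +0.5% abs, cost +20%).

def rollback_triggers(baseline, current):
    """Return the names of any tripped rollback conditions."""
    tripped = []
    if current["p95_latency_ms"] > baseline["p95_latency_ms"] * 1.30:
        tripped.append("p95_latency")
    if current["policy_violation_rate"] - baseline["policy_violation_rate"] > 0.005:
        tripped.append("policy_violations")
    if current["cost_per_run_usd"] > baseline["cost_per_run_usd"] * 1.20:
        tripped.append("cost_per_run")
    return tripped

baseline = {"p95_latency_ms": 1200, "policy_violation_rate": 0.001,
            "cost_per_run_usd": 0.02}
current = {"p95_latency_ms": 1700, "policy_violation_rate": 0.002,
           "cost_per_run_usd": 0.021}
tripped = rollback_triggers(baseline, current)
```

Wiring a check like this into the canary comparison makes "should we roll back?" a data question rather than a debate.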

The practical outcome is simple: when something breaks at 2 a.m., you are not inventing a process—you are following one. That is the foundation for an AI ops playbook.

Section 6.2: Change management for prompts, retrieval configs, and models

LLM systems change in more places than traditional software. You might ship no code and still change behavior by editing a prompt, swapping an embedding model, tweaking retrieval filters, or upgrading the underlying LLM. Treat these as first-class releases with versioning and approvals. Your change management goal is to enable frequent iteration while reducing surprise.

Prompts should be stored like code: in version control, with named versions, review gates, and release notes. Avoid the common anti-pattern of editing prompts directly in a console without history. For retrieval, version the configuration (chunk size, overlap, top_k, reranker choice, metadata filters) and treat index rebuilds as risky operations with their own change tickets. For models, track the vendor model ID, parameters (temperature, max tokens), and safety settings. If you use multiple models (router patterns), version the routing rules too.

Adopt an approval workflow that matches risk. Low-risk changes (copy edits, metadata tag tweaks) can be fast-tracked with peer review. High-risk changes (policy-sensitive prompts, new model families, new data sources) should require a change advisory step: security/privacy sign-off, legal policy alignment, and an SRE/ops readiness check. Keep the process lightweight, but explicit.

  • Release package: what changed, why, expected impact on quality/latency/cost, how to validate, and how to rollback.
  • Progressive delivery: canary to 1–5% traffic, then ramp. Compare KPI deltas to baseline before full rollout.
  • Validation set: maintain a small suite of representative queries with expected outcomes and “quality proxy” checks (e.g., citation present, no PII leakage, correct tool use).
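A validation set like the one described can be sketched as query/check pairs; the checks below are illustrative heuristics, not production-grade detectors:

```python
# Sketch: a tiny validation suite of representative queries with
# quality-proxy checks. The checks are illustrative heuristics only.

def has_citation(answer):
    return "[source:" in answer

def no_pii_leak(answer):
    return "@" not in answer  # crude proxy; real checks use a PII detector

VALIDATION_SET = [
    {"query": "What is the refund policy?",
     "checks": [has_citation, no_pii_leak]},
    {"query": "Summarize ticket #100",
     "checks": [no_pii_leak]},
]

def run_validation(answer_fn):
    """Run every case against a candidate release; return (passed, failed)."""
    passed = failed = 0
    for case in VALIDATION_SET:
        answer = answer_fn(case["query"])
        if all(check(answer) for check in case["checks"]):
            passed += 1
        else:
            failed += 1
    return passed, failed

# Toy stand-in for the candidate release being validated.
stub = lambda q: "Refunds within 30 days. [source: policy.md]"
result = run_validation(stub)
```

Gate the ramp on this suite: a canary that fails any case should not proceed to full rollout.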

Engineering judgment matters here: you are balancing innovation speed with the operational risk of silent regressions. The practical outcome is fewer “mystery incidents” caused by untracked prompt edits or retrieval tuning that inadvertently drops relevant context.

Section 6.3: Access control and audit trails for LLM operations

LLM operations often touch sensitive data and expensive resources. Access control is not just about compliance; it is an operational safety mechanism. Apply least privilege across three layers: (1) who can view data (logs, prompts, retrieved documents), (2) who can change behavior (prompt/config/model releases), and (3) who can spend money (API keys, quota increases, budget overrides).

Start by separating roles. Typical roles include: Viewer (read-only dashboards), Operator (triage incidents, run queries, view redacted logs), Maintainer (merge prompt/config changes), Approver (release to production), and Security/Privacy (review data handling). Tie these roles to your identity provider and use time-bound elevation for emergencies. A common mistake is sharing a single API key across environments or teams; that destroys auditability and makes cost attribution impossible.

Audit trails should be designed, not assumed. Ensure every production request can be traced to: user/tenant, app version, prompt version, retrieval config version, model version, and the identity that approved the release. Also log administrative actions: who changed a prompt, who rebuilt an index, who increased rate limits, who disabled a safety filter, and when. Store these logs immutably where possible and set retention based on your policy.

  • Data minimization: redact or hash user inputs when feasible; avoid storing full prompts/responses unless required for debugging and governed by policy.
  • Environment separation: different keys and indices for dev/staging/prod; prevent “test data” from contaminating production retrieval.
  • Break-glass procedure: documented emergency access with mandatory post-incident review.
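Administrative audit events can be made tamper-evident with a simple hash chain; this is one lightweight option, not a prescribed design, and the field names are ours:

```python
# Sketch: a hash-chained audit event for administrative actions. The chain
# is one lightweight tamper-evidence technique; field names are illustrative.
import hashlib
import json

def audit_event(actor, action, target, prev_hash=""):
    """Record who did what, linked to the previous event's hash."""
    event = {"actor": actor, "action": action, "target": target,
             "prev_hash": prev_hash}
    payload = json.dumps(event, sort_keys=True)
    event["hash"] = hashlib.sha256(payload.encode()).hexdigest()
    return event

e1 = audit_event("alice", "update_prompt", "support-bot v7")
e2 = audit_event("bob", "rebuild_index", "kb-index", prev_hash=e1["hash"])
```

Altering an earlier event changes its hash, which breaks the chain for every event after it, so tampering is detectable on replay.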

The practical outcome is faster, safer incident response. When a stakeholder asks, “Who changed the prompt?” you can answer with evidence, not guesswork.

Section 6.4: KPIs that matter: reliability, user impact, quality proxies, and cost

KPIs are your operating language: they align engineering, support, and leadership on what “good” looks like. For LLM apps, you need a balanced scorecard across reliability, user impact, quality proxies, and cost. If you optimize only one dimension, you will break another (for example, pushing temperature down may reduce variability but can also reduce helpfulness; aggressive context truncation lowers cost but harms accuracy).

Reliability KPIs should look familiar: availability of the LLM gateway, error rate by type (timeouts, provider 5xx, tool failures, retrieval empty results), and latency percentiles (P50/P95/P99). Add LLM-specific breakdown: retrieval latency, model latency, tool-call latency. User impact KPIs translate those into experience: session success rate, abandonment rate, escalation-to-human rate, and “resolved without follow-up” signals.

Quality is the hard part, so use practical proxies. Track citation rate (if you require sources), refusal/guardrail hit rate, policy violation rate, and “answer changed after follow-up” as a confusion signal. Use lightweight human review on sampled conversations and tag outcomes (correct, incorrect, unsafe, incomplete). Avoid the mistake of treating a single LLM-as-judge score as truth; it’s a useful signal, not an absolute metric.

Cost KPIs must be first-class: tokens per request, tokens per successful task, cost per tenant/team, cache hit rate, and spend vs budget. Set guardrails: per-request max tokens, per-tenant daily caps, and alerts for spend anomalies. Forecast spend using recent traffic and token trends, and use chargeback tags (team, feature, environment, customer) so cost spikes have an owner.

  • Example KPI set: P95 latency, error rate, empty-retrieval rate, safety block rate, escalation rate, tokens/request, cost/day, and quality-sample pass rate.
  • Alerting strategy: alert on deltas from baseline (sudden changes), not only absolute thresholds.
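Alerting on deltas from baseline can be sketched with a rolling mean and standard deviation; the window and the three-sigma rule are illustrative choices, not the only reasonable ones:

```python
# Sketch: alert on deltas from a rolling baseline rather than absolute
# thresholds. The three-sigma rule and window size are illustrative choices.
from statistics import mean, stdev

def delta_alert(history, current, sigmas=3.0):
    """Flag the current value when it departs sharply from recent history."""
    if len(history) < 2:
        return False
    mu, sd = mean(history), stdev(history)
    if sd == 0:
        return current != mu
    return abs(current - mu) > sigmas * sd

# Tokens/request has hovered near 4000; a jump to 9000 should alert.
history = [3900, 4100, 4000, 3950, 4050]
alert = delta_alert(history, 9000)
```

The same check works for any ratio metric in the KPI set: tokens/request, cost/day, or empty-retrieval rate.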

The practical outcome is a KPI set you can put in a weekly ops review: what changed, why it changed, and what you will do next.

Section 6.5: Portfolio assets: incident report, runbook, cost policy, dashboards

Your portfolio is how you prove you can operate LLM systems. Aim for four artifacts that mirror real work: an incident report (postmortem), a runbook, a cost policy, and dashboards. These can be built from a personal project, a sandbox app, or sanitized examples—what matters is operational realism.

A strong incident report includes: timeline (detection → mitigation → resolution), customer impact, technical root cause (e.g., retrieval index rebuild removed metadata filter; prompt change increased tool calls), contributing factors, and action items with owners and dates. Include LLM-specific evidence: prompt/model versions, token spikes, traces showing where latency grew, and examples of failure outputs (redacted). The common mistake is writing a narrative without data; instead, link each claim to a metric or log excerpt.

Your runbook should read like something an on-call engineer can execute under pressure. Include: symptoms, quick checks, decision tree, safe mitigations (feature-flag rollback, lower max tokens, switch to fallback model), escalation contacts, and comms templates. Make sure it covers at least three LLM failure modes: provider outage/latency, retrieval returning irrelevant/empty context, and safety/policy regression after a prompt release.

A cost policy should define budgets, tags, and enforcement. Example sections: per-environment quotas, approval required for budget increases, caching expectations, and when to enable cheaper fallback models. Include a small forecast table and explain chargeback tags.

Dashboards should tell a story at a glance: traffic, latency breakdown, error taxonomy, token usage, spend, quality proxies, and top regressions by version. Build at least one “release comparison” view that overlays KPIs before/after a prompt or model change.

The practical outcome: you can hand an interviewer a single link or PDF package that demonstrates you know how to run production, not just prototype.

Section 6.6: Career path: titles, responsibilities, and interview preparation

Career transitions work best when you translate what you already do into the new domain. IT support already builds the muscle for triage, stakeholder communication, prioritization, and operational hygiene. AI ops adds new technical primitives (tokens, prompts, retrieval, model selection) and new risk categories (safety, privacy, drift). Common job titles include: AI Operations Specialist, LLM Ops Engineer, Prompt/LLM Reliability Engineer, AI Platform Support Engineer, Observability Engineer (AI), and Site Reliability Engineer (AI Systems). Responsibilities typically span on-call response, dashboarding/alerting, release governance, cost management, and continuous improvement via postmortems.

Create a 30/60/90-day plan that proves momentum. In the first 30 days, focus on fundamentals: learn the LLM request lifecycle, implement structured logging fields, and build a baseline dashboard for latency/errors/tokens. In 60 days, add governance and playbooks: version prompts/configs, introduce a canary release, and write two runbooks plus one postmortem template. By 90 days, demonstrate impact: reduce P95 latency, decrease empty-retrieval rate, or bring spend back under budget with guardrails and caching—then document the results with before/after metrics.

For interviews, prepare “scenario + metric” stories. Expect questions like: “A user says answers became wrong after yesterday’s deploy—what do you check?” or “Cost doubled overnight—how do you investigate?” Your answers should reference specific signals (token/request, model version changes, retrieval hit rate, safety block rate), specific mitigations (rollback prompt via feature flag, switch fallback model, throttle tool calls), and a postmortem approach that emphasizes learning over blame.

  • Metrics story format: baseline → change introduced → KPI delta → investigation steps → fix → guardrail added.
  • Operational maturity signals: you can explain how approvals, audit trails, and budgets reduce risk without slowing delivery.

The practical outcome is confidence: you can articulate how your IT support workflow maps directly to AI operations responsibilities, and you can prove it with artifacts, metrics, and a plan.

Chapter milestones
  • Assemble a production-ready AI ops playbook and KPI set
  • Set governance for changes: prompt/model releases and approvals
  • Create a 30/60/90-day transition plan from IT support to AI ops
  • Build a portfolio artifact: dashboards, runbooks, and postmortem sample
  • Prepare for interviews with AI ops scenarios and metrics stories
Chapter quiz

1. In Chapter 6, what best describes “observable” for LLM production operations?

Correct answer: Being able to tie user symptoms (wrong answers, slowness, cost spikes) to traces, structured logs, retrieval results, model versions, and token usage
The chapter defines observability as connecting tickets to concrete telemetry across logs/traces, retrieval, versions, and usage.

2. Why does Chapter 6 say “production-ready” is harder for LLM applications than traditional IT services?

Correct answer: Because even successful releases can still harm users via hallucinations, policy violations, runaway costs, or subtle quality regressions
Beyond uptime, LLMs introduce new failure modes (quality, safety, cost) that can regress without obvious outages.

3. What is the primary purpose of governance in the chapter’s production model?

Correct answer: Control changes to prompts, retrieval settings, and model versions through approvals, rollback plans, and audit trails without freezing delivery
Governance is change control that enables safe iteration via approvals, rollback, and auditability.

4. According to the chapter, what is the benefit of combining observability and governance?

Correct answer: Incidents become repeatable workflows rather than heroic debugging sessions, reducing chaos
With both, teams can diagnose and manage changes systematically, making response repeatable instead of ad hoc.

5. Which set of items best matches the chapter’s recommended portfolio artifacts to demonstrate AI ops readiness?

Correct answer: Dashboards, runbooks, and a postmortem sample
The chapter explicitly calls out dashboards, runbooks, and a postmortem sample as practical portfolio artifacts.