
Probability & Optimization for ML: From Derivations to Code

Machine Learning — Intermediate

Derive core ML math, then implement it cleanly and correctly in code.

Intermediate probability · optimization · maximum-likelihood · bayesian-inference

Course Overview

Probability and optimization are the two pillars that quietly determine whether a machine learning system is robust or brittle. This book-style course teaches you to move fluently from modeling assumptions (probability) to trainable objectives (loss functions), and then to reliable training (optimization). The focus is not on memorizing formulas—it’s on understanding where they come from, how to derive them, and how to implement them without the numerical and engineering mistakes that commonly derail ML projects.

You will start with probability foundations targeted to machine learning: conditional reasoning, expectation as an operator, and the distributions you’ll use repeatedly. From there, you’ll turn probability models into estimation procedures, first via maximum likelihood and then via MAP estimation, making the connection between priors and regularization explicit. This frames ML training as a principled optimization problem rather than a collection of tricks.

Derivations That Lead to Working Code

Many learners can follow a derivation on paper but struggle to translate it into correct and stable implementations. This course repeatedly closes that gap. You’ll derive gradients for key models, learn the matrix calculus patterns that show up everywhere in ML, and practice stability techniques like the log-sum-exp trick. You’ll also learn how to verify your work with gradient checking so bugs are caught early—before they become “mystery training behavior.”

Optimization You Can Reason About

Once the objective is clear and gradients are correct, optimization determines whether training converges quickly, slowly, or not at all. You’ll build intuition for convexity, smoothness, curvature, and conditioning—concepts that explain why step sizes matter, why momentum helps, and why some problems are inherently harder to optimize. Then you’ll implement and compare practical optimizers (SGD, RMSProp, Adam), learn how schedules change outcomes, and adopt a debugging workflow for divergence, NaNs, and plateaus.

From Unconstrained to Modern Probabilistic Optimization

The final chapter expands your toolkit to constrained optimization and probabilistic objectives with latent variables. You’ll learn Lagrangians and KKT conditions for constraint handling, then connect inference procedures like EM and variational methods to optimization of lower bounds (ELBO). This gives you a coherent lens for understanding classic probabilistic ML and modern approximate inference.

What You’ll Walk Away With

  • A clear path from probability assumptions to likelihoods, posteriors, and losses
  • Reusable derivation templates for gradients and common objectives
  • Implementations you can adapt: stable losses, optimizers, and checks
  • Practical judgment for choosing optimizers and diagnosing training

If you want to strengthen the mathematical core of your ML practice—without losing sight of implementation details—this course is designed for you.

What You Will Learn

  • Translate probability assumptions into ML objectives (likelihood, MAP, ELBO)
  • Derive gradients and implement them with stable vectorized code
  • Use key distributions and exponential-family identities for fast modeling
  • Apply convexity, smoothness, and conditioning concepts to choose optimizers
  • Implement and tune SGD, momentum, Adam, and learning-rate schedules
  • Diagnose optimization failures (divergence, plateaus, ill-conditioning) and fix them
  • Build regularized linear and logistic regression from derivation to implementation
  • Implement constrained optimization tools (Lagrange multipliers, KKT) for ML

Requirements

  • Comfort with basic calculus (derivatives, partial derivatives, chain rule)
  • Basic linear algebra (vectors, matrices, dot products, matrix multiplication)
  • Python proficiency (NumPy; prior PyTorch helpful but not required)
  • Familiarity with basic machine learning terminology (features, loss, training)

Chapter 1: Probability Foundations for ML Objectives

  • Map assumptions to a likelihood and a loss function
  • Compute expectations and variance with vectorized notation
  • Work with common distributions used in ML pipelines
  • Build a simple probabilistic model end-to-end in NumPy
  • Checkpoint: derive and code a Gaussian negative log-likelihood

Chapter 2: Estimation—MLE, MAP, and Regularization

  • Derive MLE for linear regression and connect it to least squares
  • Derive MAP and show how priors become regularizers
  • Implement MLE/MAP training loops with stable log-likelihoods
  • Validate estimates with diagnostics and calibration checks
  • Checkpoint: implement ridge and lasso-style penalties and compare

Chapter 3: Gradients, Matrix Calculus, and Backprop Intuition

  • Compute gradients for linear and logistic regression by hand
  • Vectorize derivatives and match them to efficient code
  • Implement gradient checking to catch silent bugs
  • Connect computational graphs to backprop and autodiff
  • Checkpoint: derive and implement softmax cross-entropy stably

Chapter 4: Optimization Basics—Convexity, Conditioning, and First-Order Methods

  • Assess convexity and pick an optimizer accordingly
  • Implement gradient descent with line search and stopping rules
  • Measure conditioning and understand its training impact
  • Use momentum to accelerate and smooth updates
  • Checkpoint: build a robust optimizer module for a toy objective

Chapter 5: Practical Optimizers—SGD, Adam, Schedules, and Stability

  • Implement SGD, RMSProp, and Adam from scratch in Python
  • Choose learning rates and schedules using measurable signals
  • Apply normalization and regularization tactics that help optimization
  • Debug divergence and NaNs with a repeatable checklist
  • Checkpoint: train logistic regression and a small MLP with tuned optimizers

Chapter 6: Constrained & Probabilistic Optimization—KKT, EM, and Variational Ideas

  • Solve constrained ML problems using Lagrangians and KKT conditions
  • Derive and implement EM for a simple latent-variable model
  • Understand ELBO and implement a minimal variational inference loop
  • Compare MLE/MAP/VI in terms of objectives and behavior
  • Capstone: end-to-end probabilistic model with an optimizer you implement

Sofia Chen

Senior Machine Learning Engineer, Probabilistic Modeling

Sofia Chen is a Senior Machine Learning Engineer specializing in probabilistic modeling, optimization, and scalable training pipelines. She has built and deployed ML systems for ranking, forecasting, and anomaly detection, with a focus on numerical stability and reproducible experimentation.

Chapter 1: Probability Foundations for ML Objectives

Machine learning training is often described as “minimizing a loss,” but the most reliable way to design that loss (and debug it) is to start from probability. In this chapter you’ll treat probability as a modeling language: you state assumptions about how data is generated, translate them into a likelihood, and then turn that likelihood into an objective you can optimize with gradients and stable code.

We will keep the focus practical: you’ll learn to express expectations and variances in vectorized notation, choose common distributions that match your pipeline (classification vs. regression vs. counts), and assemble a small end-to-end probabilistic model in NumPy. Along the way, we’ll flag the mistakes that cause real-world failures: mixing up densities and probabilities, dropping constants that aren’t constant, mis-handling shapes in vectorized code, and producing numerically unstable log-likelihoods.

By the end of this chapter, you should be able to read a modeling assumption like “noise is Gaussian with variance σ²” and immediately write down (1) the negative log-likelihood loss, (2) the gradients, and (3) a stable vectorized implementation—plus understand what each term is doing and what can go wrong when the assumptions are violated.

Practice note (applies to each of this chapter’s milestones): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Sections in this chapter
  • Section 1.1: Events, random variables, and sigma-algebra intuition
  • Section 1.2: Conditional probability, Bayes’ rule, and independence
  • Section 1.3: Expectation, variance, covariance, and law of total expectation
  • Section 1.4: Common distributions (Bernoulli, Binomial, Gaussian, Categorical)
  • Section 1.5: Transformations, change of variables, and Jacobians (practical view)
  • Section 1.6: Probability as modeling language: from data generating process to loss

Section 1.1: Events, random variables, and sigma-algebra intuition

Probability starts with a sample space Ω (all possible outcomes) and a collection of events 𝓕 you’re allowed to assign probabilities to. In rigorous terms, 𝓕 is a sigma-algebra: it’s closed under complements and countable unions. In ML, you rarely manipulate sigma-algebras directly, but the intuition matters: it tells you what questions are “legal” to ask about your data and what it means to condition on information.

A random variable is a function from outcomes to numbers (or vectors). Your dataset row xᵢ is usually treated as a realization of a random vector X. The label yᵢ is a realization of Y. This is more than notation: it’s the bridge from raw arrays to probabilistic assumptions. When you say “Y|X is Gaussian,” you are declaring a distribution for the random variable Y conditioned on X—an explicit data generating process (DGP).

Engineering judgment: keep clear whether you are modeling a probability mass (discrete outcomes, sums) or a density (continuous outcomes, integrals). Confusing the two leads to errors like comparing densities directly across different scalings, or forgetting that a Gaussian density can exceed 1 for small variance. In code, this distinction shows up in whether you implement log-PMF or log-PDF, and whether normalization terms are required.
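The mass-versus-density distinction is easy to check numerically. In this minimal sketch (helper names like `gaussian_logpdf` are illustrative, not a library API), a Gaussian log-PDF yields a density above 1 for small variance, while a Bernoulli log-PMF yields genuine probabilities:

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log-density of N(mu, sigma^2); the normalization term is required."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((x - mu) / sigma) ** 2

# A density is not a probability: for small sigma the peak exceeds 1.
peak = np.exp(gaussian_logpdf(0.0, 0.0, 0.1))
print(peak)  # ≈ 3.99, perfectly legal for a density

def bernoulli_logpmf(y, pi):
    """Log-mass of Bernoulli(pi); exponentiating gives a true probability <= 1."""
    return y * np.log(pi) + (1 - y) * np.log1p(-pi)

print(np.exp(bernoulli_logpmf(1, 0.3)))  # 0.3
```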

Common mistake: treating “random variable” as “random number generator.” In ML derivations, X and Y are symbolic objects; randomness is about the distribution you assume. In implementation, you typically work with fixed data arrays and evaluate log-likelihoods under your model. Separating “data as realized values” from “model as distribution” makes debugging much easier.

Section 1.2: Conditional probability, Bayes’ rule, and independence

Most ML objectives are built from conditional models like p(y|x, θ). Conditional probability is best viewed as “updating your distribution after observing information.” Formally, p(A|B)=p(A∩B)/p(B). In modeling, B is often “X=x,” and the resulting object is a distribution over Y given inputs x.

Bayes’ rule is the workhorse connection between generative and discriminative viewpoints: p(θ|D) ∝ p(D|θ)p(θ). In optimization language, maximizing the posterior is MAP estimation: θ̂ = argmax_θ [log p(D|θ) + log p(θ)]. A practical outcome is that priors become regularizers: a Gaussian prior θ ~ N(0, τ²I) produces an L2 penalty (1/(2τ²))||θ||² (up to constants). This mapping is one of the most useful “probability → objective” translations in ML.
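A quick numerical illustration of the prior-to-regularizer mapping for the linear-Gaussian case. This sketch uses the ridge closed form with λ = σ²/τ²; the data and names are made up for demonstration:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(30, 5))
y = X @ rng.normal(size=5) + 0.5 * rng.normal(size=30)
sigma2, tau2 = 0.25, 1.0   # noise variance and prior variance (assumed known)

# MLE: minimize ||y - Xw||^2 via the normal equations.
w_mle = np.linalg.solve(X.T @ X, X.T @ y)

# MAP with prior w ~ N(0, tau2 I): the prior adds (sigma2/tau2) I — i.e., ridge.
lam = sigma2 / tau2
w_map = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# The Gaussian prior acts as an L2 penalty: it shrinks the estimate toward zero.
assert np.linalg.norm(w_map) < np.linalg.norm(w_mle)
```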

Independence assumptions control factorization and computation. For i.i.d. data, p(D|θ)=∏ᵢ p(yᵢ|xᵢ, θ). Taking logs turns products into sums: log-likelihood = ∑ᵢ log p(yᵢ|xᵢ, θ). This is why minibatch SGD works: the full gradient is a sum of per-example gradients, so you can estimate it with a subset. If your data isn’t independent (time series, graphs), blindly applying i.i.d. factorization can produce overconfident models and brittle training.
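The i.i.d. factorization can be verified directly in code. In this sketch (a linear-Gaussian model with fixed noise; all names are illustrative), the full-data gradient equals the sum of per-example gradients, which is exactly what makes minibatch estimates sensible:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

def grad_nll(w, Xb, yb):
    """Gradient of the squared-error NLL on a batch (Gaussian noise, sigma fixed)."""
    r = yb - Xb @ w
    return -(Xb.T @ r)

w = np.zeros(d)
full = grad_nll(w, X, y)

# The full gradient is the sum of per-example gradients...
per_example = sum(grad_nll(w, X[i:i+1], y[i:i+1]) for i in range(n))
assert np.allclose(full, per_example)

# ...so a random minibatch, rescaled by n/batch_size, is an unbiased estimate.
idx = rng.choice(n, size=100, replace=False)
estimate = (n / 100) * grad_nll(w, X[idx], y[idx])
```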

Common mistake: assuming conditional independence you didn’t earn. For example, in Naive Bayes you assert features are conditionally independent given the class. This can work surprisingly well, but it’s a modeling choice, not a theorem. The practical check is whether the resulting likelihood aligns with your pipeline and whether the loss you optimize is sensitive to the dependencies you care about.

Section 1.3: Expectation, variance, covariance, and law of total expectation

Expectations turn randomness into computable summaries and are central to both probabilistic modeling and optimization. For a random variable Z with distribution p, the expectation is E[Z]=∑ z p(z) in the discrete case or ∫ z p(z) dz in the continuous case. In ML, you frequently take expectations over data distributions (generalization) or over latent variables (inference).

Vectorized notation helps you implement these ideas without loops. If X is an n×d data matrix (rows xᵢ), the empirical mean is μ̂ = (1/n)∑ᵢ xᵢ, which in NumPy is X.mean(axis=0). The empirical covariance is Σ̂ = (1/n)∑ᵢ (xᵢ-μ̂)(xᵢ-μ̂)ᵀ, implementable as (Xc.T @ Xc) / n where Xc = X - mu. Getting shapes right (d×d covariance) is not cosmetic: downstream algorithms (whitening, Gaussian models, conditioning diagnostics) depend on it.
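These vectorized estimators are a few lines of NumPy. The sketch below computes the empirical mean and covariance exactly as described and cross-checks the d×d result against `np.cov`:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))           # n x d data matrix (rows are examples)

mu = X.mean(axis=0)                     # empirical mean, shape (d,)
Xc = X - mu                             # center once, reuse everywhere
Sigma = (Xc.T @ Xc) / X.shape[0]        # d x d covariance, 1/n convention

# Cross-check: bias=True matches the 1/n convention used above.
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
assert Sigma.shape == (4, 4)            # shapes are not cosmetic
```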

Variance is Var(Z)=E[(Z−E[Z])²]. Covariance extends this to vectors: Cov(X)=E[(X−μ)(X−μ)ᵀ]. Practically, large variance in gradients can slow optimization; you’ll see this later as noisy SGD steps and the need for momentum or adaptive methods.

The law of total expectation is a simple but powerful identity: E[Y]=E[E[Y|X]]. It tells you that you can average in stages—first over conditional distributions, then over the marginal. This becomes crucial when building models with latent variables, where you might compute an inner expectation analytically (or approximately), then outer-average over data. A related tool, the law of total variance, decomposes uncertainty into “within-group” and “between-group” components—useful for diagnosing why a model is overconfident or underfit.
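Both laws are easy to confirm with a simulation. The sketch below draws Y in two stages (a group indicator X, then Y|X) and checks E[Y] = E[E[Y|X]] and the total-variance decomposition numerically:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000

# Two-stage sampling: X picks a group, Y|X is Gaussian with a group-dependent mean.
X = rng.integers(0, 2, size=n)              # P(X=0) = P(X=1) = 0.5
means = np.array([1.0, 5.0])                # E[Y|X=0] = 1, E[Y|X=1] = 5
Y = rng.normal(loc=means[X], scale=1.0)

# Law of total expectation: E[Y] = E[E[Y|X]] = 0.5*1 + 0.5*5 = 3.
assert abs(Y.mean() - 3.0) < 0.05

# Law of total variance: Var(Y) = E[Var(Y|X)] + Var(E[Y|X]) = 1 + 4 = 5.
assert abs(Y.var() - 5.0) < 0.15
```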

Common mistake: mixing up population moments (under the true distribution) with empirical estimates (from your sample). In code, be explicit: are you computing a statistic from your dataset, or an expectation under your model p(·|θ)? Confusing them often leads to incorrect gradients or double-counting terms in objectives.

Section 1.4: Common distributions (Bernoulli, Binomial, Gaussian, Categorical)

Most ML pipelines repeatedly rely on a small set of distributions. Choosing one is not “math decoration”; it determines the loss function, the gradient scale, and what kinds of errors the model treats as plausible.

Bernoulli models binary outcomes y∈{0,1}: p(y|π)=π^y(1−π)^(1−y). If you parameterize π via a logistic link π=σ(η), the negative log-likelihood becomes the familiar logistic (cross-entropy) loss. In implementation, don’t compute σ then log; use stable forms like logaddexp to avoid overflow when η has large magnitude.
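The stability point deserves a concrete demonstration. A minimal sketch (function names are illustrative) comparing the naive sigmoid-then-log implementation with the stable softplus form NLL(y, η) = logaddexp(0, η) − yη:

```python
import numpy as np

def bernoulli_nll_naive(y, eta):
    """Unstable: sigmoid then log; log(1 - p) hits log(0) for large eta."""
    p = 1.0 / (1.0 + np.exp(-eta))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def bernoulli_nll_stable(y, eta):
    """Stable form: NLL = softplus(eta) - y*eta, via logaddexp(0, eta)."""
    return np.logaddexp(0.0, eta) - y * eta

eta = np.array([-40.0, 0.0, 40.0])
y = np.array([1.0, 1.0, 0.0])

with np.errstate(divide="ignore"):
    naive = bernoulli_nll_naive(y, eta)   # inf at eta=40, y=0: p rounds to 1.0

stable = bernoulli_nll_stable(y, eta)     # finite for every eta
```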

Binomial extends Bernoulli to counts of successes in N trials. It’s common in click modeling (“k clicks out of N impressions”) and lets you weight examples naturally by N. The log-likelihood includes a combinatorial term log C(N,k), which is constant w.r.t. parameters if N and k are fixed, but matters for evaluating calibrated probabilities.

Gaussian is the default for real-valued regression residuals: y|μ,σ² ~ N(μ,σ²). The negative log-likelihood is proportional to squared error when σ² is fixed, and becomes a heteroscedastic loss when σ² depends on x. A key engineering insight: modeling σ² can prevent the model from chasing noise, but it can also cause numerical issues if σ² collapses toward 0—so you enforce positivity with softplus and add lower bounds.

Categorical models one-of-K outcomes with probabilities π₁…π_K. Combined with a softmax parameterization, its negative log-likelihood is multiclass cross-entropy. The stability rule here is standard: subtract the maximum logit before exponentiating (z - z.max(axis=1, keepdims=True)) and compute log-softmax with logsumexp.
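A minimal sketch of that rule (illustrative helper names, not a library API): shift by the row maximum, compute log-softmax, and read off the multiclass cross-entropy for the true classes:

```python
import numpy as np

def log_softmax(z):
    """Row-wise log-softmax: shift by the max, then subtract the log of the sum."""
    z = z - z.max(axis=1, keepdims=True)                 # stability shift
    return z - np.log(np.sum(np.exp(z), axis=1, keepdims=True))

def categorical_nll(z, y):
    """Multiclass cross-entropy: negative log-probability of the true class."""
    return -log_softmax(z)[np.arange(len(y)), y]

logits = np.array([[1000.0, 0.0, -1000.0]])   # naive softmax would overflow here
y = np.array([0])
print(categorical_nll(logits, y))             # finite (≈ 0): extreme logits handled
```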

Practical outcome: once you can write down the log-likelihood of these distributions, you can immediately obtain an ML objective by summing over data, and you can implement it efficiently with vectorization. This is the core habit you will reuse across the course.

Section 1.5: Transformations, change of variables, and Jacobians (practical view)

Transformations show up everywhere in ML: you map unconstrained parameters to constrained spaces (variance must be positive, probabilities must sum to 1), and you reparameterize random variables to make optimization easier. The change-of-variables rule is the accounting system that keeps your densities correct.

If Z has density p_Z(z) and you define Y=g(Z) with an invertible, differentiable g, then p_Y(y)=p_Z(g⁻¹(y)) |det J_{g⁻¹}(y)|. In practice, you often implement this in log space: log p_Y(y)=log p_Z(z) + log|det J_{g⁻¹}(y)|. This matters whenever you build “transformed distributions,” normalizing flows, or even simple constrained parameterizations where you want a proper likelihood.
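One way to convince yourself the Jacobian term matters is to integrate the transformed density numerically. The sketch below takes Y = exp(Z) with Z ~ N(0, 1); with the log|J| term the density integrates to roughly 1, and without it the total mass is wrong:

```python
import numpy as np

def normal_logpdf(z):
    return -0.5 * np.log(2 * np.pi) - 0.5 * z**2

# Y = g(Z) = exp(Z): g^{-1}(y) = log y and |det J_{g^{-1}}(y)| = 1/y.
def logpdf_Y(y):
    return normal_logpdf(np.log(y)) - np.log(y)   # log p_Z(g^{-1}(y)) + log|J|

# With the Jacobian term, the transformed density integrates to ~1.
y = np.linspace(1e-6, 60.0, 1_000_000)
f = np.exp(logpdf_Y(y))
mass = np.sum(0.5 * (f[1:] + f[:-1]) * np.diff(y))   # trapezoid rule

# Dropping log|J| leaves a curve that is NOT a density (total mass ~1.65 here).
g = np.exp(normal_logpdf(np.log(y)))
bad_mass = np.sum(0.5 * (g[1:] + g[:-1]) * np.diff(y))
```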

Two ML-relevant examples: (1) Positive scale via σ=softplus(s) or σ=exp(s). If you treat σ as a deterministic parameter transform, you typically optimize over s and plug σ into the likelihood; you do not add a Jacobian term because you are not transforming a random variable, you are reparameterizing a parameter. But if σ itself were modeled as a random variable with a prior, then the transformation affects the prior density and you must include the Jacobian. (2) Simplex constraints via softmax. Softmax maps logits to probabilities; again, as a parameterization it doesn’t require a Jacobian term in the likelihood, but it determines gradients and numerical stability.

Common mistake: applying change-of-variables where it doesn’t belong (parameter reparameterization) or forgetting it where it does (density transformation). A practical debugging approach is to ask: “Am I transforming a random variable with a density, or just choosing a coordinate system for parameters?” This single question prevents many silent probabilistic errors.

Transformations are also optimization tools. Reparameterizations can improve conditioning: optimizing log-variance instead of variance avoids negative values and often yields smoother gradients. Later chapters will connect this to optimizer choice and learning-rate sensitivity.

Section 1.6: Probability as modeling language: from data generating process to loss

Here is the practical workflow you should internalize: (1) specify a data generating process, (2) write the likelihood, (3) take logs and negate to get a loss, (4) derive gradients, (5) implement with stable vectorized code, and (6) validate by sanity checks (shapes, finite values, gradient checks, and behavior on synthetic data).

Example end-to-end model (NumPy): Suppose we model real-valued targets with Gaussian noise: yᵢ = wᵀxᵢ + b + εᵢ, εᵢ ~ N(0, σ²). Then p(yᵢ|xᵢ,w,b,σ²) = N(yᵢ; wᵀxᵢ+b, σ²). For i.i.d. data, the negative log-likelihood (dropping constants that do not depend on parameters, but keeping σ terms) is: L(w,b,σ²) = (n/2)log σ² + (1/(2σ²))∑ᵢ (yᵢ − (wᵀxᵢ + b))². If σ² is fixed, minimizing L is equivalent to minimizing mean squared error; if σ² is learned, the log σ² term prevents σ² from shrinking to 0 without penalty.

Stable implementation pattern: keep computations in log space where possible, avoid explicit inverses, and vectorize. Compute residuals r = y - (X @ w + b). For a scalar variance, use inv_var = np.exp(-log_var) where log_var is unconstrained. Then: loss = 0.5 * n * log_var + 0.5 * inv_var * (r @ r). This avoids negative variances and is stable for small/large σ².

Checkpoint (derive and code Gaussian NLL): The gradient w.r.t. w is ∂L/∂w = -(1/σ²) Xᵀ r (signs matter; r is y - ŷ). In code, grad_w = -(inv_var) * (X.T @ r). The gradient w.r.t. b is grad_b = -(inv_var) * r.sum(). For log_var, differentiate through σ²=exp(log_var): ∂L/∂log_var = 0.5*n - 0.5*inv_var*(r@r). This is a common place to make an error by differentiating w.r.t. σ instead of σ², or by missing the chain rule.
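Putting the checkpoint together — loss, gradients, and a finite-difference gradient check — might look like the following sketch (the parameter packing and names are illustrative choices, not the only way to structure this):

```python
import numpy as np

def gaussian_nll(params, X, y):
    """NLL from this section: 0.5*n*log_var + 0.5*exp(-log_var)*||r||^2."""
    w, b, log_var = params[:-2], params[-2], params[-1]
    r = y - (X @ w + b)
    inv_var = np.exp(-log_var)
    return 0.5 * len(y) * log_var + 0.5 * inv_var * (r @ r)

def gaussian_nll_grad(params, X, y):
    """Analytic gradients derived above; note the signs (r = y - yhat)."""
    w, b, log_var = params[:-2], params[-2], params[-1]
    r = y - (X @ w + b)
    inv_var = np.exp(-log_var)
    grad_w = -inv_var * (X.T @ r)
    grad_b = -inv_var * r.sum()
    grad_lv = 0.5 * len(y) - 0.5 * inv_var * (r @ r)   # chain rule through exp
    return np.concatenate([grad_w, [grad_b, grad_lv]])

# Finite-difference gradient check: catches sign and chain-rule bugs early.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.0, -1.0]) + 0.3 * rng.normal(size=50)
params = rng.normal(size=4)               # [w1, w2, b, log_var]

eps, analytic = 1e-6, gaussian_nll_grad(params, X, y)
for j in range(len(params)):
    e = np.zeros_like(params); e[j] = eps
    fd = (gaussian_nll(params + e, X, y) - gaussian_nll(params - e, X, y)) / (2 * eps)
    assert abs(fd - analytic[j]) < 1e-4 * max(1.0, abs(fd))
```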

From assumptions to other objectives: If you add a prior p(w) (e.g., Gaussian), you get MAP: minimize NLL + regularizer. If you introduce latent variables z, the log-likelihood log p(x)=log ∫ p(x,z) dz can be intractable; you will later replace it with an ELBO that contains expectations under an approximate posterior. The same principle holds: probability assumptions determine the objective, and the objective determines which gradients and optimizers you need.

Common failure modes at this stage are numerical (NaNs from log(0), overflow in exp), statistical (mismatched noise model, leading to outlier sensitivity), and implementation-related (silent broadcasting errors). Treat your likelihood as production code: write it to be stable, test it on toy data with known parameters, and verify gradients before moving on to large models.

Chapter milestones
  • Map assumptions to a likelihood and a loss function
  • Compute expectations and variance with vectorized notation
  • Work with common distributions used in ML pipelines
  • Build a simple probabilistic model end-to-end in NumPy
  • Checkpoint: derive and code a Gaussian negative log-likelihood
Chapter quiz

1. In this chapter’s workflow, what is the most reliable path from a modeling assumption to a training objective?

Show answer
Correct answer: State data-generation assumptions → write a likelihood → turn it into a (negative) log-likelihood loss to optimize
The chapter emphasizes probability as a modeling language: assumptions define a likelihood, which becomes an objective (typically via negative log-likelihood).

2. Which choice best reflects the chapter’s warning about “dropping constants” when deriving a loss from a likelihood?

Show answer
Correct answer: You can drop only terms that are constant with respect to the parameters being optimized; otherwise you may change the objective
Some terms (e.g., involving σ²) may look constant but aren’t if σ is a parameter; dropping them can change the optimized solution.

3. What common failure does the chapter highlight when translating probabilistic expressions into code?

Show answer
Correct answer: Producing numerically unstable log-likelihood computations
The chapter explicitly flags numerically unstable log-likelihoods as a frequent real-world source of bugs.

4. Why does the chapter stress expressing expectations and variances in vectorized notation?

Show answer
Correct answer: To compute these quantities efficiently and correctly in array-based code while avoiding shape/axis mistakes
Vectorized notation maps directly to NumPy-style implementations and helps prevent shape mis-handling in practice.

5. A modeling assumption says: “noise is Gaussian with variance σ².” According to the chapter’s end-of-chapter goal, what should you be able to write down and implement next?

Show answer
Correct answer: The negative log-likelihood loss, its gradients, and a stable vectorized implementation
The chapter’s stated outcome is to translate such an assumption into the NLL, gradients, and stable vectorized code.

Chapter 2: Estimation—MLE, MAP, and Regularization

Training a machine learning model is, at its core, an estimation problem: you observe data and choose parameters that make those observations “most plausible” under your assumptions. This chapter turns that statement into concrete objectives you can derive, optimize, and debug in code. We start from likelihood and log-likelihood, derive maximum likelihood estimation (MLE) for linear regression and connect it directly to least squares, then extend the same framework to maximum a posteriori (MAP) estimation where priors become regularizers.

Along the way, we will focus on engineering judgment: how to write stable, vectorized log-likelihoods; how to interpret regularization strengths; how to detect failure modes like divergence, ill-conditioning, or miscalibrated predictive uncertainty; and how to implement ridge and lasso-style penalties so you can compare behavior on real datasets. The goal is not only to “know the math,” but to be able to translate probability assumptions into objectives, implement the gradients safely, and validate that your estimates make sense.

A recurring theme is that your objective function is a contract between modeling and optimization. Modeling choices (noise distribution, prior) determine the shape of the objective; optimization choices (step size, momentum, Adam, schedules) determine whether you actually reach a good solution. You will use both sides of the contract when diagnosing training issues and when deciding how to regularize for generalization.

Practice note (applies to each of this chapter’s milestones): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.


Sections in this chapter
Section 2.1: Likelihood, log-likelihood, and why logs win numerically

Suppose you have data D = {(x_i, y_i)}_{i=1..n} and a parametric model p(y|x, θ). The likelihood L(θ) is the probability of the observed data as a function of parameters: L(θ)=∏_i p(y_i|x_i, θ) (assuming conditional independence). Estimation begins by choosing θ to maximize L(θ) or, equivalently, the log-likelihood ℓ(θ)=∑_i log p(y_i|x_i, θ).

Logs win for two reasons. First, products of many probabilities underflow quickly in floating-point. A typical p(y_i|x_i, θ) might be ~1e-3; multiplying 10,000 such terms is effectively zero. Summing logs keeps values in a representable range. Second, logs turn products into sums, which makes gradients simpler and more numerically stable. This matters when you implement vectorized training loops.
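The underflow claim is easy to reproduce. In this sketch, a product of 10,000 modest per-example likelihoods collapses to exactly 0.0 in float64, while the sum of logs stays finite:

```python
import numpy as np

rng = np.random.default_rng(5)
p = rng.uniform(1e-4, 1e-2, size=10_000)   # plausible per-example likelihoods

naive_product = np.prod(p)                  # underflows to exactly 0.0
log_likelihood = np.sum(np.log(p))          # a large negative but finite number

print(naive_product, log_likelihood)
```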

  • Engineering rule: write objectives in log space, and use numerically stable building blocks (e.g., log-sum-exp for softmax likelihoods; stable sigmoid cross-entropy functions).
  • Gradient rule: differentiate the log-likelihood, not the likelihood. You avoid multiplying tiny numbers and you get additive structure.

In code, you usually implement the negative log-likelihood (NLL) because optimizers minimize: NLL(θ)=−ℓ(θ). For Gaussian regression, NLL becomes a sum of squared errors plus constants; for classification, NLL becomes cross-entropy. Always separate “model output” from “likelihood computation”: compute logits or predictions first, then compute a stable NLL via library routines or carefully written formulas. Common mistakes include forgetting the log, averaging vs summing inconsistently (which changes the effective learning rate), and mixing units (e.g., applying regularization to a loss that is averaged per-example without scaling the regularizer accordingly).

Finally, remember that likelihood is conditional on your modeling assumptions. If the assumed noise distribution is wrong (e.g., heavy-tailed residuals but you used Gaussian), MLE can become overly sensitive to outliers. This is not an optimization bug—it is a modeling choice that changes the objective’s geometry and robustness.

Section 2.2: Maximum likelihood estimation and identifiability

MLE chooses θ̂ = argmax_θ ℓ(θ). For linear regression, assume y_i = w^T x_i + b + ε_i with ε_i ~ N(0, σ^2). Then p(y_i|x_i, w, b) is Gaussian with mean w^T x_i + b. The log-likelihood (up to constants) is −(1/(2σ^2))∑_i (y_i − (w^T x_i + b))^2. Maximizing this is equivalent to minimizing the sum of squared residuals. This is the clean connection: least squares is MLE under Gaussian noise.

In matrix form with design matrix X (n×d), parameters w (d×1), targets y (n×1), and optionally an intercept term absorbed by augmenting X with a column of ones, the MLE solves min_w ||y − Xw||^2. If X has full column rank, the closed-form solution is ŵ = (X^T X)^{-1} X^T y. In practice, you rarely invert explicitly; you use a solver (Cholesky, QR) for numerical stability.
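As a sketch of the closed form versus a solver, with synthetic data; NumPy's SVD-backed least-squares routine stands in for "use a solver" (a Cholesky or QR solve on the normal equations would also work):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)

# Normal equations with an explicit inverse: fine for illustration,
# but numerically fragile when X^T X is ill-conditioned.
w_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Preferred: a dedicated least-squares solver handles conditioning for you.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

On this well-conditioned toy problem the two agree to many digits; the gap appears when features are nearly collinear.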

  • Identifiability: if X^T X is singular (e.g., duplicated features, d>n, perfect multicollinearity), there are infinitely many minimizers. Optimization may “work” but your solution depends on initialization and implicit biases (e.g., minimum-norm solutions with certain optimizers).
  • Conditioning: even when invertible, a poorly conditioned X^T X leads to slow convergence for gradient methods and sensitivity to noise. Feature scaling and regularization are practical fixes.

Implementing MLE as gradient descent is straightforward and useful because it generalizes beyond closed-form models. For squared loss, the gradient is ∇_w (1/2n)||y − Xw||^2 = −(1/n) X^T (y − Xw). Vectorize it: compute residual r = Xw − y, then grad = (1/n) X^T r. Use stable data types (float32 is fine with scaling; float64 can help diagnostics). A common mistake is mixing shapes and accidentally computing elementwise products instead of matrix multiplies—tests with small synthetic data (where you know the solution) catch this early.
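The vectorized gradient step can be sketched as a short NumPy loop; the data is synthetic and the step size is an illustrative choice that suits this well-scaled problem:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 500, 10
X = rng.normal(size=(n, d))        # standardized features keep X^T X well-conditioned
w_true = rng.normal(size=d)
y = X @ w_true + 0.05 * rng.normal(size=n)

w = np.zeros(d)
lr = 0.1                           # illustrative step size for this scaling
for _ in range(2000):
    r = X @ w - y                  # residual, shape (n,)
    grad = X.T @ r / n             # vectorized gradient of (1/2n)||Xw - y||^2
    w -= lr * grad
```

Comparing the result against a closed-form least-squares solve on the same data is exactly the kind of small synthetic test that catches shape and transpose bugs early.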

Diagnose identifiability by inspecting rank, singular values, or by watching training: if loss decreases but parameters explode or drift without stabilizing, you may have an underdetermined system or an overly large learning rate. This is where probability assumptions meet optimization reality: the objective can be flat in some directions, and your optimizer will reveal it.

Section 2.3: Maximum a posteriori estimation and conjugate priors (overview)

MAP estimation incorporates prior beliefs about parameters. Instead of maximizing p(D|θ), MAP maximizes the posterior p(θ|D) ∝ p(D|θ)p(θ). Taking logs: θ̂_MAP = argmax_θ [ℓ(θ) + log p(θ)]. Equivalently, minimize NLL plus a penalty term: −ℓ(θ) − log p(θ). This is the core bridge between Bayesian thinking and everyday regularization.

Conjugate priors are priors that yield posteriors in the same family, simplifying analysis. You do not need conjugacy to train with MAP (gradient-based optimization works regardless), but conjugacy gives intuition and sometimes closed forms. Examples: a Gaussian prior on a Gaussian mean; a Gamma prior on a Gaussian precision (inverse variance); a Beta prior on a Bernoulli probability. For linear regression with Gaussian noise and a Gaussian prior on weights w ~ N(0, τ^2 I), the MAP objective becomes squared error plus an L2 penalty on w.

  • Interpretation: the prior adds “pseudo-data” or preference for certain parameter values (often small magnitude), which stabilizes estimation when data is scarce or features are correlated.
  • Scaling: be careful about whether your loss is summed over n examples or averaged. The prior term does not automatically scale with n; you must choose a convention and keep it consistent when tuning regularization.

In code, MAP training often looks identical to MLE training, except you add a penalty term and its gradient. This is valuable: you can keep the same stable log-likelihood implementation, then plug in different priors/penalties. When you later encounter variational objectives (ELBO), the pattern repeats: likelihood-like terms plus regularizing/KL-like terms.
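One way this "same loop, extra penalty" pattern can look in NumPy, assuming a mean squared-error data term (Gaussian likelihood up to constants) and an L2 penalty corresponding to a Gaussian prior; λ here is an illustrative strength tied to the averaged loss convention:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 300, 8
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
lam = 0.5   # illustrative penalty strength; tied to the *mean* data loss below

def loss_and_grad(w):
    r = X @ w - y
    data_loss = 0.5 * np.mean(r ** 2)       # Gaussian NLL up to constants, averaged
    penalty = 0.5 * lam * np.sum(w ** 2)    # -log of a Gaussian prior, up to constants
    grad = X.T @ r / n + lam * w            # likelihood gradient + prior gradient
    return data_loss + penalty, grad

w = np.zeros(d)
for _ in range(3000):
    _, g = loss_and_grad(w)
    w -= 0.1 * g
```

Because this MAP objective is ridge regression, the loop can be validated against the closed-form solution of (X^T X/n + λI)w = X^T y/n.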

Common mistakes include interpreting the penalty coefficient without reference to noise scale (σ^2) and data scaling. A Gaussian prior strength is meaningful only relative to the likelihood curvature; standardizing features and using consistent loss normalization makes regularization tuning far more predictable.

Section 2.4: Regularization as prior knowledge (L2, L1, elastic net)

Regularization is MAP in disguise when the penalty corresponds to −log p(θ). The practical viewpoint: regularization reshapes the objective so that optimization is better conditioned and generalization improves. The modeling viewpoint: it encodes preference for simpler parameterizations.

L2 (ridge): add (λ/2)||w||^2. This corresponds to a zero-mean isotropic Gaussian prior on w. Ridge is smooth and convex, so gradients are easy: add λw to the gradient. Ridge also improves conditioning by making X^T X + λI invertible, which resolves identifiability issues in linear regression and stabilizes optimization in high dimensions.

L1 (lasso): add λ||w||_1, corresponding to a Laplace prior. L1 encourages sparsity (many weights exactly zero) but is not differentiable at 0. In practice you use subgradients, proximal methods (soft-thresholding), or rely on optimizers that can handle non-smoothness via proximal steps. If you naïvely apply standard gradient descent with an arbitrary “sign” at 0, you can get jitter around zero and slow convergence.
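The soft-thresholding (proximal) operator is short enough to sketch directly; the ISTA-style update in the comment is one standard way to use it, and the helper name is mine:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t*||w||_1: shrink magnitudes by t, clipping at exactly 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

# One ISTA-style proximal gradient step for (1/2n)||Xw - y||^2 + lam*||w||_1 would be:
#   w = soft_threshold(w - lr * X.T @ (X @ w - y) / n, lr * lam)
shrunk = soft_threshold(np.array([-1.5, -0.2, 0.0, 0.3, 2.0]), 0.5)
```

Note that small coefficients land on exactly zero, which is what plain subgradient descent fails to achieve.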

Elastic net: add λ1||w||_1 + (λ2/2)||w||^2. This combines sparsity with ridge-style stability, often preferable when features are correlated: pure L1 may pick one feature arbitrarily among a correlated group, while elastic net tends to share weight more sensibly.

  • Checkpoint implementation goal: implement ridge and lasso-style penalties in the same training loop and compare coefficient paths, sparsity, and validation error as you vary λ.
  • Do not regularize the intercept by default. Regularizing b can introduce bias that is rarely intended; handle it separately.

From an engineering perspective, treat regularization as part of the objective definition, not an afterthought. Keep penalties explicit in your loss function, log them separately (data loss vs penalty), and ensure gradients match. A very common bug is accidentally applying weight decay (L2) twice: once via an explicit penalty and once via optimizer settings (e.g., AdamW). Decide on one method and verify by checking the magnitude of parameter updates.

Section 2.5: Bias–variance trade-off and generalization connections

Why does regularization help on test data even though it worsens the training optimum? Because we care about generalization, not maximizing likelihood on the observed sample alone. In classical terms, regularization increases bias (parameters are pulled toward the prior/zero) but reduces variance (parameters fluctuate less across different samples). The net effect can reduce expected test error.

In linear regression, this trade-off is visible: with many features or noisy targets, the unregularized least-squares solution can have large coefficients that fit noise. Ridge shrinks coefficients, which may slightly increase training error but typically decreases test error. Lasso can further improve interpretability and sometimes prediction by removing irrelevant features.

  • Practical workflow: choose a metric (e.g., RMSE, log-likelihood), split data, tune λ via validation (or cross-validation), then refit on train+val with the chosen λ and evaluate once on test.
  • Calibration check: if your model outputs probabilistic predictions, evaluate whether predicted uncertainty matches empirical error (e.g., standardized residuals for Gaussian regression, reliability curves for classification). MLE/MAP can be well-optimized but still miscalibrated if the noise model is wrong.

Regularization also interacts with optimization. Strong L2 makes the objective more strongly convex and smoother, improving conditioning and often allowing larger learning rates. Conversely, L1 introduces non-smoothness; you may need smaller steps or a proximal optimizer. When training “fails,” ask whether it is a generalization issue (overfitting) or an optimization issue (divergence/plateau). They can look similar on validation curves, but training curves and gradient norms usually differentiate them.

Finally, connect this to probabilistic assumptions: choosing a prior is not merely a hack. It is a statement about plausible parameter scales. If you standardize features, a Gaussian prior with a single τ becomes meaningful across dimensions; without scaling, the same τ implies very different beliefs per feature and makes tuning feel arbitrary.

Section 2.6: Practical estimation: constraints, initialization, and sanity checks

Estimation becomes practical when you can trust the output. That trust comes from constraints, initialization choices, and sanity checks that catch silent failures early.

Constraints: some parameters must be positive (variances), probabilities must lie in [0,1], and covariance matrices must be PSD. Encode constraints by reparameterization (e.g., σ = softplus(s) + ε) rather than clamping, which can create zero gradients. In MAP/MLE loops, constraints prevent invalid log-likelihood values (NaNs from log of negative numbers) and improve optimizer behavior.
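A minimal sketch of the reparameterization idea, using a numerically stable softplus; the ε floor is an illustrative choice:

```python
import numpy as np

def softplus(s):
    # log(1 + e^s), computed stably for large |s| via logaddexp.
    return np.logaddexp(0.0, s)

# Optimize an unconstrained s; the likelihood only ever sees sigma > 0,
# so terms like log(sigma) can never produce NaNs.
eps = 1e-6                        # small floor, an illustrative choice
s = np.array([-50.0, 0.0, 50.0])  # unconstrained values, including extremes
sigma = softplus(s) + eps
```

Unlike clamping, this mapping has a nonzero gradient everywhere, so the optimizer can always move s.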

Initialization: for linear regression, initializing w=0 is often fine after feature standardization. For more complex likelihoods (logistic regression, Poisson), poor initialization can produce saturated probabilities and tiny gradients. A practical trick is to initialize biases to match marginal target rates (e.g., logit of positive class frequency) so the first gradients are informative.

  • Stable log-likelihood implementation: compute per-example log-probabilities with stable primitives, then reduce (sum/mean). Track both the data term and the regularizer term.
  • Diagnostics: monitor loss, gradient norms, parameter norms, and (for probabilistic models) average predicted variance/probabilities. Sudden spikes suggest learning-rate issues; steadily growing parameter norms with flat loss suggests identifiability or missing regularization.
  • Sanity checks: fit on a tiny dataset and confirm you can overfit (loss near zero for squared loss). Shuffle labels; performance should drop to chance. Compare against a closed-form ridge solution on small problems to validate your gradient code.

Calibration and residual analysis are the final step. For Gaussian regression, plot residuals vs predictions to see heteroscedasticity (variance changing with x). If residual tails are heavy, consider robust likelihoods (e.g., Student-t) rather than forcing Gaussian MLE. If probabilities are overconfident, consider stronger regularization, better features, or a better likelihood model.

By the end of this chapter, you should be able to derive MLE and MAP objectives from probability assumptions, implement stable vectorized losses, add ridge/lasso/elastic-net penalties correctly, and validate estimates with targeted diagnostics. Those skills make the next steps—choosing optimizers, tuning schedules, and diagnosing conditioning—far more systematic rather than trial-and-error.

Chapter milestones
  • Derive MLE for linear regression and connect it to least squares
  • Derive MAP and show how priors become regularizers
  • Implement MLE/MAP training loops with stable log-likelihoods
  • Validate estimates with diagnostics and calibration checks
  • Checkpoint: implement ridge and lasso-style penalties and compare
Chapter quiz

1. In this chapter’s framework, why does deriving MLE for linear regression connect directly to least squares?

Show answer
Correct answer: Because maximizing a Gaussian noise log-likelihood is equivalent to minimizing the sum of squared residuals
Under a common assumption of Gaussian observation noise, the negative log-likelihood becomes a squared-error objective, yielding least squares.

2. What is the key conceptual change when moving from MLE to MAP estimation in this chapter?

Show answer
Correct answer: You add a prior term so maximizing posterior probability becomes likelihood plus a regularization-like penalty
MAP incorporates a prior over parameters; in the objective, this appears as an additional term that acts like a regularizer.

3. The chapter emphasizes writing stable log-likelihood code. What is the main practical reason for using log-likelihoods in training loops?

Show answer
Correct answer: They turn products of probabilities into sums, reducing numerical underflow and improving stability
Log-likelihoods convert multiplicative probability terms into additive ones, which is more numerically stable and easier to optimize.

4. The chapter describes the objective function as a “contract” between modeling and optimization. Which pairing best matches that idea?

Show answer
Correct answer: Modeling choices determine the objective’s shape; optimization choices determine whether you reach a good solution
Noise model and prior define the objective; the optimizer and its hyperparameters determine whether training successfully finds a good minimum.

5. When implementing ridge and lasso-style penalties as discussed in the chapter, what is the most accurate high-level interpretation of the regularization strength?

Show answer
Correct answer: It controls how strongly the estimate is pulled toward simpler parameter values, affecting generalization
Regularization strength sets how heavily the penalty influences the fit, shaping parameter estimates and often improving generalization.

Chapter 3: Gradients, Matrix Calculus, and Backprop Intuition

This chapter is the bridge between “I know the objective” and “I can optimize it reliably in code.” In machine learning, we rarely minimize a function by inspection; we minimize it by repeatedly asking the same question: which direction decreases the objective fastest? The answer is the gradient. But to use gradients well, you need three complementary skills: (1) derive them cleanly for common models, (2) vectorize them so your implementation matches the math and runs fast, and (3) debug them when they’re wrong—because gradient bugs can silently ruin training.

We will compute gradients for linear and logistic regression by hand, connect the algebra to matrix calculus rules you will reuse constantly, and then reframe all of it as backpropagation on a computational graph. You’ll also implement stable softmax cross-entropy using the log-sum-exp trick—a “checkpoint” derivation that shows the difference between correct theory and robust engineering. Finally, you’ll learn gradient checking with finite differences, when it works, and when it gives false confidence.

Practical outcome: after this chapter, you should be able to look at a loss function, derive a vectorized gradient, implement it in a few lines without loops, and verify correctness before you ever start tuning optimizers.

Practice note for Compute gradients for linear and logistic regression by hand: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Vectorize derivatives and match them to efficient code: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement gradient checking to catch silent bugs: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Connect computational graphs to backprop and autodiff: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Checkpoint: derive and implement softmax cross-entropy stably: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 3.1: Gradient, Jacobian, Hessian—what each tells you in ML

The gradient is the first-order sensitivity of a scalar objective with respect to parameters. If your loss is L(θ) (scalar) and θ ∈ R^d, then ∇_θ L ∈ R^d. In optimization, the gradient gives the steepest ascent direction; -∇L gives steepest descent. Engineering judgement: gradients also tell you about scale. If gradients are consistently tiny, updates will be negligible at any reasonable learning rate—often a sign the model is saturating (common with sigmoid/tanh). If gradients explode, you may have a numerical issue, poor conditioning, or an unstable parameterization.

The Jacobian generalizes this to vector-valued functions. If f(θ) ∈ R^m, then J = ∂f/∂θ ∈ R^{m×d}. In ML, Jacobians show up when you compose layers: each layer maps a vector to a vector. Backprop uses Jacobian-vector products (or vector-Jacobian products) without ever materializing full Jacobian matrices—because they can be enormous. When you hear “efficient backprop,” it’s really about computing these products in linear time.

The Hessian is the matrix of second derivatives: H = ∇^2_θ L ∈ R^{d×d}. Conceptually, the Hessian describes local curvature. Practically, it tells you about conditioning: if eigenvalues vary widely, the loss valley is narrow in some directions and flat in others, and plain gradient descent will zig-zag. You usually won’t compute full Hessians for deep nets, but thinking in terms of curvature helps you choose optimizers (momentum/Adam), learning rates, and normalization strategies.

  • Gradient: “Which way down?”
  • Jacobian: “How do intermediate vectors change?”
  • Hessian: “How curved is the surface, and is it ill-conditioned?”

Common mistake: mixing shapes. Write shapes in the margin. If X is (n×d) and w is (d), then Xw is (n). If your derivative produces the wrong shape, something is off—often a missing transpose or an unintended broadcasting rule in code.

Section 3.2: Matrix calculus rules used constantly in derivations

Matrix calculus becomes manageable when you commit to a convention and a few reusable identities. In this course we treat gradients of scalar functions with respect to vectors as column-shaped (even if your code uses 1-D arrays). This keeps transposes predictable.

Rules you’ll use repeatedly:

  • Derivative of a linear form: if f(w)=a^T w, then ∇_w f = a.
  • Quadratic form: if f(w)=(1/2)||Xw−y||^2 = (1/2)(Xw−y)^T (Xw−y), then ∇_w f = X^T(Xw−y). The 1/2 is not aesthetic—it cancels the 2 from differentiating the square.
  • Chain rule (vector form): if z = Xw and L = g(z) where L is scalar, then ∇_w L = X^T ∇_z L. This is the backbone of backprop in linear layers.
  • Elementwise nonlinearity: if z = σ(a) elementwise, then ∇_a L = ∇_z L ⊙ σ'(a), where ⊙ is the Hadamard (elementwise) product.

Vectorization workflow: start from per-example derivatives, then stack them. For a dataset {(x_i,y_i)}, you can often derive ∂ℓ_i/∂w and then sum: ∇L = Σ_i ∇ℓ_i. Once you recognize the pattern, replace sums with matrix products (e.g., X^T r where r is a residual vector). This not only speeds up code but reduces indexing bugs.

Common mistake: accidentally using X when you need X^T. A quick sanity check is to verify output dimensions: the gradient must have the same shape as the parameter. Another practical check is numerical scale: for mean losses, prefer dividing by n in the loss and the gradient so learning-rate tuning is more stable across batch sizes.

Section 3.3: Logistic regression derivation: sigmoid, log-loss, and gradients

Logistic regression is the canonical example where probability assumptions become an objective and then become gradients. Model: for input x ∈ R^d, predict probability of class 1 as p = σ(z) with z = w^T x + b and σ(z)=1/(1+e^{-z}). Likelihood for a label y ∈ {0,1} is p(y|x)=p^y(1-p)^{1-y}. Negative log-likelihood (log-loss) for one example:

ℓ(w,b) = -[ y log p + (1-y) log(1-p) ].

Differentiate cleanly by using a key identity: σ'(z)=σ(z)(1-σ(z)). First compute ∂ℓ/∂z. A standard result (worth deriving once) is:

∂ℓ/∂z = p - y.

Then apply the chain rule. Since z = w^T x + b, we have ∂z/∂w = x and ∂z/∂b = 1. So:

  • ∂ℓ/∂w = (p - y) x
  • ∂ℓ/∂b = (p - y)

Vectorize over n examples. Let X be (n×d), w be (d), b scalar, z = Xw + b (broadcast), p = σ(z), and y be (n). For the mean loss L = (1/n) Σ ℓ_i, the gradients are:

∇_w L = (1/n) X^T (p - y), and ∂L/∂b = (1/n) Σ_i (p_i - y_i).

Engineering notes: implement σ carefully for large magnitudes. For large positive z, exp(-z) underflows safely to 0; for large negative z, exp(-z) overflows. A stable sigmoid can be written with a conditional or by using logaddexp-based formulations for the loss. Also, always include regularization explicitly in both loss and gradient (e.g., add λw to ∇_w for L2) and ensure you do not regularize the bias unless you intend to.
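A possible implementation of these formulas, with a piecewise-stable sigmoid and a logaddexp-based loss; the per-example identity −[y log p + (1−y) log(1−p)] = softplus(z) − yz follows from the definitions above, and the helper names are mine:

```python
import numpy as np

def sigmoid(z):
    # Piecewise-stable: never exponentiate a large positive argument.
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def nll_and_grads(X, y, w, b):
    n = X.shape[0]
    z = X @ w + b
    p = sigmoid(z)
    # Mean log-loss via -[y log p + (1-y) log(1-p)] = softplus(z) - y*z,
    # with softplus computed stably as logaddexp(0, z).
    nll = np.mean(np.logaddexp(0.0, z) - y * z)
    grad_w = X.T @ (p - y) / n
    grad_b = np.mean(p - y)
    return nll, grad_w, grad_b
```

Both pieces stay finite even for |z| in the thousands, where a textbook 1/(1+e^{-z}) and a naive y log p would overflow or produce log(0).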

Section 3.4: Softmax, log-sum-exp trick, and stable cross-entropy

Multiclass classification replaces the sigmoid with softmax. For logits z ∈ R^K, softmax probabilities are p_k = exp(z_k) / Σ_j exp(z_j). With one-hot label y, cross-entropy loss is ℓ = - Σ_k y_k log p_k (equivalently -log p_{true}).

The numerical trap is exp(z). If any logit is large (say 100), exp(100) overflows in float32/float64. The standard fix is the log-sum-exp trick, using the identity:

log Σ_j exp(z_j) = m + log Σ_j exp(z_j - m), where m = max_j z_j.

Stable implementation for a batch: logits Z shape (n×K). Compute M = max(Z, axis=1, keepdims=True), then logsumexp = M + log( Σ exp(Z-M) ). The per-example loss can be written without explicitly forming p first:

ℓ_i = - z_{i,y_i} + logsumexp_i.

This form is both stable and fast. It also makes the gradient derivation clean. For softmax + cross-entropy, a “miracle” simplification occurs:

∂ℓ_i/∂z_i = p_i - y_i (vector of length K).

For a linear classifier Z = XW + b with W ∈ R^{d×K}, the vectorized gradients for mean loss are:

  • ∇_W L = (1/n) X^T (P - Y)
  • ∂L/∂b = (1/n) Σ_i (P_i - Y_i)

Here P is the matrix of softmax probabilities and Y is the one-hot label matrix. Practical outcome: this matches the logistic regression gradient pattern exactly—residuals times inputs—just in matrix form. Common mistakes: forgetting to subtract the per-row max (a single global max can make an entire row underflow, yielding log 0), mixing integer labels with one-hot matrices incorrectly, and averaging inconsistently (loss averaged, gradient summed). Keep loss/gradient scaling consistent so learning rates transfer across batch sizes.
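Putting the row-wise max trick, the ℓ_i = −z_{i,y_i} + logsumexp_i form, and the (P − Y) gradient together, a compact NumPy sketch (integer labels assumed rather than one-hot; the function name is mine):

```python
import numpy as np

def softmax_xent(Z, y_idx):
    """Mean cross-entropy and its gradient w.r.t. logits.
    Z: (n, K) logits; y_idx: (n,) integer class labels."""
    n = Z.shape[0]
    M = Z.max(axis=1, keepdims=True)             # per-row max, not a global max
    shifted = Z - M
    logsumexp = M[:, 0] + np.log(np.exp(shifted).sum(axis=1))
    loss = np.mean(logsumexp - Z[np.arange(n), y_idx])   # -z_true + logsumexp
    P = np.exp(shifted)
    P /= P.sum(axis=1, keepdims=True)
    G = P
    G[np.arange(n), y_idx] -= 1.0                # P - Y with Y one-hot
    return loss, G / n                           # gradient of the *mean* loss
```

Feeding in logits of 1000 produces a finite loss rather than an overflow, and the gradient can be checked against finite differences.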

Section 3.5: Gradient checking with finite differences (and pitfalls)

Even with correct-looking math, implementations fail via off-by-one indexing, wrong broadcasting, missing averaging, or transposed matrices. Gradient checking catches these bugs early by comparing your analytic gradient to a numerical approximation.

For a scalar loss L(θ), the central difference approximation for component k is:

ĝ_k = [L(θ + ε e_k) - L(θ - ε e_k)] / (2ε).

Use central differences (not forward differences) for better accuracy. Choose ε around 1e-5 to 1e-4 in float64; too small causes catastrophic cancellation, too large measures curvature rather than slope. Compare with a relative error such as:

rel_err = ||g - ĝ|| / max(1, ||g||, ||ĝ||).

  • Start with tiny models and small random data; run in float64 for checks.
  • Check a random subset of parameters (e.g., 20 indices) to keep it fast.
  • Freeze randomness: turn off dropout, data augmentation, and sampling noise.
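A minimal gradient checker along these lines—central differences at a random subset of coordinates, scored with the relative-error formula above (function name and defaults are mine):

```python
import numpy as np

def grad_check(loss_fn, grad_fn, theta, eps=1e-5, n_checks=20, seed=0):
    """Compare an analytic gradient to central differences at a random
    subset of coordinates; returns the relative error from the text."""
    g = grad_fn(theta)
    rng = np.random.default_rng(seed)
    idx = rng.choice(theta.size, size=min(n_checks, theta.size), replace=False)
    g_hat = np.zeros(idx.size)
    for j, k in enumerate(idx):
        e = np.zeros_like(theta)
        e[k] = eps
        g_hat[j] = (loss_fn(theta + e) - loss_fn(theta - e)) / (2 * eps)
    g_sub = g[idx]
    return np.linalg.norm(g_sub - g_hat) / max(
        1.0, np.linalg.norm(g_sub), np.linalg.norm(g_hat))
```

On a quadratic loss with a correct analytic gradient the relative error sits near machine precision; handing it a deliberately wrong gradient produces an error orders of magnitude larger, which is the signal you are looking for.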

Pitfalls: (1) nondifferentiable points (ReLU at 0) where finite differences may disagree with your chosen subgradient; (2) batchnorm/train-mode behavior where the loss depends on batch statistics; (3) regularization terms forgotten in either loss or gradient; (4) evaluating the loss with different averaging conventions than the gradient. Also, gradient checking verifies your code matches your loss, not that your loss is the right objective—so use it as a debugging tool, not a guarantee of model quality.

Section 3.6: Autodiff mental model: computational graphs and chain rule

Backpropagation is the chain rule applied to a computational graph. Each node is an operation (matrix multiply, add, sigmoid, log-sum-exp), and edges carry values forward. In reverse-mode autodiff, you compute the loss once (forward pass) and then propagate sensitivities backward (reverse pass).

Mental model: every intermediate tensor v gets an associated “adjoint” \bar{v} = ∂L/∂v. The reverse pass accumulates contributions: if v feeds multiple downstream operations, their gradients add. This is why frameworks talk about “accumulating gradients.”

Example pattern that appears everywhere: z = Xw. In forward pass you compute z. In backward pass, given \bar{z} (same shape as z), you get:

  • \bar{w} = X^T \bar{z}
  • \bar{X} = \bar{z} w^T (rarely needed unless you backprop into inputs)

This matches the matrix calculus rule from Section 3.2 and explains why the transpose appears: it’s the linear map that routes gradients back through the multiplication. For elementwise nonlinearities (sigmoid, ReLU), the backward rule is just multiplication by the local derivative, which is cheap and parallelizable.
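The z = Xw pattern and its backward rules can be sketched concretely; here the downstream loss 0.5·||z||^2 is an arbitrary stand-in chosen so that the adjoint of z has a simple closed form:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(6, 3))
w = rng.normal(size=3)

# Forward: z = Xw. Pretend the downstream loss is L = 0.5*||z||^2,
# an arbitrary choice that makes the adjoint of z simply z itself.
z = X @ w
z_bar = z                      # z_bar = dL/dz

# Backward through the matmul: the transpose routes gradients back.
w_bar = X.T @ z_bar            # same shape as w, matching Section 3.2's rule
X_bar = np.outer(z_bar, w)     # same shape as X; rarely needed in practice
```

A finite-difference check on L(w) = 0.5·||Xw||^2 confirms that w_bar is the true gradient, tying the adjoint picture back to Section 3.5.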

Engineering judgement: understanding the graph helps you debug shape issues and performance. If your implementation uses explicit loops over examples, you are effectively building a graph with repeated scalar operations—slow and error-prone. Vectorized code builds one batched graph and lets the backend fuse operations. Also, stable losses (like log-sum-exp cross-entropy) are not just about avoiding overflow; they improve gradient signal quality by preventing NaNs that poison the entire backward pass.

Practical outcome: you should be able to derive gradients “by hand” for core models, then trust autodiff for complex architectures—while still diagnosing failures by inspecting which nodes produce extreme values or zero gradients.

Chapter milestones
  • Compute gradients for linear and logistic regression by hand
  • Vectorize derivatives and match them to efficient code
  • Implement gradient checking to catch silent bugs
  • Connect computational graphs to backprop and autodiff
  • Checkpoint: derive and implement softmax cross-entropy stably
Chapter quiz

1. Why does this chapter emphasize vectorizing gradients (writing derivatives in matrix form) rather than implementing them with explicit loops?

Show answer
Correct answer: Vectorized gradients align the math with efficient code and reduce opportunities for indexing/broadcasting bugs
The chapter frames vectorization as the bridge from derivations to reliable, fast implementations, while also reducing common implementation mistakes.

2. What is the key purpose of gradient checking with finite differences in the workflow described in the chapter?

Show answer
Correct answer: To verify that an implemented gradient matches the loss by approximating derivatives numerically, catching silent bugs before tuning optimizers
Gradient checking is presented as a debugging tool to confirm correctness of your gradient implementation before serious optimization.

3. How does the chapter connect computational graphs to backpropagation/autodiff?

Show answer
Correct answer: Backprop computes gradients by systematically applying local derivative rules through the computational graph
The chapter reframes gradient derivations as backprop on a graph: compose local derivatives to get gradients efficiently.

4. What problem is the log-sum-exp trick addressing in the chapter’s stable softmax cross-entropy checkpoint?

Show answer
Correct answer: Numerical instability from large or small exponentials when computing softmax/log probabilities
Stable softmax cross-entropy is highlighted as robust engineering: avoid overflow/underflow by computing log-sum-exp stably.

5. Why can gradient checking sometimes give “false confidence,” according to the chapter summary?

Show answer
Correct answer: Finite-difference checks can be misleading in some situations, so passing them doesn’t guarantee the implementation is correct in all cases
The chapter notes limits of finite-difference checks: they can fail to reveal certain issues, so they should be used thoughtfully.

Chapter 4: Optimization Basics—Convexity, Conditioning, and First-Order Methods

Training an ML model is usually “just” minimizing a function, but the details of that function decide whether your optimization will be reliable, fast, and numerically stable. In earlier chapters you translated probability assumptions into objectives like negative log-likelihood, MAP, or ELBO. This chapter focuses on what happens next: you now have an objective, and you must pick an optimizer and make it work in code.

The key engineering mindset is to treat optimization as a system: objective geometry (convexity, curvature), algorithm choice (GD, SGD, momentum), and implementation details (vectorization, stable numerics, stopping rules). The same gradient formula can behave very differently depending on step size, conditioning, and noise. When training fails—divergence, plateaus, exploding loss, or “it only learns with a tiny learning rate”—you will debug by mapping symptoms back to these concepts.

We will build a practical workflow: (1) write the optimization problem precisely, including constraints and regularization; (2) assess convexity to understand what guarantees you can rely on; (3) reason about smoothness to pick step sizes and line searches; (4) measure conditioning to predict slow directions; (5) implement gradient descent variants with minibatching and robust stopping; and (6) add momentum to accelerate while damping noise. By the end, you should be able to implement and tune first-order optimizers and diagnose common failure modes with clear corrective actions.

Practice note for Assess convexity and pick an optimizer accordingly: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Implement gradient descent with line search and stopping rules: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Measure conditioning and understand its training impact: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Use momentum to accelerate and smooth updates: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Checkpoint: build a robust optimizer module for a toy objective: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 4.1: Optimization problem setup: objectives, constraints, and notation
Section 4.2: Convex sets and convex functions (ML-relevant tests)
Section 4.3: Lipschitz gradients, smoothness, and step-size intuition
Section 4.4: Conditioning, curvature, and why training can be slow
Section 4.5: Gradient descent variants: batch vs stochastic, minibatching
Section 4.6: Momentum and Nesterov acceleration (conceptual + practical)

Section 4.1: Optimization problem setup: objectives, constraints, and notation

Start with a precise problem statement. In ML we typically minimize an objective of the form

Unconstrained: minimize f(θ) where θ ∈ ℝ^d. Commonly f(θ)= (1/n)∑ᵢ ℓ(θ; xᵢ,yᵢ) + λR(θ), where ℓ is a data-fit loss (e.g., negative log-likelihood) and R is regularization (e.g., ‖θ‖²/2 for weight decay). Writing the averaging and regularization explicitly matters, because it changes gradient scale and therefore learning-rate choices.

Constrained: minimize f(θ) subject to θ ∈ C. Constraints appear when parameters must be nonnegative (rates, variances), must lie on a simplex (mixture weights), or must satisfy norm bounds. In practice, you often remove constraints via reparameterization (e.g., σ=softplus(s) to enforce σ>0, or π=softmax(u) for a simplex), which lets you keep unconstrained optimizers while respecting the model.

Notation and implementation should align. Decide early whether you treat θ as a flat vector or structured tensors. For optimization logic it is helpful to view θ as a vector; for code you may keep structured arrays but implement operations like “dot over all parameters” and “norm of gradient.” For a robust optimizer module, define a minimal interface:

  • value(θ): returns f(θ)
  • grad(θ): returns ∇f(θ) with the same structure as θ
  • value_and_grad(θ): computes both in one pass when possible (efficiency)

Finally, define stopping rules up front: (1) max iterations; (2) gradient norm below a tolerance; (3) small relative decrease in objective; and (4) safety stops for NaN/Inf. These are not “extras”; they are part of making optimization dependable, especially when you later test different step sizes or minibatch noise.

Section 4.2: Convex sets and convex functions (ML-relevant tests)

Convexity is less about “being fancy” and more about knowing what kinds of failures are possible. If f is convex and differentiable, any local minimum is global, and gradient methods with suitable step sizes behave predictably. If f is nonconvex (deep networks, many latent-variable models), you can still optimize, but you should expect local minima, saddle points, and heavier dependence on initialization and scheduling.

A set C is convex if for any θ₁,θ₂ ∈ C and any t∈[0,1], the point tθ₁+(1−t)θ₂ is also in C. This is why constraints like “θ ≥ 0” (elementwise) or “‖θ‖₂ ≤ r” are convex, while “θ is sparse with exactly k nonzeros” is not. When constraints are convex, projection-based methods are possible (projected gradient descent), though in ML reparameterization is often simpler.

A function is convex if f(tθ₁+(1−t)θ₂) ≤ tf(θ₁)+(1−t)f(θ₂). Practical ML tests you should remember:

  • Hessian test: if ∇²f(θ) is positive semidefinite for all θ, then f is convex. For least squares, ∇²f = XᵀX/n + λI, clearly PSD.
  • Composition rules: if g is convex and nondecreasing and h is convex, then g∘h is convex. This helps when analyzing losses like logistic loss.
  • Log-sum-exp: LSE(a)=log∑ᵢ exp(aᵢ) is convex. This underlies why softmax cross-entropy is convex in the linear logits but not in deep network parameters.

How does this guide optimizer choice? If your objective is convex and smooth (e.g., ridge regression, logistic regression), you can expect stable convergence from batch gradient descent or quasi-Newton methods. If it is convex but nonsmooth (e.g., L1 regularization), you might need subgradients or proximal methods. If it is nonconvex, first-order methods still dominate in large-scale ML, but you should lean on heuristics: learning-rate schedules, momentum, normalization, and careful monitoring.
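Two of the tests above are easy to sanity-check numerically. The following sketch checks the Hessian test for ridge least squares and the midpoint inequality for log-sum-exp; these are illustrations on random data, not proofs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hessian test for ridge least squares: H = X^T X / n + lam*I is PSD,
# and its smallest eigenvalue is at least lam.
X = rng.normal(size=(100, 5))
n, lam = X.shape[0], 0.1
H = X.T @ X / n + lam * np.eye(5)
min_eig = np.linalg.eigvalsh(H).min()

# Midpoint inequality for log-sum-exp: LSE((a+b)/2) <= (LSE(a)+LSE(b))/2.
def lse(a):
    m = a.max()                      # shift for numerical stability
    return m + np.log(np.exp(a - m).sum())

a, b = rng.normal(size=4), rng.normal(size=4)
midpoint_ok = lse(0.5 * (a + b)) <= 0.5 * lse(a) + 0.5 * lse(b)
```

Checking `min_eig >= lam` confirms why λ > 0 makes ridge regression strongly convex: the regularizer lifts every eigenvalue of XᵀX/n by λ.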

Section 4.3: Lipschitz gradients, smoothness, and step-size intuition

The gradient tells you “which way is downhill,” but smoothness tells you “how far you can step before the gradient becomes misleading.” A differentiable function f is L-smooth if its gradient is Lipschitz: ‖∇f(θ)−∇f(φ)‖ ≤ L‖θ−φ‖ for all θ,φ. Intuitively, L is a curvature upper bound: large L means the landscape can bend sharply, requiring smaller steps.

Why you care: for L-smooth convex functions, gradient descent with step size η ≤ 1/L guarantees the objective decreases each step. You rarely know L exactly for modern models, but you can still use the intuition: if your loss explodes or oscillates wildly, your effective η is too large for the curvature you’re hitting.

Line search turns this intuition into a robust procedure. In batch settings, implement backtracking line search: start with a candidate step η, propose θ′=θ−ηg, and accept if the objective decreases “enough.” A common rule is the Armijo condition: f(θ′) ≤ f(θ) − cη‖g‖² with c in (0,1), often 1e−4. If it fails, shrink η by a factor (e.g., 0.5) and retry. This makes gradient descent far more forgiving when you cannot pre-tune a learning rate.

Stability details matter in code. Always compute the objective and gradient in a numerically stable way (e.g., log-sum-exp for softmax, avoid subtracting nearly equal floats). If line search repeatedly rejects steps, inspect for NaNs/Inf, exploding gradients, or a mismatch between your objective and gradient implementations. A practical stopping rule pair is: stop when ‖g‖₂ ≤ tol·(1+‖θ‖₂) or when relative improvement |f_{k+1}−f_k|/max(1,|f_k|) is small for several iterations. This prevents quitting too early on noisy objectives or wasting time when progress is negligible.
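The backtracking procedure described above can be sketched as follows. This is a minimal batch-mode implementation; the function name and the 1e−12 step-collapse guard are our choices.

```python
import numpy as np

def backtracking_gd(f, grad, theta, eta0=1.0, c=1e-4, shrink=0.5,
                    max_iter=500, tol=1e-8):
    """Batch gradient descent with Armijo backtracking line search (sketch)."""
    for _ in range(max_iter):
        g = grad(theta)
        # stopping rule: gradient norm relative to parameter scale
        if np.linalg.norm(g) <= tol * (1 + np.linalg.norm(theta)):
            break
        f0 = f(theta)
        eta = eta0
        # Armijo condition: require f to decrease by at least c*eta*||g||^2
        while f(theta - eta * g) > f0 - c * eta * (g @ g):
            eta *= shrink
            if eta < 1e-12:          # step collapsed: suspect a gradient bug
                return theta
        theta = theta - eta * g
    return theta
```

On a mildly ill-conditioned quadratic this converges without any learning-rate tuning: the line search discovers a workable step size on its own.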

Section 4.4: Conditioning, curvature, and why training can be slow

Two objectives can be equally convex and smooth, yet one trains in seconds and the other crawls. The difference is often conditioning. In a quadratic approximation around a point, f(θ) ≈ f(θ*) + 1/2 (θ−θ*)ᵀH(θ−θ*), the Hessian H describes curvature. If H has eigenvalues λ₁≥…≥λ_d>0, then the condition number κ = λ₁/λ_d measures how stretched the landscape is—think “long narrow valley.” Large κ means some directions are steep (forcing small step sizes) while others are flat (progress is slow).

In least squares, H = XᵀX/n + λI, so conditioning is tied to feature scaling and collinearity. This is why standardization (zero mean, unit variance) and whitening can dramatically improve optimization even though they do not change the statistical model class. In deep learning, conditioning is influenced by initialization, normalization layers, and architecture; the same “valley” concept appears as anisotropic curvature.

Practical measurement: you usually cannot form H explicitly in large models, but you can still diagnose ill-conditioning. Symptoms include (1) training improves only with extremely small learning rates, (2) loss decreases but very slowly despite stable gradients, and (3) parameter updates appear to zig-zag. For smaller problems, estimate κ by computing eigenvalues of the Hessian or the empirical Fisher; for medium problems, use power iteration on Hessian-vector products to estimate the top eigenvalue and track gradient norms to infer flat directions.

Fixes are a mix of modeling and optimization choices. Modeling fixes: rescale inputs, add sensible regularization, or reparameterize (e.g., optimize log-variance instead of variance). Optimization fixes: use momentum (next section), adaptive learning rates (Adam/RMSProp), or second-order-ish preconditioning (e.g., diagonal scaling). Importantly, don’t confuse ill-conditioning with “bad data”; it’s often an avoidable numerical geometry issue.
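The power-iteration diagnostic mentioned above can be sketched without ever forming H. For ridge least squares the Hessian-vector product is just two matvecs, Hv = Xᵀ(Xv)/n + λv; the function name here is illustrative.

```python
import numpy as np

# Estimate the top Hessian eigenvalue via power iteration on
# Hessian-vector products, never materializing H.
def top_hessian_eigenvalue(hvp, dim, iters=300, seed=0):
    v = np.random.default_rng(seed).normal(size=dim)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = hvp(v)
        v = w / np.linalg.norm(w)        # repeatedly apply H and renormalize
    return float(v @ hvp(v))             # Rayleigh quotient at the converged vector

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
X[:, 0] *= 10.0                          # one badly scaled feature -> stretched valley
n, lam = X.shape[0], 1e-2
hvp = lambda v: X.T @ (X @ v) / n + lam * v

lam_max = top_hessian_eigenvalue(hvp, dim=8)
# On a small problem we can also form H explicitly to validate the estimate.
H = X.T @ X / n + lam * np.eye(8)
exact = float(np.linalg.eigvalsh(H).max())
```

In large models you would replace the explicit `hvp` above with an autodiff Hessian-vector product, but the power-iteration loop is unchanged.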

Section 4.5: Gradient descent variants: batch vs stochastic, minibatching

Batch gradient descent uses the full dataset gradient each step: g = (1/n)∑ᵢ ∇ℓᵢ(θ) + λ∇R(θ). It is stable and works well when n is small-to-medium or when you can afford full passes. With line search, batch GD can be surprisingly robust, making it a good baseline for convex objectives and for debugging your gradient implementation.

Stochastic gradient descent (SGD) replaces the full gradient with an unbiased estimate from one example (or a minibatch B): ĝ = (1/|B|)∑_{i∈B} ∇ℓᵢ(θ) + λ∇R(θ). This reduces per-step cost and often reaches good solutions faster in wall-clock time. The tradeoff is noise: objective values can fluctuate, and classical line search conditions may fail because f(θ) on a minibatch is not the true f.

Minibatching is the practical sweet spot. Use batches large enough for efficient vectorization (GPU-friendly) but small enough to keep updates frequent. As batch size increases, gradient noise decreases, so you can often increase the learning rate; as batch size decreases, you may need smaller η or momentum to stabilize. Common mistakes include forgetting to average gradients by batch size (learning rate becomes batch-size dependent) and mixing regularization scaling inconsistently (e.g., applying λ as if it were per-example but coding it as per-batch).

Engineering workflow for a robust optimizer module (your checkpoint for this chapter): implement a generic loop that supports (1) batch GD with optional backtracking line search, (2) SGD/minibatch with fixed η and schedule (step decay, cosine, or warmup+decay), (3) gradient clipping for safety, and (4) checkpointing best parameters by validation loss. Log: iteration, objective estimate, ‖ĝ‖, step size, and (optionally) parameter norm. These logs turn “it didn’t train” into actionable diagnosis.
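A minimal sketch of that checkpoint loop is below: minibatch SGD with batch-averaged gradients, optional global-norm clipping, and per-step logging. All names are illustrative, and schedules/checkpointing are omitted to keep it short.

```python
import numpy as np

def train_minibatch(grad_fn, theta, X, y, eta=0.1, batch_size=32,
                    epochs=20, clip=None, seed=0, log=None):
    """Minibatch SGD loop (sketch): grad_fn must return the batch-averaged gradient."""
    n = len(y)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        order = rng.permutation(n)                  # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            g = grad_fn(theta, X[idx], y[idx])      # averaged over the batch
            gnorm = float(np.linalg.norm(g))
            if clip is not None and gnorm > clip:   # safety net, not a tuning knob
                g = g * (clip / gnorm)
            theta = theta - eta * g
            if log is not None:
                log.append((gnorm, eta))            # turns failures into diagnosis
    return theta
```

Because `grad_fn` averages over the batch, the learning rate stays meaningful when you change the batch size, which avoids the common scaling mistake noted above.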

Section 4.6: Momentum and Nesterov acceleration (conceptual + practical)

Momentum addresses two realities: gradients can be noisy (SGD) and curvature can be ill-conditioned (narrow valleys). The core idea is to maintain a velocity v that accumulates gradients over time, producing updates that persist in consistent directions and damp oscillations in steep directions.

Classical momentum (Polyak): v_{k+1} = βv_k + ĝ_k, θ_{k+1} = θ_k − ηv_{k+1}, where β∈[0,1) is typically 0.9 to 0.99. If gradients keep pointing similarly, v grows and you move faster; if gradients alternate (zig-zag across a valley), the average cancels and oscillations reduce. Practically, momentum often lets you use a larger learning rate than plain SGD at the same stability.

Nesterov momentum looks ahead: compute the gradient at the anticipated next position, not the current one. One common form: v_{k+1} = βv_k + ĝ(θ_k − ηβv_k), then θ_{k+1} = θ_k − ηv_{k+1}. Conceptually, it is “momentum with correction,” often improving stability and convergence on smooth problems. In many deep-learning libraries, “Nesterov=True” implements a closely related variant; what matters is the behavior: fewer overshoots and better progress in curved regions.

Tuning guidance: treat η and β as coupled. If you increase β, you often should decrease η slightly to avoid overshooting, especially early in training. Add a learning-rate schedule (e.g., decay η over time) because momentum can keep you bouncing around a minimum if η is too large late in training. Common mistakes include (1) resetting velocity accidentally when loading checkpoints, causing training to change behavior, and (2) combining momentum with very small batches without monitoring gradient variance, leading to unstable velocity spikes. A simple safeguard is gradient clipping (by norm) before updating v, plus logging the velocity norm to catch runaway dynamics.
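The two updates above can be written as one step function. This is a sketch in the chapter's notation; the function name is ours.

```python
import numpy as np

def momentum_step(theta, v, grad_fn, eta, beta, nesterov=False):
    """One step of classical (Polyak) or Nesterov momentum (sketch)."""
    if nesterov:
        g = grad_fn(theta - eta * beta * v)   # evaluate at the look-ahead point
    else:
        g = grad_fn(theta)
    v = beta * v + g                          # velocity accumulates gradients
    return theta - eta * v, v
```

On an ill-conditioned quadratic (a "long narrow valley"), momentum with the same η reaches a given accuracy in far fewer steps than plain gradient descent, which is exactly the damping-plus-acceleration behavior described above.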

Chapter milestones
  • Assess convexity and pick an optimizer accordingly
  • Implement gradient descent with line search and stopping rules
  • Measure conditioning and understand its training impact
  • Use momentum to accelerate and smooth updates
  • Checkpoint: build a robust optimizer module for a toy objective
Chapter quiz

1. Why does the chapter emphasize assessing convexity before choosing an optimizer?

Show answer
Correct answer: Convexity determines what optimization guarantees you can rely on and helps guide algorithm choice
The chapter frames convexity as part of objective geometry that informs reliability/guarantees and which optimizer is appropriate.

2. In the chapter’s “optimization as a system” mindset, which combination best captures the main components you must consider together?

Show answer
Correct answer: Objective geometry (convexity/curvature), algorithm choice (GD/SGD/momentum), and implementation details (numerics/stopping/vectorization)
The summary explicitly treats optimization as a system: geometry, algorithm selection, and implementation details.

3. A model “only learns with a tiny learning rate” and otherwise diverges. According to the chapter, what is the most appropriate first debugging move?

Show answer
Correct answer: Map the symptom back to step size, conditioning, or noise and adjust using step-size reasoning or line search
The chapter suggests diagnosing failures by linking symptoms to step size, conditioning, and noise, then applying tools like line search and tuning.

4. What is the primary reason to measure conditioning when training a model?

Show answer
Correct answer: To predict slow directions in optimization and understand why progress can be uneven across parameters
Conditioning is presented as a way to anticipate slow directions and training difficulties tied to curvature.

5. What is the main role of momentum in first-order optimization as described in the chapter?

Show answer
Correct answer: Accelerate optimization while smoothing/damping noisy updates
Momentum is described as a practical add-on to accelerate progress and reduce the effect of noise in updates.

Chapter 5: Practical Optimizers—SGD, Adam, Schedules, and Stability

Optimization is where probability assumptions become working models. You can derive a negative log-likelihood or a MAP objective perfectly and still fail to train if step sizes, scaling, and numerics are off. This chapter turns “take gradients” into a repeatable engineering workflow: implement core optimizers from scratch, pick learning rates using measurable signals, apply stability tactics (clipping, weight decay, safeguards), and debug failures systematically. The goal is not to memorize tricks, but to understand why they work so you can adapt them to logistic regression, small MLPs, and beyond.

A practical mindset helps: (1) get a baseline with SGD or Adam that trains without NaNs; (2) measure gradient norms, loss smoothness, and sensitivity to batch size; (3) add schedules and regularization; (4) confirm improvements with ablations. Throughout, keep two reference problems handy for checkpointing: a logistic regression classifier and a small MLP. These are small enough to iterate quickly and rich enough to surface real issues like ill-conditioning and saturation.

Practice note for Implement SGD, RMSProp, and Adam from scratch in Python: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Choose learning rates and schedules using measurable signals: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Apply normalization and regularization tactics that help optimization: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Debug divergence and NaNs with a repeatable checklist: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Checkpoint: train logistic regression and a small MLP with tuned optimizers: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Sections in this chapter
Section 5.1: Stochastic optimization: noise, generalization, and minibatch dynamics
Section 5.2: Adaptive methods: RMSProp and Adam—derivation-level view
Section 5.3: Learning-rate schedules: warmup, cosine decay, step decay
Section 5.4: Gradient clipping, weight decay vs L2, and numerical safeguards
Section 5.5: Initialization, feature scaling, and normalization effects
Section 5.6: Optimization debugging playbook (loss curves, grads, activations)

Section 5.1: Stochastic optimization: noise, generalization, and minibatch dynamics

Stochastic gradient descent (SGD) replaces the full gradient ∇f(θ) with an unbiased estimate from a minibatch. That estimate is noisy: it points in the right direction on average, but its variance depends on batch size, data heterogeneity, and model state. This noise is not only a nuisance; it can help generalization by preventing convergence to overly sharp minima, especially in overparameterized networks. Practically, you treat minibatch size and learning rate as a coupled choice: larger batches reduce gradient variance and usually permit larger learning rates, but they also change the “temperature” of the optimization and may require schedules to regain generalization.

From an implementation standpoint, SGD is a few lines: θ_{k+1} = θ_k − η ĝ_k, where η is the learning rate and ĝ_k is the minibatch gradient. The important engineering detail is to keep the update fully vectorized and to separate “parameter storage” from “optimizer state.” A simple interface is: parameters are dicts of numpy arrays; gradients are dicts of same shape; the optimizer object updates parameters in-place each step.

  • Minibatch dynamics: If you double batch size and keep η fixed, training often becomes more stable but slower in terms of epochs; if you scale η up too aggressively, you can jump into divergence. Use gradient norm statistics to guide scaling rather than rules-of-thumb alone.
  • Momentum as “noise filtering”: Momentum accumulates a velocity v_{k+1} = βv_k + ĝ_k, smoothing stochastic gradients and improving conditioning along shallow directions. It is typically the first upgrade when plain SGD plateaus.
  • Practical outcome: You should be able to implement SGD (with and without momentum) from scratch and observe how batch size changes loss curve smoothness and stability on logistic regression.

Common mistake: comparing optimizers without controlling for the number of parameter updates. If one run uses bigger batches, it performs fewer updates per epoch; your curves can look misleading. Track both “steps” and “epochs,” and log loss vs steps when comparing optimizers.
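The dict-of-arrays interface described above can be sketched as a small optimizer class. This is an illustrative design, not a fixed API; the point is the separation of parameter storage from optimizer state (here, the velocity).

```python
import numpy as np

class SGD:
    """SGD with optional momentum over dict-of-array parameters (sketch)."""
    def __init__(self, lr=0.1, momentum=0.0):
        self.lr, self.momentum = lr, momentum
        self.velocity = {}                      # optimizer state, kept separate

    def step(self, params, grads):
        for name, g in grads.items():
            v = self.velocity.get(name, np.zeros_like(g))
            v = self.momentum * v + g
            self.velocity[name] = v
            params[name] -= self.lr * v         # in-place, fully vectorized
```

Because the velocity lives on the optimizer object keyed by parameter name, checkpointing or swapping optimizers does not disturb the model's parameter arrays.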

Section 5.2: Adaptive methods: RMSProp and Adam—derivation-level view

Adaptive methods change the effective learning rate per parameter based on recent gradient magnitudes. RMSProp maintains an exponential moving average (EMA) of squared gradients, v_k = βv_{k−1} + (1−β)g_k², then scales the update by 1/(√v_k + ε). Intuitively, coordinates with consistently large gradients get smaller steps, which helps when features are poorly scaled or curvature differs across dimensions. This connects to conditioning: you are approximating a diagonal preconditioner without computing Hessians.

Adam adds momentum (an EMA of gradients) and corrects the bias introduced by initializing EMAs at zero. It keeps m_k = β₁m_{k−1} + (1−β₁)g_k and v_k = β₂v_{k−1} + (1−β₂)g_k², then forms bias-corrected estimates m̂_k = m_k/(1−β₁ᵏ) and v̂_k = v_k/(1−β₂ᵏ). Update: θ_{k+1} = θ_k − η·m̂_k/(√v̂_k + ε).

When you implement RMSProp and Adam from scratch, two details matter for stability: (1) always add ε to the denominator to avoid division by zero; (2) store optimizer state (m, v, timestep) per parameter array and keep dtype consistent (float32 vs float64). Adam can “feel” robust because it rescales, but it can still diverge with a too-high base learning rate or bad initialization, and it can overfit if you do not pair it with appropriate regularization.

  • Measurable signal: track per-layer RMS of gradients and the per-layer update-to-weight ratio ‖Δθ‖/‖θ‖. If updates are consistently larger than weights, reduce η or add clipping/decay.
  • Common mistake: forgetting bias correction in Adam (or applying it incorrectly). Early steps then become too small, giving an artificial “warmup” that later breaks when you change schedules.
  • Practical outcome: you should be able to train an MLP that fails with plain SGD at a naive learning rate, then succeeds with Adam at a controlled η and good safeguards.
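The Adam update above maps directly to code. This sketch uses the dict-of-arrays convention from Section 5.1; the class name and layout are illustrative. RMSProp is the same with the m/bias-correction machinery removed.

```python
import numpy as np

class Adam:
    """Adam with bias correction over dict-of-array parameters (sketch)."""
    def __init__(self, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        self.lr, self.b1, self.b2, self.eps = lr, beta1, beta2, eps
        self.m, self.v, self.t = {}, {}, 0      # EMAs and timestep (optimizer state)

    def step(self, params, grads):
        self.t += 1
        for name, g in grads.items():
            m = self.m.get(name, np.zeros_like(g))
            v = self.v.get(name, np.zeros_like(g))
            m = self.b1 * m + (1 - self.b1) * g          # EMA of gradients
            v = self.b2 * v + (1 - self.b2) * g * g      # EMA of squared gradients
            self.m[name], self.v[name] = m, v
            m_hat = m / (1 - self.b1 ** self.t)          # bias correction
            v_hat = v / (1 - self.b2 ** self.t)
            params[name] -= self.lr * m_hat / (np.sqrt(v_hat) + self.eps)
```

Note where ε sits: in the denominator after the square root, so the very first step (when v̂ could otherwise be tiny) stays bounded by roughly the learning rate.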

Section 5.3: Learning-rate schedules: warmup, cosine decay, step decay

A fixed learning rate is rarely optimal from start to finish. Early in training, gradients can be poorly calibrated due to random initialization and changing activation statistics; later, you want smaller steps to refine the solution without bouncing around. Learning-rate schedules encode this intuition explicitly. The key is to select schedules using signals you can measure: loss curvature (how noisy the loss is step-to-step), gradient norm trends, and whether validation loss lags training loss due to underfitting or overfitting.

Warmup gradually increases the learning rate from a small value to your target over the first T_w steps. This reduces the risk of immediate divergence when using large batches, normalization layers, or Adam with aggressive settings. A linear warmup is easy: η(t) = η_max·t/T_w for t ≤ T_w, then hold or decay.

Cosine decay reduces the learning rate smoothly: η(t) = η_min + 0.5(η_max − η_min)(1 + cos(πt/T)), where T is the total number of decay steps. It is popular because it avoids abrupt changes that can destabilize adaptive optimizers and it often yields good final convergence without manual “drop points.” Step decay (drop η by a factor at set epochs) remains useful when you can identify plateau regions in the loss curve and want a simple, interpretable rule.

  • Workflow: find a stable base learning rate first (no schedule) using a short run; then add warmup if the first few hundred steps are unstable; then apply cosine or step decay to improve final validation performance.
  • Common mistake: changing both optimizer and schedule at once. If performance improves, you will not know whether it was Adam vs cosine, or just a smaller effective step size.
  • Practical outcome: implement schedules as a pure function lr(step), log lr alongside loss, and confirm that learning-rate changes correlate with changes in loss slope and gradient norms.

Engineering judgement: if your training is stable and underfitting, try a higher peak η with warmup; if it is overfitting, decay schedules and weight decay will matter more than the peak η.
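Implemented as pure functions of the step counter, the schedules above are trivial to log and test. The function names and the warmup-then-cosine composition below are one reasonable sketch, not a standard API.

```python
import math

def warmup_cosine(step, total_steps, warmup_steps, eta_max, eta_min=0.0):
    """Linear warmup to eta_max, then cosine decay toward eta_min (sketch)."""
    if step < warmup_steps:
        return eta_max * (step + 1) / warmup_steps           # linear warmup
    t = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t))

def step_decay(step, eta0, drop=0.1, every=1000):
    """Multiply the base rate by `drop` every `every` steps."""
    return eta0 * drop ** (step // every)
```

Because each schedule is a pure function lr(step), you can plot or unit-test it in isolation and log its value next to the loss, as the workflow above recommends.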

Section 5.4: Gradient clipping, weight decay vs L2, and numerical safeguards

Stability issues often appear as exploding gradients, sudden loss spikes, or NaNs. Gradient clipping is a direct control knob: it caps the size of the gradient before applying the optimizer update. The most common form is global norm clipping: if ‖g‖ > c, rescale all gradients by c/‖g‖. This preserves direction while limiting step magnitude. It is especially helpful in RNNs, deep MLPs with poor initialization, and any setting with rare but huge gradients (outliers, heavy-tailed data).

Regularization interacts with optimization. “L2 regularization” typically means adding (λ/2)‖θ‖² to the objective, which adds λθ to the gradient. “Weight decay” means directly shrinking parameters each step: θ ← θ(1−ηλ) before or alongside the gradient update. For plain SGD they are equivalent (up to learning-rate scaling), but for Adam they differ meaningfully because Adam rescales gradients coordinate-wise. Decoupled weight decay (AdamW-style) is often preferred: keep the adaptive gradient update, and apply decay as a separate multiplicative shrinkage so it behaves consistently across parameters.

  • Numerical safeguards: add epsilons in denominators; clamp probabilities away from 0/1 before log; use stable softmax/log-sum-exp; check for inf/NaN after forward and backward passes.
  • Logging for stability: record global grad norm, max absolute parameter value, and number of NaNs in activations each N steps. These three signals often pinpoint the first failure.
  • Practical outcome: when an MLP run diverges, you can add global norm clipping (e.g., 1.0), switch to decoupled weight decay, and verify that loss spikes disappear without silently freezing learning.

Common mistake: using clipping to “fix” a learning rate that is far too high. Clipping should be a seatbelt, not the engine. If clipping activates almost every step (grad norm always above threshold), reduce the learning rate or fix scaling/initialization.
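Both safeguards above are only a few lines over the dict-of-arrays convention; this sketch uses illustrative function names.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Global-norm clipping: if ||g|| > c, rescale every gradient by c/||g||."""
    total = np.sqrt(sum(float(np.sum(g * g)) for g in grads.values()))
    if total > max_norm:
        scale = max_norm / total                 # preserves direction, caps magnitude
        grads = {k: g * scale for k, g in grads.items()}
    return grads, total                          # return the norm so you can log it

def decoupled_weight_decay(params, lr, wd):
    """AdamW-style decay: shrink weights directly, separate from the gradient step."""
    for k in params:
        params[k] *= (1.0 - lr * wd)
```

Returning the pre-clip norm lets you log how often clipping fires; if it activates almost every step, that is the “seatbelt as engine” smell described above.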

Section 5.5: Initialization, feature scaling, and normalization effects

Many “optimizer problems” are really parameterization problems. If input features have wildly different scales, the loss surface becomes ill-conditioned: some directions require tiny steps while others could take large steps, so a single learning rate struggles. For logistic regression, the cure is straightforward: standardize features (zero mean, unit variance) and include an intercept term. You should see faster convergence, smoother loss curves, and less sensitivity to learning rate. This is a concrete checkpoint: if your from-scratch logistic regression only trains with an extremely small η, suspect feature scaling first.

In neural networks, initialization and normalization shape gradient flow. Proper variance-preserving initialization (e.g., He for ReLU-like activations, Xavier for tanh-ish activations) keeps activations from shrinking to zero or blowing up layer by layer. If early activations saturate (e.g., sigmoid/tanh near 1), gradients vanish and training plateaus; if activations explode, gradients can explode too. Batch normalization or layer normalization can stabilize training by controlling activation statistics, effectively smoothing optimization and allowing larger learning rates. But normalization is not free: it changes the effective objective and may require warmup and different weight decay settings.

  • Engineering workflow: before tuning optimizers, validate: inputs are scaled; initialization matches activation; biases start near zero; loss implementation is numerically stable.
  • Signal-based tuning: inspect activation histograms per layer. If most ReLU activations are zero, reduce weight decay, adjust initialization, or increase learning rate cautiously; if activations are huge, lower learning rate or add normalization.
  • Practical outcome: a small MLP that previously plateaued should begin decreasing loss steadily after correcting scaling/initialization, even with the same optimizer.
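The variance-preservation claim is easy to check empirically. A sketch, assuming a plain NumPy forward pass (helper names are ours): under He initialization, the second moment of ReLU activations stays roughly constant across layers instead of decaying or exploding.

```python
import numpy as np

def he_init(fan_in, fan_out, rng):
    """He/Kaiming: variance 2/fan_in, suited to ReLU-like activations."""
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

def xavier_init(fan_in, fan_out, rng):
    """Xavier/Glorot: variance 2/(fan_in + fan_out), suited to tanh-ish activations."""
    return rng.normal(0.0, np.sqrt(2.0 / (fan_in + fan_out)), size=(fan_in, fan_out))

# Push a batch through 5 ReLU layers with He init; the activation second
# moment should hover near its input value rather than vanish or blow up.
rng = np.random.default_rng(0)
x = rng.normal(size=(1024, 256))
for _ in range(5):
    x = np.maximum(0.0, x @ he_init(256, 256, rng))
m2 = float((x ** 2).mean())
```

Repeating the experiment with a much smaller initialization scale shows the second moment collapsing geometrically — the "plateau" failure mode described above.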

Common mistake: adding batch norm and concluding “Adam fixed it.” Often the improvement is primarily from better-conditioned optimization due to normalization; the optimizer choice becomes secondary once gradients behave.

Section 5.6: Optimization debugging playbook (loss curves, grads, activations)

When training fails, resist random hyperparameter changes. Use a playbook that localizes the failure mode. Start with the smallest reproducible run: a fixed seed, a tiny dataset subset, and frequent logging. Your first question is whether the model can overfit a small batch (e.g., 128 examples). If it cannot drive training loss near zero, you likely have an implementation bug, a data/label mismatch, or a severe optimization barrier (saturation, wrong scaling). If it can overfit the small batch but not the full dataset, the issue is usually schedule/regularization/generalization, not basic correctness.

Read loss curves diagnostically. Immediate divergence (loss → inf or NaN in the first steps) suggests too-high learning rate, unstable numerics (log of zero, softmax overflow), or exploding activations. Slow monotone decrease with early plateau suggests too-low learning rate or vanishing gradients. Noisy loss with occasional spikes suggests borderline learning rate, outlier batches, or missing gradient clipping. Validation loss improving then degrading indicates overfitting: consider stronger weight decay, earlier decay schedule, or data augmentation rather than changing optimizers.

  • Checklist: (1) verify forward-pass ranges (logits, probabilities); (2) verify loss is finite; (3) finite-difference check a few gradients; (4) log global grad norm and per-layer grad RMS; (5) log activation means/variances; (6) test a smaller learning rate, add warmup, add clipping; (7) confirm weight decay is applied as intended (decoupled vs L2).
  • NaN response plan: stop on first NaN, print which tensor first became non-finite, and dump the minibatch inputs/labels. Many NaNs are data issues (unexpected categories, all-zero features, extreme values) masquerading as optimizer issues.
  • Checkpoint exercise (practical): train logistic regression and a small MLP with your from-scratch SGD, RMSProp, and Adam; choose learning rates by observing grad norms and update/weight ratios; then add warmup + cosine decay and verify improved final validation loss without instability.
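Item (3) of the checklist — finite-difference gradient checking — can be sketched as follows (helper names are ours): compare an analytic gradient against central differences at a few random coordinates, and flag the worst relative error.

```python
import numpy as np

def grad_check(f, grad_f, w, eps=1e-6, n_checks=5, seed=0):
    """Max relative error between analytic gradient and central differences."""
    rng = np.random.default_rng(seed)
    g = grad_f(w)
    max_rel_err = 0.0
    for _ in range(n_checks):
        i = rng.integers(w.size)
        e = np.zeros_like(w)
        e[i] = eps
        num = (f(w + e) - f(w - e)) / (2 * eps)   # central difference
        rel = abs(num - g[i]) / max(1e-12, abs(num) + abs(g[i]))
        max_rel_err = max(max_rel_err, rel)
    return max_rel_err

# Example: f(w) = ||w||^2 has gradient 2w, so the check should pass easily.
w = np.array([1.0, -2.0, 3.0])
err = grad_check(lambda w: float(w @ w), lambda w: 2 * w, w)
```

A relative error around 1e-7 or better is typical for a correct float64 gradient; errors near 1e-2 almost always mean a bug, not numerical noise.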

A final engineering habit: checkpoint parameters and optimizer state. If a run diverges at step 12,000, you want to resume from step 11,500 with a smaller learning rate or tighter clipping, not restart from scratch. Optimizer state (momentum buffers, Adam moments) is part of the model; saving only weights can make resumed training behave unpredictably.

Chapter milestones
  • Implement SGD, RMSProp, and Adam from scratch in Python
  • Choose learning rates and schedules using measurable signals
  • Apply normalization and regularization tactics that help optimization
  • Debug divergence and NaNs with a repeatable checklist
  • Checkpoint: train logistic regression and a small MLP with tuned optimizers
Chapter quiz

1. According to the chapter, why can a perfectly derived NLL or MAP objective still fail to train in practice?

Show answer
Correct answer: Because step sizes, scaling, and numerical stability can be wrong even if the math is correct
The chapter emphasizes that optimization can fail due to learning-rate choices, poor scaling, and numerical issues (e.g., NaNs), despite correct derivations.

2. Which sequence best matches the chapter’s recommended practical workflow for building a stable training setup?

Show answer
Correct answer: Start with SGD/Adam that trains without NaNs → measure gradient norms/loss smoothness/batch-size sensitivity → add schedules and regularization → confirm with ablations
The chapter outlines a repeatable engineering workflow that begins with a stable baseline, then uses measurable signals and ablations to justify changes.

3. What is the main reason the chapter recommends choosing learning rates and schedules using measurable signals?

Show answer
Correct answer: It makes hyperparameter choices reproducible and grounded in observed training behavior
Learning-rate and schedule decisions should be guided by observed signals (e.g., gradient norms, loss behavior), not guesswork.

4. Which set of tactics is presented as helping with optimization stability in this chapter?

Show answer
Correct answer: Clipping, weight decay, and numerical safeguards
The chapter explicitly cites stability tactics such as clipping, weight decay, and safeguards to prevent divergence/NaNs.

5. Why does the chapter suggest keeping a logistic regression model and a small MLP as reference checkpoint problems?

Show answer
Correct answer: They are small enough to iterate quickly yet expose real optimization issues like ill-conditioning and saturation
These reference problems make debugging and iteration fast while still surfacing practical issues that matter in larger models.

Chapter 6: Constrained & Probabilistic Optimization—KKT, EM, and Variational Ideas

Many ML objectives are “just” unconstrained minimizations of a loss, but real models frequently come with constraints (probabilities must sum to 1, variances must be positive, fairness or budget constraints must hold) and with probabilistic structure (latent variables, priors, approximate posteriors). This chapter connects these worlds by treating constrained problems and probabilistic inference as optimization problems with carefully chosen objectives.

You will see three recurring patterns. First, constraints can be handled explicitly (projections), softly (penalties/barriers), or exactly (Lagrangians and KKT). Second, latent-variable learning often becomes coordinate ascent on a bound (EM). Third, approximate Bayesian inference is “optimize an evidence lower bound (ELBO)” with either analytic updates (mean-field) or gradient estimators (reparameterization). Throughout, the engineering focus is: translate assumptions into an objective, derive stable gradients/updates, and choose an optimizer that matches the objective’s geometry and noise.

  • Outcome: Given probability assumptions, you should be able to write down the objective (MLE/MAP/ELBO), derive the key updates/gradients, implement them in vectorized code, and diagnose when optimization fails due to constraints, ill-conditioning, or variance in gradient estimates.

We will close with a capstone workflow that looks like real work: define a probabilistic model, pick MLE vs MAP vs VI, implement an optimizer, and evaluate both optimization behavior and predictive performance.

Practice note for Solve constrained ML problems using Lagrangians and KKT conditions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Derive and implement EM for a simple latent-variable model: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Understand ELBO and implement a minimal variational inference loop: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Compare MLE/MAP/VI in terms of objectives and behavior: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Practice note for Capstone: end-to-end probabilistic model with an optimizer you implement: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.

Section 6.1: Constrained optimization in ML: projections and penalties

Constraints appear everywhere in ML, often disguised as “valid parameterizations.” Examples: mixture weights must lie on the simplex, covariance matrices must be positive semidefinite, probabilities must be in (0, 1), and some models impose resource constraints (sparsity budgets, monotonicity, fairness constraints). You can handle these constraints in three practical ways: (1) project after an update, (2) add penalties or barriers to the objective, or (3) reparameterize to remove the constraint.

Projected gradient is the simplest pattern: take an unconstrained step and then project back to the feasible set. For a simplex constraint (w ≥ 0, sum w = 1), you can use an efficient Euclidean projection algorithm. The engineering judgment is: projections are appealing when the projection is cheap and stable, and when you want hard feasibility at every step (e.g., keeping probabilities normalized avoids downstream NaNs).
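One standard efficient Euclidean projection onto the simplex is the sort-based algorithm; a sketch in NumPy (the function name is ours), used here inside a projected gradient step:

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}.

    Sort-based algorithm: find the threshold theta so that
    max(v - theta, 0) sums to 1, then clip.
    """
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

# Projected gradient step: unconstrained update, then project back.
w = np.array([0.5, 0.3, 0.2])
grad = np.array([1.0, -2.0, 0.5])
w = project_to_simplex(w - 0.1 * grad)
```

Because the projection costs only a sort, hard feasibility at every step is cheap here — the "keep probabilities normalized to avoid downstream NaNs" case described above.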

Penalty methods convert a constrained problem into unconstrained optimization by adding a term like λ·violation. For equality constraints c(x) = 0, a squared penalty λ‖c(x)‖² is common; for inequalities g(x) ≤ 0, hinge-like penalties max(0, g(x))² work. Penalties are easy to code but require tuning λ: too small and constraints are ignored; too large and the problem becomes ill-conditioned, causing tiny steps or divergence in SGD/Adam.

Barrier methods (e.g., −μ ∑_i log(−g_i(x))) enforce inequalities by making the objective blow up at the boundary. Barriers keep iterates strictly feasible, but they can be numerically delicate near the boundary; you must use stable evaluations (e.g., clamp inputs to log) and careful step sizes.

Reparameterization is often the most robust approach: enforce constraints by construction. Use softmax for simplex weights, exp/softplus for positive parameters, and Cholesky factors for PSD matrices. The tradeoff is that reparameterization changes the geometry of the optimization: gradients can saturate (softmax under extreme logits), so initialization and learning rates matter.
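The three reparameterizations named above can be sketched as follows (helper names are ours; the Cholesky construction uses a softplus on the diagonal so the factor stays valid):

```python
import numpy as np

def softplus(x):
    """Smooth map to positive values; logaddexp form avoids overflow."""
    return np.logaddexp(0.0, x)

def softmax(logits):
    """Simplex weights by construction; subtract the max for stability."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def psd_from_chol(L_params, d):
    """Build a PSD matrix from unconstrained params via a Cholesky factor."""
    L = np.zeros((d, d))
    L[np.tril_indices(d)] = L_params
    L[np.diag_indices(d)] = softplus(np.diag(L))  # positive diagonal
    return L @ L.T

theta = np.array([2.0, -1.0, 0.5])
w = softmax(theta)                   # on the simplex by construction
sigma = softplus(np.array(-3.0))     # always > 0, even for very negative input
S = psd_from_chol(np.array([0.3, -0.2, 0.1]), 2)
```

Each construction trades a constraint for changed geometry: e.g., extreme logits saturate the softmax and shrink its gradients, which is exactly the initialization/learning-rate caveat above.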

  • Common mistake: mixing constraints and unconstrained updates inconsistently—e.g., updating mixture weights with SGD but forgetting to renormalize, leading to negative “probabilities” and invalid likelihoods.
  • Practical outcome: pick the least fragile strategy: reparameterize when you can, project when projection is cheap, and use penalties/barriers when constraints are complex but you can tolerate tuning.
Section 6.2: Lagrange multipliers and KKT conditions (worked ML examples)

Lagrange multipliers and KKT conditions are the language of “optimality with constraints.” Even when you ultimately implement a projected or reparameterized solver, KKT gives you a correctness check and often yields closed-form updates. For a problem minimize f(x) subject to equality constraints h(x) = 0 and inequality constraints g(x) ≤ 0, define the Lagrangian L(x, λ, μ) = f(x) + λᵀh(x) + μᵀg(x), with μ ≥ 0. KKT conditions combine: stationarity (∇_x L = 0), primal feasibility (constraints hold), dual feasibility (μ ≥ 0), and complementary slackness (μ_i g_i(x) = 0).

Worked example 1: maximum entropy / simplex-constrained probabilities. Suppose you want probabilities p over K classes that maximize entropy subject to matching an expected feature: minimize f(p) = ∑_k p_k log p_k subject to ∑_k p_k = 1, p ≥ 0, and ∑_k p_k a_k = b. The Lagrangian yields log p_k = -1 - λ - ν a_k, so p is a softmax over -ν a. This is an exponential-family identity in disguise: constraints on expectations produce log-linear models. Practically, KKT explains why softmax parameterizations are natural and helps you verify your code’s fixed point.

Worked example 2: SVM hinge loss and margin constraints. The primal soft-margin SVM is minimize (1/2)‖w‖² + C ∑_i ξ_i subject to y_i(w·x_i + b) ≥ 1 - ξ_i, ξ_i ≥ 0. KKT conditions explain sparsity: only points with active constraints become support vectors (nonzero multipliers). Even if you train with modern optimizers on a hinge-like objective, KKT gives the interpretability: when constraints are inactive, their multipliers are zero, so they do not affect the solution.

Engineering use: KKT is also a debugging tool. If you implement a constrained solver, verify complementary slackness numerically: when an inequality is not tight, its multiplier should be ~0. Large violations often mean your step size is too high or your projection is wrong.
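As a concrete instance of KKT-as-a-test, the maximum-entropy example above has a known fixed point: p is a softmax over −ν a, and the stationarity residual log p_k + 1 + λ + ν a_k should vanish once λ is recovered from the normalizer. A sketch (the values of a and ν here are arbitrary illustrations):

```python
import numpy as np

# Max-entropy worked example: p_k ∝ exp(-nu * a_k) should satisfy the KKT
# stationarity condition log p_k + 1 + lambda + nu * a_k = 0 for every k,
# with lambda determined by the normalization constraint sum(p) = 1.
a = np.array([0.5, -1.0, 2.0, 0.0])
nu = 0.7

logits = -nu * a
logZ = np.log(np.sum(np.exp(logits - logits.max()))) + logits.max()
p = np.exp(logits - logZ)            # softmax over -nu * a
lam = logZ - 1.0                     # multiplier for the sum-to-one constraint

# Stationarity residual: should be ~0 at the optimum for every component.
stationarity = np.log(p) + 1.0 + lam + nu * a
```

The same pattern generalizes: after running any constrained solver, evaluate the KKT residuals numerically; residuals far from zero localize the bug (wrong projection, wrong multiplier sign, step size too large).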

  • Common mistake: forgetting that inequality multipliers must be nonnegative; allowing negative μ in code breaks the interpretation and can destabilize primal-dual updates.
  • Practical outcome: you can derive closed-form constrained solutions when they exist, or at least know what “optimal” must satisfy, which guides implementation and tests.
Section 6.3: Expectation-Maximization as coordinate ascent on a bound

EM is best understood as optimization when your likelihood involves latent variables z. Directly maximizing log p(x|θ) is hard because log ∑_z p(x,z|θ) has a log-sum that couples parameters. EM introduces a distribution q(z) and uses Jensen’s inequality to form a lower bound: log p(x|θ) ≥ E_q[log p(x,z|θ)] - E_q[log q(z)] = ELBO(q, θ). EM alternates maximizing this bound in two coordinate steps.

E-step: set q(z) to the exact posterior under current parameters: q(z) = p(z|x, θ_old). This maximizes the ELBO with respect to q and makes the bound tight at θ_old.

M-step: maximize Eq[log p(x,z|θ)] with respect to θ (the entropy term does not depend on θ). Many models yield closed-form updates; otherwise you can do a few gradient steps (generalized EM). This is a clean example of “optimize a surrogate objective that is easier than the original.”

Implementation workflow: (1) write the complete-data log-likelihood log p(x,z|θ), (2) compute responsibilities or posterior expectations under q, (3) maximize expected complete-data log-likelihood. Numerically, you must compute posterior probabilities stably using log-sum-exp; avoid exponentiating raw logits when dimensions or distances get large.
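The log-sum-exp step can be sketched as follows (the function name is ours); note that naive normalization of these log-probabilities would underflow to all zeros:

```python
import numpy as np

def logsumexp(a, axis=None, keepdims=False):
    """log(sum(exp(a))) with the max subtracted to avoid overflow/underflow."""
    m = np.max(a, axis=axis, keepdims=True)
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return out if keepdims else np.squeeze(out, axis=axis)

# Responsibilities from large-magnitude log joints. Naive normalization
# (np.exp then divide by the sum) underflows: np.exp(-1000.0) == 0.0.
# The log-space route subtracts the max first and stays exact.
log_joint = np.array([[-1000.0, -1001.0, -1005.0]])
log_r = log_joint - logsumexp(log_joint, axis=1, keepdims=True)
r = np.exp(log_r)
```

The responsibilities depend only on differences of log-probabilities, so subtracting the row max changes nothing mathematically while keeping every exponent in a safe range.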

Convergence behavior: EM monotonically increases the data log-likelihood (for exact E and M steps), but can converge slowly near optima because it behaves like a second-order method with a particular preconditioning. It is also sensitive to initialization and can get stuck in poor local maxima—this is not a bug; it reflects non-convexity in latent-variable models.

  • Common mistake: computing responsibilities with naive normalization (divide by sum of exponentials) without subtracting the max; this often underflows to zeros and produces degenerate updates.
  • Practical outcome: you can turn an intractable marginal likelihood into alternating updates that are simple, testable, and often faster than generic SGD for small/medium latent models.
Section 6.4: Latent-variable models: mixture models and responsibility updates

A concrete place to practice EM is a mixture model. Consider a Gaussian mixture model with K components: z_n ~ Categorical(π), x_n | z_n = k ~ N(μ_k, Σ_k). The latent indicator z chooses a component; learning alternates between inferring soft assignments and updating component parameters.

E-step responsibilities: r_nk = p(z_n = k | x_n, θ) ∝ π_k N(x_n | μ_k, Σ_k). Implement in log space: log r_nk = log π_k + log N(x_n | μ_k, Σ_k) - logsumexp over k. Use vectorization: compute an (N, K) matrix of log-probs, apply logsumexp across K, then exponentiate to get r.

M-step updates (for full EM): N_k = ∑_n r_nk, π_k = N_k / N, μ_k = (1/N_k) ∑_n r_nk x_n, and Σ_k = (1/N_k) ∑_n r_nk (x_n - μ_k)(x_n - μ_k)ᵀ. In code, add a small diagonal jitter (e.g., 1e-6·I) to Σ_k to prevent singular matrices.
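Putting the E- and M-steps together, a compact vectorized EM loop for a full-covariance GMM might look like this (a sketch, not a production implementation; function names and the simple random-point initialization are ours):

```python
import numpy as np

def log_gauss(X, mu, Sigma):
    """Row-wise log N(x | mu, Sigma) for a full covariance matrix."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    sol = np.linalg.solve(Sigma, diff.T).T          # Sigma^{-1} (x - mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + np.sum(diff * sol, axis=1))

def em_gmm(X, K, iters=50, jitter=1e-6, seed=0):
    """EM for a K-component full-covariance GMM; returns params + LL trace."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)].copy()
    Sigma = np.stack([np.cov(X.T) + jitter * np.eye(d) for _ in range(K)])
    ll_trace = []
    for _ in range(iters):
        # E-step in log space: (N, K) matrix of log pi_k + log N(x_n | mu_k, Sigma_k)
        logp = np.stack([np.log(pi[k]) + log_gauss(X, mu[k], Sigma[k])
                         for k in range(K)], axis=1)
        m = logp.max(axis=1, keepdims=True)
        lse = m + np.log(np.exp(logp - m).sum(axis=1, keepdims=True))
        r = np.exp(logp - lse)                      # responsibilities r_nk
        ll_trace.append(float(lse.sum()))           # data log-likelihood
        # M-step: weighted counts, means, covariances (+ diagonal jitter)
        Nk = r.sum(axis=0)
        pi = Nk / N
        mu = (r.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (r[:, k, None] * diff).T @ diff / Nk[k] + jitter * np.eye(d)
    return pi, mu, Sigma, ll_trace

# Demo: two well-separated 2D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(8.0, 1.0, size=(200, 2))])
pi, mu, Sigma, ll = em_gmm(X, 2, iters=30)
```

For exact EM the log-likelihood trace should be non-decreasing (up to the tiny perturbation from the jitter); a genuine drop is the single most useful bug signal, usually pointing at the responsibility normalization.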

MAP variants: If you add priors (Dirichlet on π, Normal-Inverse-Wishart on Gaussian parameters), the M-step becomes MAP rather than MLE, effectively adding pseudo-counts and regularization. This is often worth it in practice because pure MLE GMMs can collapse a component onto a single point (likelihood goes to infinity as variance goes to zero). A weak prior or variance floor prevents this degeneracy.

Diagnostics: Track log-likelihood each iteration; it should not decrease for exact EM. Watch for components with Nk near zero (dead components), exploding responsibilities (numerical issues), or singular covariances. These are not just “bugs”; they are signals to adjust initialization (k-means), add priors, or constrain covariance structure (diagonal/shared).

  • Common mistake: updating π by normalizing raw counts but forgetting to ensure π stays away from 0 (log π=-∞ breaks E-step). Clamp or use a Dirichlet prior.
  • Practical outcome: you can derive EM updates and implement a stable, vectorized mixture model that serves as a template for more complex latent-variable models.
Section 6.5: Variational inference: ELBO, mean-field, and reparameterization idea

When exact posteriors are intractable, variational inference (VI) turns inference into optimization: choose a family q(z; φ) and maximize the ELBO, ELBO(φ, θ) = E_q[log p(x,z|θ)] - E_q[log q(z; φ)], which is equivalent to minimizing KL(q(z; φ) ‖ p(z|x,θ)). This is the same bound used in EM, but now q is restricted, so the bound will generally not be tight.

Mean-field is the workhorse approximation: factorize q(z) = ∏_i q_i(z_i). Coordinate ascent VI (CAVI) yields updates of the form log q_i(z_i) ∝ E_{q_{-i}}[log p(x, z)], which often becomes an exponential-family update if the model is conditionally conjugate. Practically, this gives EM-like alternating updates, but on an approximate posterior rather than the exact one.

Black-box VI uses gradients of the ELBO with respect to φ. Two key estimators appear in practice. (1) The score-function (REINFORCE) estimator works broadly but has high variance. (2) The reparameterization idea reduces variance for continuous latents: if z = g(ε, φ) with ε from a fixed distribution (e.g., z = μ + σε, ε~N(0,1)), then Eq[f(z)] gradients can move inside the expectation and be estimated with low-variance Monte Carlo.

Minimal VI loop: initialize φ (and possibly θ), then iterate: sample ε, form z=g(ε,φ), compute a Monte Carlo estimate of ELBO (or its negative as a loss), backprop to get gradients, and update with Adam/SGD. Use stable parameterizations: optimize log σ or softplus to keep σ>0, and compute log-probabilities in a numerically stable way. A practical baseline is to start with 1 sample per step and increase if gradients are too noisy.
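A minimal version of this loop, with hand-derived gradients instead of autodiff, can be sketched for a toy target p(z) = N(2, 0.5²). The Gaussian family contains this posterior exactly, so the ELBO optimum should recover μ ≈ 2, σ ≈ 0.5 — a built-in correctness check. All specifics here (the target, step count, learning rate, sample count) are illustrative.

```python
import numpy as np

# Toy target: p(z) = N(2, 0.5^2). Fit q(z) = N(mu, sigma^2) by gradient
# ascent on ELBO = E_q[log p(z)] + H(q), using the reparameterization
# z = mu + sigma * eps. Hand-derived pieces:
#   d log p / dz = -(z - 2) / 0.25        (score of the target)
#   H(q) = rho + const with sigma = e^rho, so dH/d rho = 1
rng = np.random.default_rng(0)
mu, rho = 0.0, 0.0                  # rho = log sigma keeps sigma > 0
lr, n_samples = 0.02, 64

for step in range(3000):
    eps = rng.standard_normal(n_samples)
    sigma = np.exp(rho)
    z = mu + sigma * eps                    # reparameterized samples
    dlogp_dz = -(z - 2.0) / 0.25
    grad_mu = dlogp_dz.mean()               # chain rule: dz/dmu = 1
    grad_rho = (dlogp_dz * eps * sigma).mean() + 1.0   # dz/drho = sigma*eps
    mu += lr * grad_mu                      # gradient ASCENT on the ELBO
    rho += lr * grad_rho

sigma = float(np.exp(rho))
```

Optimizing ρ = log σ is the "stable parameterization" from above: σ can never go negative, and the entropy gradient becomes the constant 1, which keeps the variance update well-scaled.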

Behavior vs EM: VI trades exactness for scalability and flexibility (nonconjugate models, amortized inference). But it can under-estimate posterior variance (a common mean-field artifact) and the ELBO can be a loose proxy for predictive performance. You must evaluate with held-out log-likelihood estimates, calibration, or task metrics rather than trusting ELBO alone.

  • Common mistake: using an unconstrained variance parameter directly (can become negative) or failing to include the log|Jacobian| term when a transformation changes densities. For the standard Gaussian reparameterization z=μ+σε, the density is handled via log q(z) with σ>0.
  • Practical outcome: you can implement a working VI optimizer loop and understand what the ELBO is actually optimizing—and what it is not.
Section 6.6: Putting it together: objective design, optimizer choice, and evaluation

The unifying skill is objective design: decide what you are maximizing, and why. MLE maximizes log p(x|θ). MAP maximizes log p(x|θ) + log p(θ), acting like regularization with a probabilistic meaning. VI maximizes ELBO(φ,θ), approximating Bayesian inference by optimizing over distributions. These objectives can behave very differently: MAP can prevent degeneracy (e.g., mixture collapse), while VI introduces approximation bias but can scale and provide uncertainty estimates.

Capstone workflow (end-to-end): Build a small probabilistic model such as a diagonal-covariance GMM, then create three training modes: (1) MLE via EM, (2) MAP-EM with a Dirichlet prior on π and a variance floor, and (3) VI for a simplified latent model where you treat component assignments with a relaxed approximation and optimize an ELBO with gradients. Implement at least one optimizer yourself (e.g., SGD with momentum or Adam) for the gradient-based mode, including learning-rate schedules and gradient clipping. This forces you to connect derivations to code paths and numerics.

Optimizer choice: EM is often best when you have closed-form updates and moderate data sizes. For gradient-based ELBO training, Adam is a strong default because of scale differences between parameters (means vs log-variances vs logits). Still, monitor conditioning: if parameters have very different curvature, even Adam can stall; consider smaller step sizes, gradient clipping, or second-order structure (natural gradients in VI are a classic extension). For constrained parameters, prefer reparameterizations (softmax/softplus) over hard penalties that create ill-conditioning.

Evaluation and debugging: Track (a) training objective (log-likelihood or ELBO), (b) constraint satisfaction (simplex sums, positivity), and (c) predictive metrics on held-out data. If the objective diverges, check for numerical underflow/overflow (logsumexp), invalid parameters (negative variances), or step sizes that are too large. If progress plateaus, look for label switching or dead mixture components, poor initialization, or overly tight approximate families in VI. If results look “too certain,” suspect mean-field variance underestimation and validate with posterior predictive checks.

  • Common mistake: comparing MLE log-likelihood to ELBO values directly as if they were the same quantity. ELBO is a lower bound and depends on the variational family; use held-out estimates for apples-to-apples comparisons.
  • Practical outcome: you can choose between MLE, MAP, EM, and VI based on modeling goals and engineering constraints, implement stable training loops, and evaluate whether failures come from modeling assumptions or from optimization/numerics.
Chapter milestones
  • Solve constrained ML problems using Lagrangians and KKT conditions
  • Derive and implement EM for a simple latent-variable model
  • Understand ELBO and implement a minimal variational inference loop
  • Compare MLE/MAP/VI in terms of objectives and behavior
  • Capstone: end-to-end probabilistic model with an optimizer you implement
Chapter quiz

1. Which pairing best matches each method to how it handles constraints in optimization?

Show answer
Correct answer: Projections = exact feasibility each step; Penalties/barriers = soft or interior enforcement; Lagrangians/KKT = exact conditions at optimum
The chapter highlights three patterns: projections enforce feasibility directly, penalties/barriers incorporate constraints into the objective, and Lagrangians/KKT characterize optimality under constraints.

2. In the chapter’s view, EM for latent-variable models is best described as:

Show answer
Correct answer: Coordinate ascent on a bound with alternating updates, improving an objective each iteration
EM is presented as alternating optimization (E/M steps) that performs coordinate ascent on a bound.

3. What is the key objective optimized in variational inference as described in the chapter?

Show answer
Correct answer: An evidence lower bound (ELBO) that trades off data fit and approximate-posterior complexity
VI is framed as optimizing an ELBO, often with analytic mean-field updates or gradient estimators like reparameterization.

4. Which statement correctly distinguishes MLE, MAP, and VI in terms of what they optimize?

Show answer
Correct answer: MLE optimizes likelihood; MAP optimizes likelihood plus a prior term (posterior mode); VI optimizes an ELBO over an approximate posterior distribution
The chapter emphasizes writing down the correct objective: MLE for likelihood, MAP for posterior mode using priors, and VI for an ELBO with an approximate posterior.

5. If training becomes unstable due to constraints, ill-conditioning, or noisy gradient estimates, what workflow best reflects the chapter’s engineering focus?

Show answer
Correct answer: Translate assumptions into an objective, derive stable gradients/updates, then choose an optimizer suited to the objective’s geometry and noise
A core takeaway is to move from assumptions → objective (MLE/MAP/ELBO) → stable updates/gradients → optimizer choice, and to diagnose failures tied to constraints, conditioning, or gradient variance.