Machine Learning — Intermediate
Build gradient descent from scratch and make it converge on real data.
Gradient descent is the engine behind training most machine learning models, but many learners only see it as a formula on a slide: update parameters, repeat. In practice, the difference between a model that learns and a model that diverges is almost always in the details—learning rate, scaling, gradient correctness, batch size, curvature, and careful instrumentation. This book-style course turns gradient descent into something you can build, test, and debug by coding every step yourself.
Across six tightly connected chapters, you will implement a complete optimization toolkit in Python/NumPy: from a basic gradient descent loop to momentum and Adam, plus the monitoring and sanity checks that make training predictable. Each chapter is structured like a short chapter in a technical book: a clear goal, a small set of milestones, and focused sub-sections that build on the previous chapter’s code.
You start with intuition and visualization—what a loss surface is and how steps move you downhill—then quickly transition to correct gradient computation. Once the gradients are trustworthy, you learn the knobs that control convergence: learning rate and batch size, followed by stopping rules and experimental discipline. Next, you tackle the reasons real optimization is hard: ill-conditioning, plateaus, and unstable updates, then apply scaling and regularization to make training behave. With that foundation, you implement momentum and adaptive optimizers and benchmark them fairly. Finally, you apply everything in a capstone by training logistic regression and a small neural network using your own optimizer interface and debugging playbook.
This course is designed for learners who can write basic Python and want a concrete, working understanding of optimization. If you have ever asked “Why won’t my model converge?” or “Why does Adam work better here?” this course gives you a systematic way to answer those questions with evidence.
You won’t just run library calls. You will implement the algorithms, validate them with gradient checking, and learn how to interpret diagnostics like loss curves, gradient norms, and parameter trajectories. The emphasis is on mechanical sympathy: understanding what the optimizer is doing so you can make it work under constraints.
If you want to learn gradient descent in a way that sticks—by coding, testing, and debugging—start here. Register free to access the course, or browse all courses to compare learning paths.
Senior Machine Learning Engineer (Optimization & Training Systems)
Sofia Chen is a Senior Machine Learning Engineer focused on optimization, training stability, and scalable model evaluation. She has built production training pipelines and teaches practical methods for diagnosing and fixing non-converging models through clear math and reproducible code.
Optimization is the engine under nearly every machine learning model you will train. You pick a model family (like linear regression or a neural network), define how wrong the model is (a loss), and then use an optimizer to adjust parameters until the loss is acceptably small. This chapter builds intuition by turning each concept into a runnable experiment: a tiny function with a known minimum, a gradient descent loop you can inspect line by line, and diagnostics that tell you when training is healthy versus unstable.
You will start by setting up a minimal experiment template (a single Python file or notebook cell pattern) and then use it repeatedly: define a function, compute its gradient, update parameters, and log metrics. You will also visualize 1D/2D loss surfaces so you can “see” what your code is doing, rather than treating optimization as a black box. Finally, you will checkpoint your work by reproducing a known minimum with controlled randomness—an early habit that prevents wasted hours when later chapters introduce noisy gradients and more complex models.
By the end of this chapter, you will have a working gradient descent loop, a basic set of convergence signals, and the engineering judgment to spot the most common beginner mistakes (learning rate too large, missing scaling, incorrect gradient signs, and uninstrumented runs that hide failures).
Practice note for Set up the coding environment and experiment template: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Visualize 1D/2D loss surfaces and why minima matter: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Write your first gradient descent loop on a simple function: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Measure progress: loss, step size, and convergence signals: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: reproduce a known minimum with controlled randomness: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
In machine learning, “training” usually means solving an optimization problem: find parameters θ that minimize a loss function L(θ). The parameters might be a single number (slope of a line), a vector (weights in logistic regression), or millions of tensors (deep networks). The goal is the same: systematically reduce loss by changing parameters in a way guided by data.
Two practical ideas matter immediately. First, optimization is iterative. You do not solve for θ in one step; you run an update loop that gradually improves the model. Second, optimization is empirical engineering. You will choose hyperparameters (like learning rate) and stopping criteria and then observe whether loss decreases, oscillates, or explodes.
To make this concrete, set up an experiment template you will reuse throughout the course:
The template needs three pieces: a loss(theta) function that returns a scalar, a grad(theta) function that returns the gradient, and the current parameters theta, updated each iteration. A common mistake is to jump straight into model training without a known-good sandbox. In this chapter, you will optimize simple functions where the minimum is known or easy to verify. That gives you a reference point: if your loop cannot find the minimum of a quadratic, it will not reliably train a neural network.
Another mistake is to assume “lower loss” always means “better.” Optimization is about the loss you chose, not necessarily your real-world objective. Later chapters will add regularization and scaling to make the optimization problem better conditioned and more aligned with generalization, but first you need a crisp mental model of the optimization loop itself.
A loss function maps parameters to a single number: L: θ → ℝ. The set of all possible parameter values is the parameter space. When you visualize “loss surfaces,” you are plotting L(θ) over this space. For a single parameter, you can draw a 1D curve; for two parameters, you can draw a 2D contour plot or 3D surface. Beyond two dimensions, you rely on slices, projections, and diagnostics.
Start with a function that has an obvious minimum, such as L(x) = (x - 3)^2. In 1D, you can sample a grid of x values, compute L(x), and plot it. Then extend to 2D with something like L(x, y) = (x - 1)^2 + 5*(y + 2)^2. The coefficient 5 makes the surface “steeper” in y, which is a gentle introduction to curvature and why optimization can move faster in some directions than others.
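As a sketch of this surface-sampling idea, the snippet below evaluates the 2D bowl on a grid with NumPy and locates the grid point with the lowest loss; the variable names are illustrative, and you can feed the same xs, ys, and L arrays into matplotlib's contour() to draw the picture.

```python
import numpy as np

# Sample the 2D loss surface L(x, y) = (x - 1)^2 + 5*(y + 2)^2 on a grid.
xs = np.linspace(-4, 6, 201)
ys = np.linspace(-7, 3, 201)
X, Y = np.meshgrid(xs, ys)
L = (X - 1) ** 2 + 5 * (Y + 2) ** 2

# Index of the smallest sampled loss; it should land near the true minimum (1, -2).
i, j = np.unravel_index(np.argmin(L), L.shape)
print(xs[j], ys[i])  # close to 1.0 and -2.0
```

Notice how the factor 5 stretches the contours in y: equal-loss curves are ellipses, not circles, which previews the curvature discussion later in the course.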
Why do minima matter? Because a minimum corresponds to parameters that best fit your chosen objective. In real models, the minimum may be global (convex problems like least squares) or one of many local minima (common in deep learning). Even if the landscape is complex, the update rule depends only on local information, so understanding a simple surface helps you interpret training curves later.
Engineering judgment: before tuning any optimizer, examine the scale of your parameters and losses. If one parameter has a natural scale of 1e-3 and another 1e3, your surface will look stretched, and a single learning rate can behave poorly. This is one reason feature scaling and regularization often improve optimization stability—they reshape the loss surface into something easier to navigate.
The gradient tells you how to change parameters to increase the loss fastest; therefore, moving in the negative gradient direction decreases the loss fastest (locally). In 1D, the gradient is just the derivative: if L(x) = (x - 3)^2, then dL/dx = 2(x - 3). When x > 3, the gradient is positive, so subtracting it moves left toward 3. When x < 3, the gradient is negative, so subtracting it moves right toward 3.
In 2D, the gradient is a vector of partial derivatives. For L(x, y) = (x - 1)^2 + 5*(y + 2)^2, the gradient is [2(x - 1), 10(y + 2)]. This immediately explains why naive steps can overshoot in the steep direction: the y component of the gradient can be much larger than the x component, so the same learning rate produces much larger movement in y.
Write your first gradient descent loop here, but keep it intentionally simple: initialize theta, compute g = grad(theta), update theta = theta - lr * g, and repeat. The key learning is not the math; it’s seeing the gradient values, step sizes, and loss values evolve together.
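A minimal version of that loop, on the 1D quadratic L(x) = (x - 3)^2 whose minimum is x = 3, might look like this (the names loss, grad, and lr follow the chapter's template):

```python
# Minimal gradient descent on L(x) = (x - 3)^2, whose minimum is x = 3.
def loss(theta):
    return (theta - 3.0) ** 2

def grad(theta):
    return 2.0 * (theta - 3.0)

theta = -5.0  # deliberately start far from the minimum
lr = 0.1
for step in range(100):
    g = grad(theta)
    theta = theta - lr * g  # the core update rule

print(theta)  # close to 3.0
```

Each update multiplies the error (theta - 3) by (1 - 2*lr) = 0.8, so the iterates shrink toward 3 geometrically.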
Common mistakes include (1) using the wrong sign (adding the gradient instead of subtracting), (2) mixing up shapes (treating a row vector as a column vector), and (3) implementing an incorrect gradient. Even a small algebraic mistake can produce “almost plausible” progress that stalls or diverges. Later in the course you will use numerical gradient checking to verify gradients; for now, validate by comparing your code’s behavior to the known minimum and to plots of the surface.
The basic gradient descent update rule is:
theta_{t+1} = theta_t - learning_rate * grad(theta_t)
This single line contains most of the engineering decisions you will make in optimization. The learning rate (often lr or alpha) controls how far you move each step. Too small and training is painfully slow; too large and you bounce around or diverge. Your first job is to develop a feel for what “too large” looks like in metrics and plots.
Start with a deterministic function (no data sampling) so you can see clean behavior. Try a learning rate like 0.1 on a quadratic and observe monotonic loss decrease. Then increase to 1.0 and observe whether it oscillates or overshoots. This is not busywork: it trains your intuition for later when losses are noisier and the surface is not a simple bowl.
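The experiment above can be sketched as a tiny sweep. On L(x) = (x - 3)^2 each step multiplies the error by (1 - 2*lr), so lr = 0.1 shrinks it steadily, lr = 1.0 flips its sign forever (pure oscillation), and lr = 1.1 grows it without bound; the helper name run is illustrative.

```python
# Run the same quadratic with several learning rates and compare the
# final distance from the minimum x = 3 after a fixed number of steps.
def run(lr, steps=50, x0=0.0):
    x = x0
    for _ in range(steps):
        x = x - lr * 2.0 * (x - 3.0)  # gradient of (x - 3)^2 is 2(x - 3)
    return abs(x - 3.0)

small, critical, large = run(0.1), run(1.0), run(1.1)
print(small, critical, large)  # tiny, stuck at 3.0, huge
```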
Hyperparameters to track from the start: the learning rate, the maximum number of iterations, the stopping tolerance, the initial parameter values, and the random seed.
Connect this to batch vs. stochastic thinking: on a fixed analytic function, your gradient is exact (like batch gradient descent on a full dataset). When you later estimate gradients from a subset of data (mini-batch or stochastic), the update rule is the same but the gradient becomes noisy. The best practice is to build a reliable update loop now, then swap in different gradient sources later without rewriting everything.
Optimization runs are not judged by hope; they are judged by signals. Convergence means you are approaching a stable region where additional steps produce tiny improvements. Divergence means your updates are pushing you away from a minimum, often explosively. Plateaus and slowdowns can happen even when things are “working,” so you need criteria that separate healthy slow progress from broken updates.
Practical convergence signals to implement immediately:
- gradient norm ||g|| shrinking as you approach a stationary point
- step size ||lr * g|| becoming small
- parameter change ||theta_{t+1} - theta_t|| below a tolerance

Practical divergence signals:

- loss increasing over consecutive iterations
- loss, gradients, or parameters reaching inf/nan

Engineering judgment: do not wait 10,000 iterations to discover a problem. Add early stopping checks such as “stop if loss is nan” or “stop if loss increases by 10× in one step.” Another common mistake is to declare convergence just because the loss stops changing—if your learning rate is extremely small, you can create an artificial plateau. That is why you should track both loss change and gradient norm; a large gradient with tiny steps suggests your learning rate is throttling progress, not that you are at a minimum.
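These guard rails can be folded into the loop itself. The sketch below (the function name guarded_descent and the specific thresholds are illustrative choices) stops on nan/inf loss or a 10× loss jump, and declares convergence only when both the loss change and the gradient magnitude are small:

```python
import math

# Gradient descent with divergence guards and a two-signal convergence test.
def guarded_descent(loss, grad, theta, lr, max_steps=1000, tol=1e-8):
    prev = loss(theta)
    for step in range(max_steps):
        g = grad(theta)
        theta = theta - lr * g
        cur = loss(theta)
        if math.isnan(cur) or math.isinf(cur):
            return theta, "diverged (nan/inf)"
        if prev > 0 and cur > 10.0 * prev:
            return theta, "diverged (10x jump)"
        # Require BOTH a tiny loss change AND a small gradient, so an
        # artificially small learning rate cannot fake convergence.
        if abs(prev - cur) < tol and abs(g) < 1e-4:
            return theta, "converged"
        prev = cur
    return theta, "max_steps"

theta, status = guarded_descent(lambda x: (x - 3) ** 2,
                                lambda x: 2 * (x - 3), 0.0, 0.1)
print(status, theta)
```

Running the same function with lr = 10.0 trips the 10× jump guard on the very first step instead of looping uselessly.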
This section sets up your later ability to diagnose ill-conditioned curvature: when one direction is steep and another is flat, you can see slow zig-zagging in parameter space and recognize that tuning, scaling, or momentum-like methods are needed.
Good optimization code is observable. If you cannot answer “what happened on step 37?” you will not be able to debug learning rate issues, gradient bugs, or numerical instability. Your experiment template should log a compact but informative record each iteration, and it should produce plots that you can compare across runs.
At minimum, store these arrays:
- loss_history: scalar loss per iteration
- grad_norm_history: np.linalg.norm(g) per iteration
- step_norm_history: np.linalg.norm(lr * g) per iteration
- theta_history: parameter values (or a sampled subset for high dimensions)

Then plot loss vs. iteration and (for 2D examples) plot the path of theta over loss contours. Seeing the “zig-zag” pattern is often more educational than any single metric, and it directly connects to later methods like momentum and adaptive learning rates.
Controlled randomness matters even in early chapters. When you introduce randomized initializations or noisy gradients, you must be able to reproduce behavior. Always set seeds consistently (e.g., random.seed(0) and np.random.seed(0)) and record them in your run metadata. A simple checkpoint for this chapter is: pick a function with a known minimum (like a quadratic bowl), choose a random initialization with a fixed seed, and verify that your code reaches the expected parameter values within a tolerance. If changing the seed changes whether you “succeed,” you likely have a learning rate or stopping criterion that is too fragile.
Finally, adopt a small “run header” printout or log dictionary: learning rate, max iterations, tolerance, seed, and initial parameters. This habit scales: as you add stochastic and mini-batch variants, learning rate schedules, and optimizers like Adam, you will be able to compare runs systematically instead of guessing which change helped.
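Putting the pieces together, a seeded and instrumented run on the 2D bowl might look like the sketch below; the header dictionary and history names follow this chapter's conventions but are otherwise illustrative.

```python
import numpy as np

# Instrumented 2D run: log loss, gradient norm, step norm, and theta each
# iteration, with a seeded random initialization so the run is reproducible.
rng = np.random.default_rng(0)  # fixed seed: same initialization every run

def loss(t):
    return (t[0] - 1.0) ** 2 + 5.0 * (t[1] + 2.0) ** 2

def grad(t):
    return np.array([2.0 * (t[0] - 1.0), 10.0 * (t[1] + 2.0)])

lr, max_iters = 0.05, 500
theta = rng.normal(size=2)
header = {"lr": lr, "max_iters": max_iters, "seed": 0, "theta0": theta.copy()}

loss_history, grad_norm_history = [], []
step_norm_history, theta_history = [], []
for _ in range(max_iters):
    g = grad(theta)
    loss_history.append(loss(theta))
    grad_norm_history.append(np.linalg.norm(g))
    step_norm_history.append(np.linalg.norm(lr * g))
    theta_history.append(theta.copy())
    theta = theta - lr * g

print(header, theta)  # theta should end near (1, -2)
```

Rerunning this cell reproduces the exact same trajectory, which is the checkpoint behavior this section asks for.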
1. Which sequence best matches the chapter’s minimal experiment template for running an optimization loop?
2. Why does the chapter emphasize visualizing 1D/2D loss surfaces early on?
3. What combination of signals is most aligned with the chapter’s idea of measuring progress during gradient descent?
4. A run becomes unstable and the loss increases dramatically after updates. Based on the chapter’s common beginner mistakes, what is the most likely cause?
5. What is the purpose of checkpointing by reproducing a known minimum with controlled randomness?
Gradient descent is only as good as the gradients you feed it. In production code, “close enough” gradients are not close enough: a missing factor of 2, a sign error, or an accidental broadcast can turn steady convergence into divergence or a plateau that never ends. This chapter builds the gradient pipeline end-to-end: start from scalar derivatives, upgrade to partial derivatives, then express everything in vector form so your implementation is fast, readable, and hard to misuse.
We will derive the mean squared error (MSE) gradient for linear regression by hand, implement it in NumPy without loops, and then validate it using finite-difference gradient checking. Along the way, you’ll see why bias terms deserve special handling, which shapes you should standardize on, and how to set tolerances so you can confidently say “my analytical gradient matches the numerical gradient.”
The practical outcome is simple: after this chapter, you should be able to write a linear-model training step from scratch and trust it. That trust is the foundation for the later chapters where we add momentum, adaptive optimizers, learning-rate schedules, and regularization—none of which matter if the base gradient is wrong.
Practice note for Derive gradients for MSE linear regression by hand: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement vectorized gradients with NumPy: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Validate gradients with finite differences (gradient checking): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Handle bias terms, shapes, and broadcasting safely: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: match analytical and numerical gradients within tolerance: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start with the smallest unit: a scalar function of a scalar. If f(w) = w^2, then df/dw = 2w. This is familiar, but the key idea for optimization is the interpretation: the derivative is the local slope, i.e., how much the function changes for a tiny change in w. Gradient descent uses this slope to move downhill: w := w - α df/dw.
Machine learning parameters are rarely scalar. For multiple parameters, you’ll use partial derivatives. If f(w, b) = (w x + b - y)^2, then ∂f/∂w treats b as constant, and ∂f/∂b treats w as constant. Optimization updates all parameters simultaneously using the gradient vector ∇f = [∂f/∂w, ∂f/∂b].
Two rules cover most derivations you’ll need here: the chain rule and linearity of differentiation. Chain rule is what connects the model output to the loss. For example, if f = (r)^2 and r = w x + b - y, then ∂f/∂w = (∂f/∂r)(∂r/∂w) = 2r · x. The same pattern repeats throughout deep learning, only with more layers.
Engineering judgment: write the computation as a sequence of simple intermediate variables (residuals, predictions, etc.), then differentiate each piece. This reduces sign mistakes and makes it easier to later mirror the derivation in code.
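That advice can be checked numerically on the running example: for f = r^2 with r = w*x + b - y, the chain rule gives df/dw = 2*r*x, and a central difference should agree to high precision (the specific numbers below are arbitrary).

```python
# Verify the chain-rule result df/dw = 2*r*x against a central difference.
x, y, w, b = 1.5, 2.0, 0.7, -0.3

def f(w):
    r = w * x + b - y  # intermediate residual, as recommended
    return r ** 2

r = w * x + b - y
analytic = 2.0 * r * x  # chain rule: (df/dr) * (dr/dw) = 2r * x

eps = 1e-6
numeric = (f(w + eps) - f(w - eps)) / (2 * eps)
print(analytic, numeric)  # both near -3.75
```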
Real datasets are collections of examples, so losses are sums (or means) over samples. The moment you see Σ, you should ask: “Can I express this as matrix operations?” Vectorization is not just speed; it is also clarity and fewer bug surfaces.
Set a consistent convention. A common one in NumPy: X has shape (n_samples, n_features), weights w has shape (n_features,) (or (n_features, 1)), and targets y has shape (n_samples,). Predictions are ŷ = Xw + b with shape (n_samples,). Residuals are r = ŷ - y.
Useful identities (with shapes in mind):
Dot product as sum: (Xw)_i = Σ_j X_{ij} w_j, computed as X @ w.
Sum of squared residuals: rᵀr is r @ r.
Gradient of quadratic form: if L = rᵀr and r = Xw - y, then ∂L/∂w = 2 Xᵀ r.
Why do shapes matter? Because many gradient bugs are “silent”: NumPy broadcasting can produce an array of the wrong shape that still computes without error. Decide early whether you represent vectors as rank-1 arrays (d,) or column vectors (d,1), then stick to it. Rank-1 is convenient, but be explicit with keepdims and sanity checks.
We’ll derive gradients for MSE linear regression by hand, then you will implement the same formula. Model: ŷ = Xw + b. Loss (mean squared error):
J(w, b) = (1/n) Σ_i (ŷ_i - y_i)^2 = (1/n) Σ_i (x_i·w + b - y_i)^2.
Define residuals r_i = ŷ_i - y_i. Then J = (1/n) Σ_i r_i^2. Differentiate with respect to w using chain rule:
∂J/∂w = (1/n) Σ_i 2 r_i ∂r_i/∂w. But r_i = x_i·w + b - y_i, so ∂r_i/∂w = x_i. Therefore:
∂J/∂w = (2/n) Σ_i r_i x_i.
Stacking all samples into matrices, the same result becomes:
∇_w J = (2/n) Xᵀ r, where r = Xw + b - y.
For the bias term, treat it as its own parameter. Since ∂r_i/∂b = 1:
∂J/∂b = (2/n) Σ_i r_i = (2/n) 1ᵀ r which in NumPy is simply (2/n) * r.sum().
Common decision: many texts define MSE as (1/2n) Σ r_i^2 to cancel the factor of 2. Either is fine, but your code and your gradient check must match the same definition. Scaling errors here change the effective learning rate and can make “α seems wrong” problems that are actually “gradient scaled wrong” problems.
A correct gradient that takes 10 seconds per step is not useful. For linear regression, you can compute loss and gradients with a handful of vectorized operations. Here is a practical NumPy pattern that is fast and minimizes shape surprises.
Assume X is (n, d), w is (d,), b is a scalar, and y is (n,). Then:
y_hat = X @ w + b
r = y_hat - y
loss = (r @ r) / n (or (r @ r) / (2*n) if using half-MSE)
grad_w = (2/n) * (X.T @ r)
grad_b = (2/n) * r.sum()
These operations map directly to the derivation, which is exactly what you want: fewer translation steps from math to code. Avoid per-sample loops like for i in range(n): unless you are deliberately implementing stochastic gradient descent; even then, keep an efficient batch version for comparison and debugging.
Bias handling options:
Separate scalar bias: simplest gradients and least broadcasting risk.
Augmented feature vector: append a column of ones to X and include b inside w. This can simplify code but requires careful construction to avoid accidentally regularizing the bias later.
Practical outcome: once you can compute (loss, grad_w, grad_b) reliably and quickly, implementing batch, stochastic, and mini-batch gradient descent becomes a question of which subset of rows from X and y you feed into the same function.
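A minimal sketch of that single function, combining the vectorized lines above (the name loss_and_grads is an illustrative choice, not a required interface):

```python
import numpy as np

# One function returning (loss, grad_w, grad_b) for MSE linear regression.
# Shapes: X is (n, d), w is (d,), b is a scalar, y is (n,).
def loss_and_grads(X, w, b, y):
    n = X.shape[0]
    y_hat = X @ w + b
    r = y_hat - y
    loss = (r @ r) / n               # mean squared error
    grad_w = (2.0 / n) * (X.T @ r)   # matches the derivation: (2/n) X^T r
    grad_b = (2.0 / n) * r.sum()     # matches (2/n) 1^T r
    return loss, grad_w, grad_b

# Smoke test: at the true parameters of noiseless data, everything is zero.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.25
y = X @ w_true + b_true
loss, gw, gb = loss_and_grads(X, w_true, b_true, y)
print(loss, gw, gb)
```

Batch, mini-batch, and stochastic variants then just call this function on different row subsets of X and y.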
Analytical gradients can be wrong in subtle ways. Numerical gradient checking is the safety net: approximate the derivative by measuring how the loss changes when you nudge a parameter. For a parameter component θ_k, central difference is a strong default:
g_k ≈ (J(θ + ε e_k) - J(θ - ε e_k)) / (2ε).
In code, you loop over parameters (for linear regression, each weight and the bias), perturb one at a time, recompute the loss, and assemble a numerical gradient vector. Then compare to the analytical gradient using a relative error metric such as:
rel_err = ||g_num - g_ana|| / max(1, ||g_num|| + ||g_ana||).
Set an explicit tolerance. For float64 and well-scaled problems, 1e-6 to 1e-7 relative error is often achievable. For float32, looser tolerances like 1e-4 may be appropriate.
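Here is a sketch of the full check for the MSE loss, packing (w, b) into one vector theta; the helper names are illustrative, and the rel_err formula is the one given above.

```python
import numpy as np

# Central-difference gradient check for MSE linear regression.
def mse_loss(theta, X, y):
    w, b = theta[:-1], theta[-1]
    r = X @ w + b - y
    return (r @ r) / len(y)

def mse_grad(theta, X, y):
    w, b = theta[:-1], theta[-1]
    r = X @ w + b - y
    n = len(y)
    return np.append((2.0 / n) * (X.T @ r), (2.0 / n) * r.sum())

def numerical_grad(f, theta, eps=1e-5):
    g = np.zeros_like(theta)
    for k in range(theta.size):
        e = np.zeros_like(theta)
        e[k] = eps  # perturb exactly one component at a time
        g[k] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
X, y = rng.normal(size=(20, 4)), rng.normal(size=20)
theta = rng.normal(size=5)

g_ana = mse_grad(theta, X, y)
g_num = numerical_grad(lambda t: mse_loss(t, X, y), theta)
rel_err = (np.linalg.norm(g_num - g_ana)
           / max(1, np.linalg.norm(g_num) + np.linalg.norm(g_ana)))
print(rel_err)  # far below 1e-6 in float64
```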
Common error sources that make gradient checks fail even when the derivation is conceptually right:
ε too small or too large: too small causes catastrophic cancellation; too large causes truncation error. Start around 1e-5 for float64.
Nondeterminism: if your loss involves randomness (dropout, sampling), fix seeds or disable randomness during checks.
Regularization mismatch: if your loss includes L2 penalty, your analytical gradient must include it too (and bias handling must match your design choice).
Shape-induced broadcasting: you may be perturbing one parameter but affecting multiple entries due to unintended views or broadcasts.
Checkpoint standard: your analytical and numerical gradients should match within tolerance for multiple random initializations. Do not run gradient descent until this checkpoint passes.
Most “gradient descent doesn’t work” reports come down to a small set of mistakes. The goal is to recognize them quickly, test for them, and harden your code so they are unlikely to recur.
1) Sign errors. If you compute residuals as r = y - y_hat but derive gradients assuming r = y_hat - y, you will move uphill. Symptom: loss increases steadily even with small learning rates. Fix: decide residual convention once, then mirror it consistently in loss and gradient.
2) Missing factors (2, 1/n, 1/2). MSE definitions vary. If your loss is mean of squared residuals but your gradient is for sum (or half-mean), updates will be scaled incorrectly. Symptom: learning rate seems “mysteriously” too large or too small. Fix: write the loss formula at the top of your function and derive from that exact expression.
3) Shape mismatches hidden by broadcasting. For example, subtracting y shaped (n,1) from y_hat shaped (n,) broadcasts to an (n,n) matrix. Symptom: loss is huge, gradients have the wrong shape, but no exception is thrown. Fix: enforce shapes with assertions like assert y.ndim == 1, or explicitly reshape: y = y.reshape(-1).
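This trap is easy to demonstrate in a few lines (a toy sketch with made-up data):

```python
import numpy as np

# Subtracting a (n,) array from a (n,1) array silently yields (n,n),
# not the (n,) residual vector you intended -- and raises no exception.
n = 4
y = np.arange(n, dtype=float).reshape(-1, 1)  # shape (n, 1)
y_hat = np.arange(n, dtype=float)             # shape (n,)

bad = y_hat - y               # broadcasts to (n, n)
good = y_hat - y.reshape(-1)  # shape (n,), the intended residual

print(bad.shape, good.shape)  # (4, 4) (4,)
```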
4) Bias treatment bugs. If you augment X with ones, you might accidentally apply feature scaling or regularization to that bias column. Symptom: bias behaves oddly (stuck near 0, or drifting). Fix: either keep b separate or explicitly exclude the bias index from penalties and scaling.
5) Inconsistent dtype and precision. Gradient checking is sensitive to numerical precision. Symptom: gradient check fails only sometimes, or only for certain ε. Fix: use float64 during checks, then optionally move to float32 for speed later.
Practical workflow: implement analytical gradients, run gradient checking on random small problems, add assertions for shapes, and only then build training loops for batch/mini-batch/stochastic updates. This is the fastest route to reliable convergence later when the models and optimizers get more complex.
1. Why does the chapter emphasize deriving the MSE gradient for linear regression by hand before writing NumPy code?
2. What is the main purpose of finite-difference gradient checking in this chapter?
3. A model trains but converges extremely slowly or diverges. Which implementation issue from this chapter is most likely to cause that behavior?
4. Why does the chapter call out bias terms as needing special handling?
5. What does the chapter’s checkpoint (“match analytical and numerical gradients within tolerance”) provide in practice?
In Chapter 2 you built gradient descent loops and learned to trust them via gradient checking. Now you’ll make those loops reliable in the real world, where “correct” code can still diverge, crawl, or bounce forever. Three knobs dominate day-to-day optimization behavior: the learning rate (how far you step), the batch size (how noisy the direction is), and the stopping rules (when you decide you’re done).
This chapter is intentionally practical. You will run batch gradient descent, SGD, and mini-batch side-by-side on the same dataset, tune learning rates with a repeatable sweep, and add schedules and warmup to stabilize early training. You’ll also implement stopping criteria that respect validation behavior rather than wishful thinking. The goal is not just convergence—it’s fast, stable convergence you can reproduce and explain.
As you work through the sections, keep a simple “tuning log” (a text file or notebook cell) where you record: batch size, base learning rate, schedule, warmup steps, stopping rule, and the best validation metric. This log becomes your engineering memory and the foundation for the checkpoint exercise at the end of the chapter.
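One way to make the three variants share a single update loop is a batch iterator; in the sketch below (the function name minibatches is an illustrative choice), batch_size = n gives batch gradient descent, batch_size = 1 gives SGD, and anything between is mini-batch.

```python
import numpy as np

# One shuffled pass over the data, yielding mini-batches of rows.
def minibatches(X, y, batch_size, rng):
    idx = rng.permutation(len(y))  # reshuffle each epoch
    for start in range(0, len(y), batch_size):
        take = idx[start:start + batch_size]
        yield X[take], y[take]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(10, 2)), rng.normal(size=10)
sizes = [len(yb) for _, yb in minibatches(X, y, 4, rng)]
print(sizes)  # the last batch may be smaller, e.g. [4, 4, 2]
```

Because the gradient function from Chapter 2 already accepts arbitrary row subsets, swapping batch sizes changes only this iterator, never the update rule.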
Practice note for Compare batch vs. SGD vs. mini-batch on the same dataset: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Tune learning rates systematically (sweeps and heuristics): document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add learning-rate schedules and warmup: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Design stopping criteria: patience, thresholds, and max steps: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: achieve fast, stable convergence with a documented tuning log: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
All three variants compute the same gradient formula; they differ only in how many samples you use to estimate it. In batch gradient descent, you compute the gradient using the full dataset each step. This gives a low-noise direction and smooth learning curves, but each update can be expensive, and you may take fewer steps per second.
In stochastic gradient descent (SGD), you update using one example at a time. SGD is cheap per step and can escape shallow local structure because of its noise, but the path is jagged; a “good” learning rate for SGD is usually smaller than for batch because the gradient estimate is high variance.
Mini-batch gradient descent sits in the middle (e.g., 16–1024 samples per step). It is typically the default in deep learning because it vectorizes well and has manageable noise. You get many steps per epoch (like SGD) but better hardware efficiency and a more stable direction.
To compare them on the same dataset, hold everything else constant: model, initialization, preprocessing, and total number of examples seen (e.g., compare by “epochs” or by “total samples processed”). Plot training loss vs. wall-clock time and vs. steps; the first reveals efficiency, the second reveals algorithmic behavior. A common mistake is comparing “1000 updates” across methods without accounting for the fact that one batch update may process 10,000 examples while one SGD update processes 1.
Practical outcome: you should be able to run a single script that toggles batch_size among {1, 32, N}, logs loss/metric curves, and produces a short note explaining which regime converged fastest and which was most stable.
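As a concrete starting point, here is a minimal sketch of such a script (the function name run_gd and the toy dataset are illustrative, not part of the course's reference code):

```python
import numpy as np

def run_gd(X, y, batch_size, lr=0.1, epochs=20, seed=0):
    """Mini-batch gradient descent on linear regression (MSE).

    batch_size=len(X) gives batch GD; batch_size=1 gives SGD.
    Returns final weights and the full-data loss after each epoch.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    losses = []
    for _ in range(epochs):
        idx = rng.permutation(n)                      # reshuffle each epoch
        for start in range(0, n, batch_size):
            b = idx[start:start + batch_size]
            grad = X[b].T @ (X[b] @ w - y[b]) / len(b)
            w -= lr * grad
        losses.append(float(np.mean((X @ w - y) ** 2)))
    return w, losses

# Toy regression problem: y = 2*x0 - 1*x1 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X @ np.array([2.0, -1.0]) + 0.01 * rng.normal(size=200)

for bs in (1, 32, len(X)):                            # SGD, mini-batch, batch
    w, losses = run_gd(X, y, bs)
```

Because all three runs share the data, initialization, and epoch count, differences in the loss curves reflect the batch-size choice alone.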
If you only tune one hyperparameter first, tune the learning rate. It controls stability more than any other knob: too large and the loss explodes or oscillates; too small and training looks “stuck,” even if the gradient is correct. Think of the learning rate as converting a direction (the gradient) into a displacement (the update). When curvature is steep in some directions and flat in others, the maximum stable learning rate is set by the steep directions.
A systematic tuning workflow is more reliable than intuition. Start with a coarse log-scale sweep (for example: 1e-5, 3e-5, 1e-4, 3e-4, ... 1). Run each candidate for a short budget (say 200–1000 steps) and watch for three signals: (1) immediate divergence (loss becomes NaN/inf or increases rapidly), (2) fast initial decrease, (3) stable but slow progress. Pick the largest learning rate that is clearly stable and makes rapid early progress; then refine around it with a smaller sweep (e.g., multiply/divide by 2).
Common mistakes: changing multiple things at once (learning rate and batch size and regularization), judging based on a single noisy mini-batch loss, or using training loss only when overfitting is present. Record your learning-rate sweep results in your tuning log with a consistent format (rate, batch size, steps, final train loss, best validation metric). That documentation turns “trial and error” into an engineering process you can reuse.
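The sweep itself is only a few lines. The sketch below (sweep_lr is a hypothetical helper) runs each candidate rate for a short budget on an ill-conditioned toy quadratic and records the final loss, reporting divergence as infinity:

```python
import numpy as np

np.seterr(all="ignore")  # diverging candidates overflow by design

def sweep_lr(loss_and_grad, w0, rates, steps=300):
    """Coarse learning-rate sweep: one short plain-GD run per candidate."""
    results = {}
    for lr in rates:
        w = w0.copy()
        for _ in range(steps):
            loss, g = loss_and_grad(w)
            if not np.isfinite(loss):                 # diverged: stop early
                break
            w -= lr * g
        loss, _ = loss_and_grad(w)
        results[lr] = float(loss) if np.isfinite(loss) else float("inf")
    return results

def quad(w):
    """Ill-conditioned quadratic: f(w) = 0.5*(w0^2 + 25*w1^2)."""
    h = np.array([1.0, 25.0])
    return 0.5 * float(np.sum(h * w * w)), h * w

rates = [1e-4, 1e-3, 1e-2, 1e-1, 1.0]                 # log-scale grid
table = sweep_lr(quad, np.array([1.0, 1.0]), rates)
best = min(table, key=lambda r: table[r])
```

Here the largest clearly stable rate wins; the steep direction (curvature 25) caps stability well below what the shallow direction could tolerate, which is exactly the conditioning story from later chapters.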
A fixed learning rate is rarely ideal from start to finish. Early training often benefits from larger steps to move quickly toward a good region, while later training benefits from smaller steps to avoid bouncing around a minimum. Learning-rate schedules formalize this change over time.
Step decay is the simplest: keep the learning rate constant, then drop it by a factor (often 0.1 or 0.5) at predetermined milestones (epochs or steps). It’s easy to implement and surprisingly effective when you can identify plateaus. A practical heuristic is: if validation metric hasn’t improved for a while (patience), drop the rate once and see if progress resumes.
Exponential decay multiplies the learning rate by a constant factor each step or each epoch, creating a smooth decrease. It’s useful when you want predictable, gradual cooling. Be careful not to decay too fast; otherwise you end up with a learning rate so small that the optimizer stops making meaningful progress long before convergence.
Cosine decay decreases the rate following a cosine curve from an initial value to a minimum value. It tends to be gentle early and more aggressive later, often producing good late-stage refinement. It’s popular because it works well without precise milestone selection.
Warmup is a special case worth treating as standard practice. For the first few hundred or thousand steps, linearly increase the learning rate from a small value to your target base rate. Warmup helps when early gradients are unstable (common with random initialization, normalization layers, or large batch sizes). It reduces early divergence without forcing you to keep the entire run at a tiny rate.
Implement each schedule as a function lr(step) to keep your training loop clean. Practical outcome: you should be able to run the same model with (a) fixed rate, (b) step decay, (c) exponential decay, and (d) cosine decay + warmup, then explain which curve gives the best stability early and the best refinement late.
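One way to package all of these behind a single lr(step) function (the factory name make_schedule and its defaults are illustrative choices, not a fixed API):

```python
import math

def make_schedule(base_lr, total_steps, warmup_steps=0, kind="cosine",
                  decay=0.99, milestones=(), drop=0.1, min_lr=0.0):
    """Return lr(step): one function the training loop can call each step.

    kinds: "constant", "step" (drop at post-warmup milestones), "exp"
    (multiply by `decay` each step), "cosine" (decay to min_lr). Linear
    warmup to base_lr over the first warmup_steps applies to all kinds.
    """
    def lr(step):
        if warmup_steps and step < warmup_steps:
            return base_lr * (step + 1) / warmup_steps      # linear ramp
        t = step - warmup_steps
        horizon = max(1, total_steps - warmup_steps)
        if kind == "step":
            return base_lr * drop ** sum(t >= m for m in milestones)
        if kind == "exp":
            return base_lr * decay ** t
        if kind == "cosine":
            frac = min(t / horizon, 1.0)
            return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * frac))
        return base_lr                                      # constant
    return lr

lr = make_schedule(0.1, total_steps=1000, warmup_steps=100, kind="cosine")
```

With this shape, swapping schedules means swapping one function, and the training loop never needs to know which schedule is active.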
Batch size is not just a throughput choice; it changes the statistics of your gradient estimate. A mini-batch gradient is an unbiased estimate of the full gradient (assuming sampling is representative), but it has variance. Smaller batches increase variance, which can slow convergence in flat directions but can also help exploration and reduce overfitting in some settings.
A practical way to think about it: the learning rate sets the step size, while batch size controls how reliable the direction is. If your batch is tiny, the gradient direction may fluctuate so much that a learning rate that is stable for batch GD becomes unstable for SGD. Conversely, increasing batch size often allows a larger learning rate, but the relationship is not perfectly linear and depends on curvature and model architecture.
To diagnose whether batch size is hurting you, plot (1) training loss, (2) validation metric, and (3) gradient norm (or update norm) over time. If gradient norms explode occasionally with small batches, reduce the learning rate, increase batch size, or add gradient clipping (even if you haven’t “officially” covered it yet, clipping is a practical stabilizer). If progress is smooth but slow with large batches, try a slightly larger learning rate, add a schedule, or reduce batch size to increase update frequency.
Practical outcome: you should be able to justify a chosen batch size not only by GPU memory, but by observed gradient noise and convergence behavior.
Stopping rules are where optimization meets generalization. Training loss will often keep decreasing even after validation performance peaks. Without a stopping rule, you risk wasting compute and overfitting. The most reliable signal is a validation metric measured on data not used for updates.
Early stopping with patience is a robust default: keep training while the validation metric improves; if it fails to improve for patience evaluations, stop. Combine patience with a small minimum improvement threshold (sometimes called min_delta) to avoid stopping due to tiny random fluctuations. Also include a maximum steps/epochs cap to bound cost even if the metric is noisy.
Checkpointing complements early stopping. Save model parameters whenever validation improves (or at regular intervals). Then, when training ends—whether due to patience or max steps—you restore the best checkpoint rather than the final state. This is critical because the best validation point often occurs earlier than the last step.
Common mistakes: stopping based on training loss only, not restoring the best checkpoint, and using a patience that is shorter than the natural oscillation period of your validation curve (especially with small batches). Practical outcome: you should have a training loop that can stop automatically and reliably produce the best-known parameters for downstream evaluation.
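A minimal early-stopping helper that combines patience, min_delta, a max-evaluation cap, and best-checkpoint capture might look like this (the class name and the higher-is-better metric convention are illustrative assumptions):

```python
import copy

class EarlyStopper:
    """Early stopping with patience, min_delta, and a best checkpoint.

    Call update(metric, params) after each validation evaluation; it
    returns True when training should stop. Assumes higher metric = better.
    """
    def __init__(self, patience=5, min_delta=1e-4, max_evals=1000):
        self.patience, self.min_delta, self.max_evals = patience, min_delta, max_evals
        self.best = float("-inf")
        self.best_params = None
        self.bad_evals = 0
        self.evals = 0

    def update(self, metric, params):
        self.evals += 1
        if metric > self.best + self.min_delta:
            self.best = metric
            self.best_params = copy.deepcopy(params)  # checkpoint on improvement
            self.bad_evals = 0
        else:
            self.bad_evals += 1
        return self.bad_evals >= self.patience or self.evals >= self.max_evals

# Simulated validation curve: improves, peaks at 0.75, then degrades
stopper = EarlyStopper(patience=3, min_delta=0.0)
curve = [0.60, 0.70, 0.75, 0.74, 0.73, 0.72, 0.71]
for step, acc in enumerate(curve):
    if stopper.update(acc, {"w": step}):
        break
```

When training ends, restore stopper.best_params for downstream evaluation rather than the final parameters.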
Optimization tuning becomes manageable when you treat it like an experiment, not a gamble. The discipline is: change one variable at a time, keep runs short until you find promising regions, and summarize results in an ablation table. This section ties together the chapter’s lessons into a repeatable workflow that leads directly to the chapter checkpoint: “fast, stable convergence with a documented tuning log.”
Start with a baseline configuration: choose mini-batch (e.g., 32 or 128), a simple fixed learning rate from a coarse sweep, and a max-step budget. Confirm basic sanity: loss decreases, no NaNs, gradients are finite. Then perform targeted ablations:
Vary one knob at a time: batch size (e.g., 32 vs. 128), base learning rate (multiply and divide the baseline by 2), schedule (fixed vs. step decay vs. cosine), warmup steps (none vs. a few hundred), and patience/min_delta to see sensitivity. Your ablation table can be simple (a markdown table or CSV) but must be consistent. Recommended columns: run id, batch size, base lr, schedule, warmup steps, patience/min_delta, best val metric, step of best metric, and notes (e.g., “diverged at step 80,” “plateau then improved after decay”). This is the tuning log made actionable: it lets you justify choices and reproduce the best run without guessing.
Finally, define “fast and stable” concretely. For example: reach within 1% of best validation score within X steps, with no divergence events, and with variance in the loss curve below a chosen threshold. When you can state these criteria and point to the run that meets them, you’re no longer just training—you’re engineering an optimizer configuration.
1. If your gradient descent code is correct but training still diverges or bounces forever, which set of controls does Chapter 3 emphasize adjusting first?
2. Why does Chapter 3 have you run batch gradient descent, SGD, and mini-batch side-by-side on the same dataset?
3. What is the main purpose of tuning learning rates with a repeatable sweep rather than guessing values ad hoc?
4. According to Chapter 3, what is the role of learning-rate schedules and warmup?
5. Which stopping approach best matches the chapter’s guidance to respect validation behavior rather than 'wishful thinking'?
Gradient descent often “fails” for reasons that are not bugs in your derivatives or Python loops. The most common culprits are geometric: the loss surface is stretched, tilted, or flattened in ways that make a single learning rate behave poorly across dimensions. This chapter builds practical intuition for conditioning (how curved the surface is in different directions), shows why poorly scaled features can slow or break optimization, and demonstrates how feature scaling and L2 regularization change gradients in your favor.
You will implement standardization, compare optimization trajectories before/after scaling, and add L2 regularization to see its direct effect on update magnitudes. You will also learn stability checks—gradient norms, clipping, and checkpointing—to avoid exploding steps. Finally, you will explore saddle points and flat regions with simple demos and learn what to plot when a model seems “stuck.”
Throughout, treat optimization like engineering: measure, diagnose, then change one thing at a time (scaling, learning rate, regularization, batch size, momentum) and re-measure. The goal is not only convergence, but predictable convergence.
Practice note for Show why poorly scaled features slow or break optimization: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement standardization and compare trajectories: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add L2 regularization and see its effect on gradients: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Explore saddle points and flat regions with simple demos: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: fix a “stuck” model using scaling + regularization + diagnostics: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
When gradient descent zig-zags across a valley, it is reacting to different curvature along different directions. Imagine a quadratic bowl where one axis is steep and the other is shallow. The gradient points mostly toward the steep wall, so you step across the valley, bounce to the other side, and repeat. Progress along the shallow direction is slow because each step is constrained by the steep direction: if the learning rate is large enough to move quickly along the shallow axis, it will overshoot and diverge along the steep axis.
This is conditioning. For a twice-differentiable loss, local curvature is captured by the Hessian; the ratio of largest to smallest curvature (often approximated by eigenvalues) determines how difficult it is for plain gradient descent. High condition numbers mean you need tiny learning rates for stability, which leads to slow progress in “easy” directions. In linear regression with MSE, poor conditioning often comes directly from feature scales or correlated features.
Practical workflow: (1) suspect conditioning when the loss decreases very slowly despite stable gradients, or when the path oscillates. (2) Verify by plotting parameter trajectories on a contour plot for 2D problems, or by tracking per-parameter update magnitudes for higher dimensions. (3) Apply feature scaling first, then consider momentum/Adam if oscillations persist.
Poorly scaled features are the fastest way to slow or break optimization. If one feature ranges in thousands (e.g., income) and another ranges in tenths (e.g., ratio), a single learning rate produces very different effective step sizes for their corresponding weights. The result is the classic zig-zag: the optimizer overreacts to large-scale features and underreacts to small-scale ones.
Implement standardization (z-scoring) as a preprocessing step: for each feature column x, compute mean μ and standard deviation σ on the training set, then transform to (x−μ)/(σ+ε). Store μ and σ and reuse them for validation/test. In from-scratch code, keep scaling separate from the model so your gradient logic stays clean.
To compare trajectories, run batch gradient descent on the same linear regression problem twice: once on raw features and once on standardized features, using the same initial weights and learning rate. Track loss vs iteration and optionally the weight vector norm. You should see that standardized features allow a larger stable learning rate and reach a lower loss faster. If you plot the 2D contour for a toy two-feature dataset, the standardized case turns a long thin ellipse into a more circular bowl, dramatically reducing oscillation.
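A sketch of this experiment, assuming a small Standardizer helper (the class name and the toy feature scales are illustrative):

```python
import numpy as np

class Standardizer:
    """Z-scoring: fit mu/sigma on the training set, reuse everywhere else."""
    def fit(self, X, eps=1e-8):
        self.mu = X.mean(axis=0)
        self.sigma = X.std(axis=0) + eps
        return self

    def transform(self, X):
        return (X - self.mu) / self.sigma

def batch_gd(X, y, lr, steps=200):
    """Plain batch GD on linear regression; returns weights and final MSE."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * X.T @ (X @ w - y) / len(y)
    return w, float(np.mean((X @ w - y) ** 2))

# Two features on wildly different scales (std 100 vs. std 0.1)
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(0, 100, 500), rng.normal(0, 0.1, 500)])
y = 0.01 * X[:, 0] + 2.0 * X[:, 1] + 0.01 * rng.normal(size=500)

# Raw features force a tiny lr (anything near 2e-4 diverges here), so the
# small-scale weight barely trains; standardized features tolerate lr=0.1.
_, loss_raw = batch_gd(X, y, lr=1e-4)
scaler = Standardizer().fit(X)
_, loss_std = batch_gd(scaler.transform(X), y, lr=0.1)
```

On this toy problem the raw run sits near the stability limit set by the large-scale feature while the small-scale weight stays untrained, whereas the standardized run converges to the noise floor with a learning rate a thousand times larger.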
L2 regularization (weight decay) is usually introduced as a generalization technique, but it also improves optimization stability. For an objective like J(w)=Loss(w)+ (λ/2)||w||², the gradient becomes ∇J(w)=∇Loss(w)+λw. This extra λw term continuously pulls weights toward zero, discouraging large parameter values that can amplify gradients, especially in poorly scaled or mildly ill-conditioned problems.
Implementation from scratch is straightforward. In linear regression with MSE, if your data gradient is (1/n)Xᵀ(Xw−y), then add λw (but typically exclude the bias term from regularization). In logistic regression, add the same λw to the gradient of the cross-entropy loss. Keep λ as a hyperparameter; typical starting points are 1e−4 to 1e−1 depending on scaling and dataset size.
What you should observe experimentally: with λ>0, the gradient norms often shrink earlier, the training loss may decrease a bit more slowly initially, but updates become less erratic and the final solution is less sensitive to learning-rate choice. If your model “blows up” (weights grow without bound, loss becomes NaN), a modest λ can prevent runaway weights while you fix the underlying scaling issue.
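In code, the change is one extra term. A sketch (the bias-in-last-column convention is an assumption of this example, not a requirement):

```python
import numpy as np

def mse_grad_l2(w, X, y, lam):
    """Gradient of MSE + (lam/2)||w||^2, excluding the bias term.

    Assumes X's final column is the all-ones bias column, so w[-1]
    is the bias and is not regularized.
    """
    g = X.T @ (X @ w - y) / len(y)     # data gradient: (1/n) X^T (Xw - y)
    reg = lam * w                      # L2 term pulls weights toward zero
    reg[-1] = 0.0                      # never decay the bias
    return g + reg

X = np.array([[1.0, 2.0, 1.0],
              [3.0, -1.0, 1.0]])       # last column is the bias feature
y = np.array([0.5, 1.5])
w = np.array([1.0, -2.0, 0.5])
g0 = mse_grad_l2(w, X, y, lam=0.0)
g1 = mse_grad_l2(w, X, y, lam=0.1)
```

The difference g1 - g0 is exactly lam * w with a zeroed bias entry, which is an easy sanity check to keep in your test suite.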
When optimization is unstable, you need instrumentation. Start by logging the gradient norm ||g||₂ each step (or each epoch for batch GD). A sudden jump in ||g|| often precedes divergence. Also log the parameter norm ||w||₂ and the update norm ||Δw||₂=η||g||₂ (or its optimizer-specific equivalent). These three signals quickly distinguish: (a) gradients exploding, (b) learning rate too large, (c) weights drifting due to regularization settings or data issues.
Gradient clipping is a practical safety mechanism: if ||g||₂ > c, rescale g ← g * (c/||g||₂). In plain regression problems, clipping is rarely the best “first fix” (scaling is), but clipping is valuable when you are prototyping and want to avoid NaNs while diagnosing. Choose c relative to typical gradient norms; you can set c to the 95th percentile of observed ||g|| in a stable run, or start with something like 1.0–10.0 after standardization.
Add stability checks to your loop: stop if loss is NaN/inf, if ||w|| grows beyond a sane threshold, or if loss increases for K consecutive steps (useful with deterministic batch GD). Save checkpoints: store best weights so far (lowest validation loss, or lowest training loss if no validation) and restore them if divergence occurs. This turns “I lost the run” into “I learned exactly when and why it failed.”
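These safeguards are a few lines each. A sketch (function names and thresholds are illustrative):

```python
import numpy as np

def clip_gradient(g, c):
    """Global-norm clipping: if ||g||_2 > c, rescale g to have norm c."""
    norm = float(np.linalg.norm(g))
    if norm > c:
        g = g * (c / norm)
    return g, norm

def check_stability(loss, w, history, max_w_norm=1e4, k_increases=5):
    """Return a reason string if training should halt, else None."""
    if not np.isfinite(loss):
        return "loss is NaN/inf"
    if np.linalg.norm(w) > max_w_norm:
        return "parameter norm exploded"
    history.append(loss)
    if len(history) > k_increases and all(
        history[-i] > history[-i - 1] for i in range(1, k_increases + 1)
    ):
        return f"loss increased {k_increases} steps in a row"
    return None

g, norm = clip_gradient(np.array([3.0, 4.0]), c=1.0)   # norm 5 -> clipped
```

Logging the returned pre-clip norm alongside the clipped update gives you both signals the section recommends: how large gradients really were, and what step you actually took.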
Even with perfect scaling, optimization can feel “stuck” because non-convex landscapes contain saddle points and flat plateaus. A saddle point has zero gradient but is not a minimum: curvature is positive in some directions and negative in others. In higher dimensions, saddles are more common than poor local minima, so a near-zero gradient does not guarantee you are done.
Build a simple demo to see this behavior. For example, optimize f(x,y)=x²−y² (a saddle at the origin) or f(x,y)=x⁴+y⁴ (very flat near zero). With small learning rates, you will see slow movement in flat regions; with larger learning rates, you may escape but risk instability in steeper areas. In practical models, plateaus often appear when features are redundant or when predictions saturate (e.g., logistic outputs near 0/1), yielding tiny gradients.
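A minimal version of these demos, recording the trajectory so you can inspect or plot it (the helper name gd_path is illustrative):

```python
import numpy as np

def gd_path(grad, w0, lr, steps):
    """Run plain GD on a 2D toy surface and record every iterate."""
    w = np.array(w0, dtype=float)
    path = [w.copy()]
    for _ in range(steps):
        w -= lr * grad(w)
        path.append(w.copy())
    return np.array(path)

# Saddle f(x, y) = x^2 - y^2: gradient (2x, -2y), zero gradient at origin.
saddle_grad = lambda w: np.array([2 * w[0], -2 * w[1]])
path = gd_path(saddle_grad, [1.0, 1e-3], lr=0.1, steps=50)

# Plateau f(x, y) = x^4 + y^4: gradient 4w^3, extremely flat near zero.
plateau_grad = lambda w: 4 * w ** 3
slow = gd_path(plateau_grad, [0.5, 0.5], lr=0.1, steps=50)
```

On the saddle, the x coordinate collapses toward zero while a tiny perturbation in y grows exponentially; on the plateau, fifty steps barely move the iterate even though the gradient code is perfectly correct.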
How to respond: (1) verify it is truly a plateau by checking gradient norms—are they near zero? (2) try a learning-rate schedule (reduce if oscillating; increase slightly if consistently tiny gradients and stable loss). (3) use momentum or Adam to accumulate small consistent gradients and traverse flat regions more effectively. (4) consider L2 regularization: it can reshape the landscape and discourage drifting along nearly-flat directions. Importantly, scaling still matters; flatness can be an artifact of one feature dominating curvature.
When a model is stuck, plots give you answers faster than more hyperparameter guesses. For low-dimensional toy problems (two parameters, or two features with fixed bias), draw contour lines of the loss and overlay the optimization trajectory (w₁,w₂ over iterations). Poor scaling shows up as long thin contours and a bouncing path; after standardization, contours become more circular and the path becomes smoother and more direct.
For real problems with many parameters, replace contour plots with time-series diagnostics: loss vs iteration, gradient norm vs iteration, update norm vs iteration, and (optionally) per-layer norms if you later extend to neural networks. Combine these with residual plots: for regression, plot residuals y−ŷ against ŷ or against a key feature. If residual variance grows with feature scale, you may have unscaled inputs or targets; if residuals show patterns, the model may be misspecified and optimization improvements alone will not fix it.
Checkpoint exercise (the “stuck” model fix): start with a linear or logistic regression trained with mini-batch GD that shows either oscillation (loss up/down) or stagnation (loss barely decreases). Apply a three-step intervention: (1) standardize features (and optionally scale the target for regression), (2) add L2 regularization excluding the bias term, (3) add diagnostics—log gradient norms, enable early stopping on validation loss, and checkpoint the best weights. Re-run with the same seed. The practical outcome should be a smoother loss curve, fewer unstable updates, and improved reproducibility across learning rates.
1. Why can a single learning rate work poorly when features are poorly scaled?
2. What outcome best indicates that standardization improved optimization behavior?
3. What is the direct effect of adding L2 regularization on gradients during training?
4. A model seems “stuck” with little loss improvement. Based on the chapter, which explanation is most consistent?
5. Which approach best matches the chapter’s recommended engineering workflow for fixing unstable or stalled training?
In earlier chapters you implemented vanilla gradient descent and learned to debug learning rates, curvature issues, and noisy gradients. This chapter upgrades your optimizer toolkit so you can make progress when plain updates stall, zig-zag, or explode. We will build momentum, Nesterov acceleration, RMSProp, and Adam from scratch, then learn how to benchmark them fairly so your conclusions are reproducible and evidence-based.
The theme is simple: vanilla gradient descent uses only the current gradient to decide the next step. Momentum adds memory (a running “velocity”), and adaptive methods rescale each parameter’s step size based on the history of gradient magnitudes. These techniques can dramatically speed up convergence on ill-conditioned problems, stabilize training with mini-batches, and reduce the amount of learning-rate tuning you need. They can also fail in predictable ways if you ignore numerical stability, bias correction, or evaluation fairness.
Throughout the chapter, treat each optimizer as a small, testable component. You will implement a single step(params, grads) interface, log update norms, and checkpoint states (velocity, moving averages) so training can resume without changing behavior.
Practice note for Implement momentum and compare against vanilla GD: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add Nesterov acceleration and interpret the lookahead step: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Implement RMSProp and Adam with bias correction: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Benchmark optimizers across tasks and hyperparameter settings: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Checkpoint: pick the right optimizer for a scenario and justify it with evidence: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Momentum addresses a common failure mode of vanilla GD: when the loss surface is shaped like a long, narrow valley, gradients point steeply across the valley and only weakly along it. Vanilla GD bounces side-to-side, wasting steps. Momentum keeps a “velocity” vector that accumulates consistent gradient directions and damps oscillations.
A practical way to view momentum is as an exponential moving average (EMA) of gradients. With parameters w, gradient g_t, learning rate lr, and momentum coefficient beta (often 0.9), the classical update is:
v_t = beta * v_{t-1} + (1 - beta) * g_t
w_{t+1} = w_t - lr * v_t
Some libraries omit (1 - beta) and absorb scaling into lr. If you include it, the magnitude of v is comparable across different beta values, which makes tuning easier. Implementation detail: initialize v as zeros with the same shape as each parameter tensor, and store it inside the optimizer state.
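Putting this together as a small, stateful component gives a minimal sketch of the chapter's step(params, grads) interface (the dict-keyed parameters are an illustrative convention):

```python
import numpy as np

class Momentum:
    """Classical momentum as an EMA of gradients (the (1 - beta) form).

    The velocity v lives inside the optimizer so it can be checkpointed,
    and so a fresh experiment starts with a fresh optimizer (no leaked
    history across runs).
    """
    def __init__(self, lr=0.05, beta=0.9):
        self.lr, self.beta = lr, beta
        self.v = {}

    def step(self, params, grads):
        for k, g in grads.items():
            v = self.v.get(k, np.zeros_like(g))   # lazily init to zeros
            v = self.beta * v + (1 - self.beta) * g
            self.v[k] = v
            params[k] = params[k] - self.lr * v
        return params

# Narrow-valley quadratic: f(w) = 0.5 * (w0^2 + 25 * w1^2)
def grad(w):
    return {"w": np.array([1.0, 25.0]) * w["w"]}

params = {"w": np.array([1.0, 1.0])}
opt = Momentum(lr=0.05, beta=0.9)
for _ in range(300):
    params = opt.step(params, grad(params))
```

Keeping state inside the optimizer object makes the reset rule trivial: construct a new optimizer, and the velocity is guaranteed to be zero.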
To diagnose momentum, log the update norm ||lr * v||; momentum often reduces gradient-norm oscillations and produces smoother loss curves. A common pitfall is forgetting to reset v when you restart an experiment (or accidentally reusing it across models): that leaks history and makes comparisons invalid. If updates overshoot, reduce lr first; increasing beta can also worsen overshoot because velocity persists longer. Momentum is a strong default for batch and mini-batch GD. For pure SGD (batch size 1), it can help a lot, but you must watch for runaway velocity when gradients have heavy-tailed noise.
Nesterov accelerated gradient (NAG) modifies momentum by computing the gradient at a “lookahead” position, effectively asking: “If my velocity is about to move me, what gradient will I see there?” This often reduces overshooting and produces more responsive updates near minima.
In code, think in two stages. First compute a provisional lookahead parameter: w_look = w - lr * beta * v (sign conventions vary based on whether v stores an average gradient or an average step). Then compute the gradient at that lookahead: g = grad(loss(w_look)). Finally update the velocity and parameters using that gradient:
v = beta * v + (1 - beta) * g
w = w - lr * v
If your training loop separates forward/backward from optimizer stepping, Nesterov requires a small refactor: you either (a) temporarily shift parameters before computing gradients, or (b) compute an equivalent “Nesterov form” update that uses the current gradient but adjusts the step. For learning purposes, the explicit lookahead is clearest and easiest to verify.
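The explicit lookahead can be sketched as follows, assuming v stores an EMA of gradients so the shift is lr * beta * v (the `nesterov_step` name and the demo are hypothetical):

```python
import numpy as np

def nesterov_step(w, grad_fn, state, lr=0.1, beta=0.9):
    # Explicit-lookahead Nesterov: evaluate the gradient where the
    # current velocity is about to carry us, then update as usual.
    if "v" not in state:
        state["v"] = np.zeros_like(w)
    w_look = w - lr * beta * state["v"]   # provisional lookahead position
    g = grad_fn(w_look)                   # gradient at the lookahead
    state["v"] = beta * state["v"] + (1 - beta) * g
    return w - lr * state["v"]

# Demo on the same quadratic bowl: the gradient of 0.5 * w @ w is w.
w = np.array([1.0, -2.0])
state = {}
for _ in range(200):
    w = nesterov_step(w, lambda x: x, state)
```

Note that the optimizer now needs a gradient *function*, not just a precomputed gradient; that is precisely the refactor discussed above.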
When you compare momentum and Nesterov, keep the same beta and lr at first; only then tune. Nesterov's advantage is frequently "smoother progress," not necessarily a lower final loss in a fixed number of steps, unless the problem is ill-conditioned or the learning rate is near the stability limit.
Momentum fixes directionality and noise averaging, but it still uses one global learning rate. Adaptive optimizers change the effective step size per parameter based on gradient history. This matters when different parameters experience gradients at very different scales (common with unnormalized features, sparse data, or deep networks where layers behave differently).
The core idea is per-parameter scaling: divide the gradient (or velocity) by a running estimate of its magnitude. If a parameter’s gradients are consistently large, its step size is reduced; if they are small, its step size is increased. The simplest form maintains an EMA of squared gradients, s_t:
s_t = rho * s_{t-1} + (1 - rho) * (g_t * g_t)
w_{t+1} = w_t - lr * g_t / (sqrt(s_t) + eps)
This is the conceptual bridge to RMSProp and Adam. Two engineering principles matter immediately:
- Manage state carefully: s has the same shape as w. Store it alongside velocity (if used) and checkpoint it. If you restore parameters without restoring s, the effective learning rates change abruptly and training may spike.
- Mind the units: sqrt(s) has the same units as the gradient, so the ratio g / sqrt(s) becomes roughly unitless, giving you more consistent step sizes across parameters.

Adaptive methods often reduce learning-rate sensitivity, but they are not "set and forget." They can converge to different solutions than SGD with momentum, and they can generalize worse on some tasks. Use them when optimization is the bottleneck (loss won't go down reliably), and consider switching to SGD+momentum for final fine-tuning if generalization is a priority.
RMSProp is a practical adaptive optimizer that fixes a weakness of earlier Adagrad-style methods: Adagrad’s accumulator of squared gradients grows without bound, shrinking learning rates toward zero. RMSProp replaces the unbounded sum with an EMA, keeping the scale responsive over time.
From scratch, implement RMSProp with three components: (1) an EMA decay rho (commonly 0.9 or 0.99), (2) a global lr, and (3) a small eps added for numerical stability:
s = rho * s + (1 - rho) * (g * g)
w = w - lr * g / (sqrt(s) + eps)
The eps term is not optional. Without it, parameters with s ≈ 0 (for example, early in training or in sparse gradients) can produce extremely large steps or division-by-zero errors. In practice, eps is also a “floor” that prevents tiny denominators from amplifying noise.
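A minimal RMSProp step under the conventions above (the `rmsprop_step` helper and the ill-scaled demo are illustrative, not course code):

```python
import numpy as np

def rmsprop_step(w, g, state, lr=0.01, rho=0.9, eps=1e-8):
    # EMA of squared gradients; eps floors the denominator.
    if "s" not in state:
        state["s"] = np.zeros_like(w)
    state["s"] = rho * state["s"] + (1 - rho) * (g * g)
    return w - lr * g / (np.sqrt(state["s"]) + eps)

# Ill-scaled quadratic: gradients differ by 100x across coordinates,
# yet per-parameter scaling keeps both coordinates moving at similar rates.
scales = np.array([100.0, 1.0])
w = np.array([1.0, 1.0])
state = {}
for _ in range(500):
    w = rmsprop_step(w, scales * w, state)
```

On the very first step both coordinates take the same-sized update despite the 100x gradient gap, behavior vanilla GD with one global lr cannot produce.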
Common pitfalls and checks:
- Avoid an eps that is too small for float32 training. Values like 1e-8 are common in deep learning, but for some problems (or with poorly scaled inputs) you may need 1e-7 or 1e-6 to avoid jitter.
- Log the minimum of sqrt(s). If the minimum is near zero for many steps, your effective step sizes may be dominated by eps, indicating either sparse gradients or a learning rate that is too high/low for the current scaling.

To compare RMSProp against momentum fairly, keep your preprocessing and regularization identical. If RMSProp wins only when features are unscaled, that's a sign feature scaling was the real fix; the optimizer just compensated.
Adam combines momentum (EMA of gradients) and RMSProp-style scaling (EMA of squared gradients). It is popular because it usually works “out of the box,” but to implement it correctly you must include bias correction. EMAs initialized at zero are biased toward zero at early timesteps; bias correction removes this transient underestimation so early updates are not artificially small.
Adam maintains:
m = beta1 * m + (1 - beta1) * g
v = beta2 * v + (1 - beta2) * (g * g)
m_hat = m / (1 - beta1^t)
v_hat = v / (1 - beta2^t)
w = w - lr * m_hat / (sqrt(v_hat) + eps)
Defaults are often beta1=0.9, beta2=0.999, eps=1e-8. In your from-scratch version, be explicit about the timestep t and increment it once per parameter update (not once per epoch). Checkpoint t as well; forgetting it breaks bias correction on resume.
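Those five lines translate almost directly to NumPy. The sketch below (a hypothetical `adam_step` helper) makes the per-update timestep and bias correction explicit:

```python
import numpy as np

def adam_step(w, g, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # t increments once per parameter update and lives in optimizer
    # state, so resuming from a checkpoint keeps bias correction intact.
    if not state:
        state.update(m=np.zeros_like(w), v=np.zeros_like(w), t=0)
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * g
    state["v"] = beta2 * state["v"] + (1 - beta2) * (g * g)
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

# Bias-correction check: the very first update has magnitude ~lr,
# instead of being shrunk by the zero-initialized EMAs.
state = {}
w1 = adam_step(np.array([1.0]), np.array([2.0]), state)
```

Deleting the two `_hat` lines and stepping with `m` and `v` directly reproduces the "artificially small early updates" failure described above, which makes this a good target for a regression test.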
Practical cautions:
- Adam can tolerate a too-large lr for a while, then suddenly destabilize when v adapts. Watch update norms and consider gradient clipping when experimenting near stability limits.
- If progress stalls, try lowering lr, adding weight decay correctly (prefer decoupled weight decay, "AdamW"), or switching optimizers for the final phase.
- A common default is lr=1e-3 for Adam, but don't treat it as sacred. For some losses and feature scalings, 3e-4 or 1e-4 is more stable.

If your earlier gradient-checking infrastructure is in place, reuse it: Adam's math is simple, but implementation bugs usually come from shape/broadcast errors, missing bias correction, or not storing optimizer state per parameter.
Choosing the “right optimizer” is an evidence problem, not a preference. A fair benchmark controls compute budget, randomness, and evaluation metrics so that momentum, Nesterov, RMSProp, and Adam are compared on equal footing across tasks and hyperparameter settings.
Start by defining the budget: either (a) a fixed number of parameter updates (best when comparing batch sizes), or (b) a fixed wall-clock time (best when implementations differ in cost). Adaptive methods add extra per-parameter operations, so “same epochs” is often misleading: two runs with the same epochs can represent different numbers of updates and different compute.
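The epochs-versus-updates distinction is easy to quantify. This tiny sketch (hypothetical dataset and epoch numbers) shows why "same epochs" is not "same compute":

```python
# Same epoch budget, very different update counts once batch size changes.
n_examples, n_epochs = 1024, 10

def n_updates(batch_size):
    # Parameter updates per run: full batches per epoch times epochs.
    return n_epochs * (n_examples // batch_size)

full_batch = n_updates(1024)  # one update per epoch
mini_batch = n_updates(32)    # 32 updates per epoch
```

Here the mini-batch run performs 32x more parameter updates for the "same" ten epochs, so comparing optimizers at equal epochs silently hands one of them a larger budget.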
Two more rules keep the comparison honest:
- Log more than loss: track update norms and effective step sizes (for Adam, 1/(sqrt(v_hat)+eps) summaries). These reveal whether an optimizer is progressing or merely producing smaller steps.
- Tune every optimizer with equal effort: if you sweep lr for Adam but not for momentum, you are benchmarking your tuning effort, not the algorithm. Use comparable search ranges and the same early-stopping rule.

For the checkpoint decision in real projects, write down the scenario and the evidence. Example justifications: "Mini-batch gradients are noisy and the loss is non-stationary; RMSProp reduced oscillations and reached the target loss in half the updates," or "SGD+Nesterov matched Adam's training loss but generalized better at equal compute." By the end of this chapter, your optimizer choice should be a reproducible experiment: code, seeds, curves, and a clear statement of why one method fits the problem's constraints.
1. Why can momentum converge faster than vanilla gradient descent on ill-conditioned problems?
2. What is the key idea behind Nesterov acceleration compared to standard momentum?
3. How do adaptive optimizers like RMSProp and Adam differ from vanilla gradient descent in how they choose step sizes?
4. Why is bias correction important when implementing Adam from scratch?
5. Which practice best supports fair and reproducible benchmarking of optimizers across tasks and hyperparameters?
This capstone chapter turns your gradient descent knowledge into a complete, reproducible training workflow. You will build two models from scratch—logistic regression and a tiny 2-layer MLP—train them using a unified optimizer interface, and validate your derivatives with numerical gradient checking. The goal is not merely to “get it to run,” but to make it debuggable: you should be able to explain why training is slow, why it diverges, why it overfits, and what intervention fixes it.
We will also adopt professional habits: consistent data splits, fixed random seeds, logging, and plots that reveal optimization behavior. By the end, you will produce a small training report that includes learning curves, a calibration sanity check, and conclusions about which optimizer and learning-rate strategy worked best for your setup.
Throughout, use the same dataset and preprocessing so comparisons are meaningful. A classic choice is a binary classification problem with standardized features (mean 0, variance 1). If you already have a dataset from earlier chapters, reuse it—this chapter is about process and correctness as much as it is about performance.
Practice note for Build logistic regression training with cross-entropy loss: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Add a tiny MLP and train with your custom optimizer interface: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Run gradient checking on a subset to validate backprop: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Create a debugging playbook for divergence and overfitting: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Practice note for Final checkpoint: deliver a reproducible training report with plots and conclusions: document your objective, define a measurable success check, and run a small experiment before scaling. Capture what changed, why it changed, and what you would test next. This discipline improves reliability and makes your learning transferable to future projects.
Start with logistic regression because it is the simplest nontrivial end-to-end training pipeline: a linear model, a sigmoid, and cross-entropy loss. This baseline is your “truth serum”: if you cannot make logistic regression converge reliably, the issue is almost always in data handling, learning rate, or loss/gradient implementation.
Let features be X (shape [N, D]), labels y in {0,1} (shape [N]), weights w (shape [D]), and bias b (scalar). Compute logits z = X @ w + b and probabilities p = sigmoid(z). Use a numerically stable sigmoid (e.g., branching on sign or using np.clip on logits) to avoid overflow. The average cross-entropy loss is:
L = -mean( y * log(p) + (1-y) * log(1-p) )
The key gradient identity is that for cross-entropy with sigmoid, the derivative simplifies: dL/dz = (p - y) / N. Then:
grad_w = X.T @ (p - y) / N
grad_b = sum(p - y) / N

Common implementation mistakes: mixing shapes (treating y as a column vs. a flat vector), forgetting the 1/N averaging factor (which affects the learning-rate scale), and taking log(0) because p hits exactly 0 or 1. Fix the last issue by computing log(p + eps) and log(1-p + eps) with a small eps (e.g., 1e-12) or by clamping p into [eps, 1-eps].
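A minimal sketch of the loss-and-gradient computation, assuming the stable-sigmoid and clamping fixes described above (helper names like `stable_sigmoid` are illustrative):

```python
import numpy as np

def stable_sigmoid(z):
    # Branch on sign so exp never receives a large positive argument.
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def loss_and_grads(X, y, w, b, eps=1e-12):
    N = X.shape[0]
    p = stable_sigmoid(X @ w + b)
    p = np.clip(p, eps, 1 - eps)            # avoid log(0)
    loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    dz = (p - y) / N                        # sigmoid + cross-entropy identity
    return loss, X.T @ dz, dz.sum()

# Balanced labels, zero weights: initial loss should be log(2) ~ 0.693.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = (np.arange(100) % 2).astype(float)
loss, gw, gb = loss_and_grads(X, y, np.zeros(3), 0.0)
```

The zero-weight starting loss of log(2) for balanced labels is the same sanity check used in the debugging playbook later in this chapter.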
Before any fancy optimizers, confirm that batch gradient descent (full dataset each step) decreases loss monotonically at a conservative learning rate (e.g., 0.1 with standardized features, but tune). Then test stochastic and mini-batch modes; you should see noisier loss curves but faster “wall clock” progress per epoch. This gives you a clean reference point for the rest of the chapter.
Next, add a tiny MLP to exercise backpropagation while keeping the graph small enough to reason about. Use a 2-layer network: an input-to-hidden affine layer, a nonlinearity, then hidden-to-output affine, then sigmoid + cross-entropy for binary classification. For example: hidden size H=16 or 32.
Forward pass (mini-batch size B):
- a1 = X @ W1 + b1 with W1: [D,H], b1: [H]
- h = relu(a1) (or tanh if you want smoother gradients)
- z2 = h @ W2 + b2 with W2: [H,1], b2: [1]
- p = sigmoid(z2); the loss is the batch-mean cross-entropy

Backward pass: reuse the logistic regression simplification at the output. With dz2 = (p - y)/B (shape [B,1]):

- grad_W2 = h.T @ dz2, grad_b2 = sum(dz2, axis=0)
- dh = dz2 @ W2.T
- da1 = dh * relu'(a1), where relu'(a1) = 1 if a1 > 0 else 0
- grad_W1 = X.T @ da1, grad_b1 = sum(da1, axis=0)

Engineering judgement: start with ReLU for speed, but know the failure mode: dead ReLUs appear if the learning rate is too high or initialization shifts a1 negative for most samples. If you see training stall early with many zero activations, reduce the learning rate, use He initialization (std = sqrt(2/D)), or try tanh as a diagnostic (tanh is less likely to "die," though it can saturate).
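The forward and backward passes above can be sketched as one function (the `mlp_grads` name and the tiny shape demo are illustrative; a production version would use the stable sigmoid from earlier):

```python
import numpy as np

def mlp_grads(X, y, W1, b1, W2, b2):
    # Forward pass for the 2-layer MLP described above.
    B = X.shape[0]
    a1 = X @ W1 + b1                  # [B, H]
    h = np.maximum(a1, 0.0)           # ReLU
    z2 = h @ W2 + b2                  # [B, 1]
    p = 1.0 / (1.0 + np.exp(-z2))     # sigmoid (use a stable variant in practice)
    # Backward pass, reusing the sigmoid + cross-entropy simplification.
    dz2 = (p - y.reshape(-1, 1)) / B
    grad_W2 = h.T @ dz2
    grad_b2 = dz2.sum(axis=0)
    dh = dz2 @ W2.T
    da1 = dh * (a1 > 0)               # relu'(a1)
    grad_W1 = X.T @ da1
    grad_b1 = da1.sum(axis=0)
    return grad_W1, grad_b1, grad_W2, grad_b2

# Shape check with D=3, H=4, B=5: each gradient must match its parameter.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))
y = rng.integers(0, 2, 5).astype(float)
W1 = rng.standard_normal((3, 4)); b1 = np.zeros(4)
W2 = rng.standard_normal((4, 1)); b2 = np.zeros(1)
gW1, gb1, gW2, gb2 = mlp_grads(X, y, W1, b1, W2, b2)
```

Checking that every gradient's shape matches its parameter's shape is the cheapest test in the pipeline and catches most broadcast bugs before any training run.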
Keep the first MLP intentionally small. The goal is not state-of-the-art accuracy; it is building a backprop pipeline you trust. Once gradients are correct and training is stable, scaling up is straightforward.
To compare batch, stochastic, mini-batch, and advanced optimizers fairly, you need one training loop that does not care which model or optimizer it drives. A practical pattern is: models expose parameters and gradients; optimizers update parameters in-place given those gradients. This makes it easy to swap SGD, Momentum, Nesterov, RMSProp, and Adam without rewriting training code.
Recommended interfaces:
- Model: forward(X), loss_and_gradients(X, y), params() returning a dict of arrays, and grads() returning matching dicts
- Optimizer: step(params, grads) and an optional zero_state() / state dict keyed by parameter name

Your training loop should be explicit about: shuffling, batching, loss aggregation, evaluation mode, and stopping. A robust skeleton is: (1) set seed, (2) split train/val, (3) standardize using train statistics only, (4) for each epoch: shuffle indices, iterate mini-batches, compute loss+grads, call optimizer.step, log metrics, then run a validation pass.
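A minimal optimizer conforming to that interface might look like this (the class body is a sketch; only the step(params, grads) shape comes from the text):

```python
import numpy as np

class SGD:
    """Vanilla SGD conforming to the dict-based step(params, grads) interface."""
    def __init__(self, lr=0.1):
        self.lr = lr

    def step(self, params, grads):
        # Update each parameter array in-place, keyed by name.
        for name in params:
            params[name] -= self.lr * grads[name]

params = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
grads = {"w": np.array([1.0, 2.0]), "b": np.array([0.5])}
SGD(lr=0.1).step(params, grads)
```

Momentum, Nesterov, RMSProp, and Adam slot into the same interface by adding per-parameter state dicts keyed by the same names, so the training loop never changes.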
Include learning-rate schedules as first-class objects. Even a simple step decay (drop LR by factor 0.1 after plateau) can turn a “noisy but stuck” run into convergence. For stopping criteria, combine a max-epoch limit with one stability rule: e.g., stop if validation loss has not improved for K epochs (early stopping). If you use regularization (L2 weight decay), implement it consistently: either add lambda * w to gradients (classic) or use decoupled weight decay in Adam-like optimizers (more modern). Don’t mix approaches accidentally.
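Early stopping as a first-class object is small enough to show in full. This sketch (a hypothetical `EarlyStopping` class) stops after `patience` non-improving validation epochs:

```python
class EarlyStopping:
    """Stop when validation loss hasn't improved for `patience` epochs."""
    def __init__(self, patience=5):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def update(self, val_loss):
        # Returns True when training should stop.
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.7, 0.7, 0.71, 0.72, 0.73]
stopped_at = next(i for i, L in enumerate(losses) if stopper.update(L))
```

In the demo the best loss (0.7) is reached at epoch 2, and the rule fires at epoch 5 after three epochs without improvement; combining this with a max-epoch limit gives the two-part stopping criterion described above.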
Finally, ensure you can switch between batch, stochastic, and mini-batch by changing only batch_size. If changing batch size requires code changes elsewhere, debugging will be harder and comparisons will be misleading.
“It trains” is not enough; you need instrumentation that explains how it trains. At minimum, log per-epoch training loss, validation loss, training accuracy, and validation accuracy. Plot them. A stable run typically shows training loss decreasing smoothly (or noisily for SGD) and validation loss decreasing then flattening. When validation loss rises while training loss continues to fall, you are overfitting.
Add two deeper monitors that catch subtle issues:
- Track ||grad|| and ||param|| per layer. Exploding norms signal a too-large learning rate, missing averaging by batch size, or a bug in backprop. Vanishing norms may indicate saturation (sigmoid/tanh), dead ReLUs, or overly strong regularization.
- Watch for metric disagreements. If loss decreases but accuracy stagnates, the model may be improving probability estimates around the decision boundary without flipping many predicted labels. That can be okay, but it can also mean your threshold (0.5) is inappropriate for class imbalance; log precision/recall if imbalance exists.
Keep plots tied to experimental settings. Every run should record: optimizer, learning rate, schedule, batch size, regularization strength, initialization, and seed. Without this, plots become decoration rather than tools for engineering decisions.
When training fails, guessing is expensive. Use a fixed playbook that narrows the search space quickly. Work from the outside inward: data → loss → gradients → optimizer → hyperparameters → code structure.
- Check the initial loss: it should be ~0.69 for balanced binary labels (log(2)). If it is nan or huge, your sigmoid/log is unstable or inputs are unscaled.
- Run a numerical gradient check: compare each analytic gradient against the central difference (L(theta+eps) - L(theta-eps)) / (2*eps) with eps=1e-5. Check a handful of random parameter entries per tensor. The relative error |g - g_num| / max(1, |g|, |g_num|) should be small (often 1e-4 to 1e-6 depending on eps and dtype).

Overfitting debugging is its own branch: if training improves but validation degrades, try (1) stronger L2 regularization, (2) early stopping, (3) smaller hidden size, and (4) more data or data augmentation (if applicable). Don't "fix" overfitting by lowering learning rate alone; that often just slows the same trajectory.
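The central-difference check can be packaged as a reusable helper. This sketch (a hypothetical `grad_check` function) samples a few entries per tensor and reports the worst relative error:

```python
import numpy as np

def grad_check(loss_fn, theta, analytic_grad, n_checks=5, eps=1e-5, seed=0):
    # Central-difference check on randomly chosen entries of theta.
    rng = np.random.default_rng(seed)
    max_rel_err = 0.0
    for _ in range(n_checks):
        i = rng.integers(theta.size)
        t = theta.copy(); t.flat[i] += eps
        lp = loss_fn(t)
        t.flat[i] -= 2 * eps
        lm = loss_fn(t)
        g_num = (lp - lm) / (2 * eps)
        g = analytic_grad.flat[i]
        rel = abs(g - g_num) / max(1.0, abs(g), abs(g_num))
        max_rel_err = max(max_rel_err, rel)
    return max_rel_err

# Sanity test on f(theta) = 0.5 * theta @ theta, whose gradient is theta.
theta = np.array([1.0, -2.0, 3.0])
err = grad_check(lambda t: 0.5 * t @ t, theta, theta)
```

Run this on the logistic regression gradients first, then on each MLP tensor; a large relative error in one tensor but not the others localizes the backprop bug immediately.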
Finally, audit code for silent bugs: accidentally reusing stale gradients, not resetting optimizer state between runs, mixing train/val in preprocessing, or using the wrong axis in reductions (a frequent source of shape-correct but numerically wrong gradients).
The final checkpoint is a reproducible training report that another person (or future you) can rerun and trust. Treat this as a deliverable: a single command should regenerate metrics and plots from scratch. In practice, this means controlling randomness, recording configuration, and saving artifacts.
Your report should include:
- Learning curves (training and validation loss and accuracy) for each configuration
- A calibration sanity check on the validation set
- Conclusions about which optimizer and learning-rate strategy worked best for your setup, with the evidence behind them
Save raw logs to a machine-readable format (CSV/JSON) and include the exact code version (git commit hash if possible). For reproducibility, fix seeds for NumPy and any other RNG you use, and record the Python and library versions. If you implement mini-batch shuffling, ensure the shuffle is seeded per run so you can reproduce a trajectory when debugging.
Next steps: extend the same framework to multiclass softmax regression, add batch normalization (to study conditioning), or compare learning-rate schedules (cosine decay, warmup). The important part is that you now have a disciplined optimization harness: correct gradients, consistent training loops, and a debugging methodology that scales as models get deeper and datasets get larger.
1. What is the main purpose of the capstone workflow in Chapter 6 beyond getting the code to run?
2. Why does Chapter 6 have you implement both logistic regression and a tiny 2-layer MLP?
3. What is the role of numerical gradient checking in this chapter?
4. Which practice is most aligned with making optimizer comparisons meaningful in Chapter 6?
5. What should the final deliverable training report include according to Chapter 6?