Deep Learning Optimizers: From SGD to AdamW

LDS Team
Let's Data Science

Training a neural network means finding millions of parameters that minimize a loss function. The optimizer determines how fast you get there, whether you get there at all, and how well the final model generalizes. Choose wrong, and your transformer stalls at a mediocre loss for days. Choose right, and you converge in half the compute budget.

Deep learning optimizers have evolved from simple gradient descent into algorithms that adapt per-parameter learning rates and decouple regularization from the update step. This article traces that evolution from vanilla SGD through AdamW (the optimizer that dominates production training in 2026), with every update rule, the intuition behind it, and PyTorch code you can drop into your next project.

We'll use one running example throughout: training a small 3-layer MLP (with ReLU activations) on a synthetic regression task, comparing how each optimizer handles the same loss surface.

Vanilla SGD and Its Fundamental Limitations

Stochastic Gradient Descent computes the gradient of the loss with respect to every parameter, then nudges each parameter in the opposite direction.

$$\theta_{t+1} = \theta_t - \eta \cdot \nabla_\theta \mathcal{L}(\theta_t)$$

Where:

  • $\theta_t$ is the parameter vector at step $t$
  • $\eta$ is the learning rate (a scalar, same for every parameter)
  • $\nabla_\theta \mathcal{L}(\theta_t)$ is the gradient of the loss with respect to $\theta$ at step $t$

In Plain English: Measure the slope of the loss surface at your current position, then take a step downhill proportional to the steepness. Steeper slope means a bigger step.

This works, but it has real problems. The same learning rate applies to every parameter, whether it's a frequently updated embedding or a rarely activated bias term. On elongated loss surfaces (common in deep networks), SGD oscillates across the narrow dimension while crawling along the wide one, making convergence painfully slow.

python
import torch
import torch.nn as nn

# Running example: 3-layer MLP on synthetic regression (ReLU activations)
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1)
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

For our MLP example, vanilla SGD with lr=0.01 takes roughly 3x more epochs to converge than the adaptive methods ahead. SGD doesn't produce a worse final model (this connects to the bias-variance tradeoff); it just gets there slowly.
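To make the running example concrete, here is a minimal training loop (a sketch: the synthetic targets, sample count, and epoch count are invented for illustration, not taken from a benchmark):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(10, 64), nn.ReLU(),
    nn.Linear(64, 32), nn.ReLU(),
    nn.Linear(32, 1),
)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

# Synthetic regression: the target is the sum of the inputs plus noise
X = torch.randn(256, 10)
y = X.sum(dim=1, keepdim=True) + 0.1 * torch.randn(256, 1)

losses = []
for epoch in range(200):
    optimizer.zero_grad()          # clear gradients from the last step
    loss = loss_fn(model(X), y)    # forward pass
    loss.backward()                # backprop fills p.grad for each parameter
    optimizer.step()               # theta <- theta - lr * grad
    losses.append(loss.item())
```

Every optimizer in this article drops into this same loop; only the `optimizer = ...` line changes.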

SGD with Momentum Borrows from Physics

Momentum fixes SGD's oscillation problem by adding a "velocity" term that accumulates past gradients. Think of a ball rolling down a bowl: it builds speed in consistent directions and dampens side-to-side wobble.

$$v_t = \beta \cdot v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t)$$

$$\theta_{t+1} = \theta_t - \eta \cdot v_t$$

Where:

  • $v_t$ is the velocity (an exponentially decaying sum of past gradients) at step $t$
  • $\beta$ is the momentum coefficient, typically 0.9
  • $\nabla_\theta \mathcal{L}(\theta_t)$ is the current gradient
  • $\eta$ is the learning rate

In Plain English: Instead of reacting only to the current slope, the optimizer remembers which direction it has been heading. If gradients consistently point left, the ball picks up speed going left. If they oscillate, the opposing forces cancel out and the ball rolls smoothly toward the minimum.

With $\beta = 0.9$, the effective step size in a consistent gradient direction is roughly $\frac{\eta}{1 - \beta} = 10\eta$, giving the optimizer much more force along the dominant direction.
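The $10\eta$ figure is just the geometric series $\sum_k \beta^k = \frac{1}{1-\beta}$; a quick scalar check (plain Python, no framework needed):

```python
beta, g = 0.9, 1.0   # momentum coefficient and a constant unit gradient
v = 0.0
for _ in range(100):
    v = beta * v + g  # velocity update with an unchanging gradient
# v converges to g / (1 - beta) = 10: each step effectively moves 10x as far
print(round(v, 2))    # 10.0
```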

python
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

On our MLP example, adding momentum cuts training time by about 40% compared to vanilla SGD. The loss curve becomes noticeably smoother as oscillations across narrow dimensions get damped out.

Key Insight: Momentum is the single biggest improvement you can make to SGD. If someone says "SGD" in a modern context, they almost always mean SGD with momentum.

Nesterov Accelerated Gradient Looks Ahead

Nesterov momentum (1983) makes one clever modification: compute the gradient at the predicted next position instead of the current one. Since momentum will carry us toward $\theta_t - \eta \beta v_{t-1}$ anyway, we compute the gradient there instead.

$$v_t = \beta \cdot v_{t-1} + \nabla_\theta \mathcal{L}(\theta_t - \eta \beta \cdot v_{t-1})$$

$$\theta_{t+1} = \theta_t - \eta \cdot v_t$$

Where:

  • $\theta_t - \eta \beta \cdot v_{t-1}$ is the "lookahead" position
  • All other terms are the same as standard momentum

In Plain English: Before stepping, Nesterov momentum asks: "If I let my velocity carry me forward, what does the slope look like there?" If the gradient at the lookahead position says "you've gone too far," the correction kicks in earlier than with standard momentum.

The practical difference is modest but consistent: Nesterov converges 5-10% faster and overshoots less near the minimum.

python
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)

Adagrad Introduces Per-Parameter Learning Rates

Adagrad (Duchi et al., 2011) takes a different approach: instead of using the same learning rate for every parameter, it tracks how much each parameter's gradient has varied and scales the rate accordingly.

$$G_t = G_{t-1} + (\nabla_\theta \mathcal{L}(\theta_t))^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \cdot \nabla_\theta \mathcal{L}(\theta_t)$$

Where:

  • $G_t$ is the sum of squared past gradients (per-parameter, element-wise)
  • $\epsilon$ is a small constant (typically $10^{-8}$) to prevent division by zero
  • The division is element-wise: each parameter gets its own effective learning rate

In Plain English: Parameters that receive large, frequent gradients (common word embeddings) get their learning rate shrunk quickly. Parameters that receive small, infrequent gradients (rare word embeddings) keep a larger rate. This is ideal for sparse data, which made Adagrad a breakthrough for NLP.

The fatal flaw: $G_t$ only grows. As training progresses, the accumulated squared gradients get so large that the effective learning rate drops to near zero, and learning stops entirely. For deep networks that need many epochs, Adagrad is unusable on its own.
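The decay is easy to see with a constant unit gradient: the accumulator grows linearly, so the effective rate falls like $1/\sqrt{t}$. (PyTorch ships this optimizer as `torch.optim.Adagrad`; the scalar sketch below is for intuition only.)

```python
import math

eta, G = 0.01, 0.0
effective_lr = []
for t in range(1000):
    g = 1.0                    # constant gradient for illustration
    G += g ** 2                # the accumulator only ever grows
    effective_lr.append(eta / (math.sqrt(G) + 1e-8))

# Effective rate: ~0.01 at step 1, ~0.0003 by step 1000 -- and still falling
```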

RMSprop Fixes Adagrad's Decaying Learning Rate

RMSprop (Hinton, 2012) solves Adagrad's shrinking learning rate by using an exponential moving average of squared gradients instead of their sum.

$$E[g^2]_t = \gamma \cdot E[g^2]_{t-1} + (1 - \gamma) \cdot (\nabla_\theta \mathcal{L}(\theta_t))^2$$

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t} + \epsilon} \cdot \nabla_\theta \mathcal{L}(\theta_t)$$

Where:

  • $E[g^2]_t$ is the exponential moving average of squared gradients
  • $\gamma$ is the decay rate, typically 0.99
  • $\epsilon$ is a small constant (typically $10^{-8}$)

In Plain English: RMSprop forgets old squared gradients exponentially. If a parameter's gradients were huge 1000 steps ago but small now, the denominator reflects recent reality, not ancient history. The effective learning rate can increase again if gradients become small, something Adagrad could never do.

Hinton introduced RMSprop in Lecture 6e of his Coursera course, never as a formal paper. Despite this informal origin, it became one of the most widely used optimizers before Adam arrived.

python
optimizer = torch.optim.RMSprop(model.parameters(), lr=0.001, alpha=0.99)

Adam Combines Momentum with Adaptive Rates

Adam (Kingma and Ba, 2015) merges momentum and RMSprop. It maintains two running averages: the first moment (mean of gradients) and the second moment (mean of squared gradients).

$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$

$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$\theta_{t+1} = \theta_t - \frac{\eta \cdot \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$$

Where:

  • $m_t$ is the first moment estimate (exponential moving average of gradients)
  • $v_t$ is the second moment estimate (exponential moving average of squared gradients)
  • $\beta_1 = 0.9$ controls the first moment decay (momentum-like)
  • $\beta_2 = 0.999$ controls the second moment decay (RMSprop-like)
  • $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates
  • $g_t = \nabla_\theta \mathcal{L}(\theta_t)$ is the gradient at step $t$

In Plain English: Adam tracks both the direction gradients have been pointing (first moment, giving it momentum) and how large those gradients have been (second moment, giving it per-parameter scaling). Bias correction compensates for both estimates starting at zero; without it, the near-zero second moment in the denominator would make the first updates erratically large.

Bias correction matters most in early training. Without it, the first few updates would be divided by a near-zero denominator, blowing up their magnitude and destabilizing training.
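A scalar trace makes the correction visible. With a constant gradient of 1.0, the raw first moment starts near zero while the corrected estimate recovers the true mean immediately:

```python
beta1, g = 0.9, 1.0
m = 0.0
for t in range(1, 4):
    m = beta1 * m + (1 - beta1) * g   # raw estimate: 0.1, 0.19, 0.271, ...
    m_hat = m / (1 - beta1 ** t)      # bias-corrected: exactly 1.0 every step
```

The same correction applies to $v_t$ with $\beta_2$; because $\beta_2 = 0.999$ decays so slowly, the second moment stays biased toward zero for far longer, which is why the denominator is the dangerous one.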

[Figure: optimizer evolution from SGD to modern methods, showing the key innovation at each step]

python
optimizer = torch.optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

Common Pitfall: Adam with L2 regularization does not behave the same as Adam with weight decay. Getting this wrong can cost 1-2% accuracy on transformer models.

AdamW Decouples Weight Decay from the Gradient

AdamW (Loshchilov and Hutter, 2019) fixes a subtle but critical bug in how Adam handles weight decay.

Standard Adam with L2 regularization adds the weight decay term to the gradient before the adaptive scaling. The regularization gets scaled by the inverse of the second moment, effectively reducing its strength for parameters with large gradients.

AdamW decouples weight decay by applying it directly to the parameters after the Adam update:

$$m_t = \beta_1 \cdot m_{t-1} + (1 - \beta_1) \cdot g_t$$

$$v_t = \beta_2 \cdot v_{t-1} + (1 - \beta_2) \cdot g_t^2$$

$$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}$$

$$\theta_{t+1} = \theta_t - \eta \left( \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon} + \lambda \theta_t \right)$$

Where:

  • $\lambda$ is the weight decay coefficient (typically 0.01 to 0.1)
  • All other terms are the same as Adam
  • The key difference: $\lambda \theta_t$ is added outside the adaptive scaling, not inside it

In Plain English: In standard Adam + L2, the penalty for large weights gets processed through the same adaptive machinery as the gradients, warping it unpredictably. AdamW applies weight decay as a separate step: "shrink all weights by a fixed fraction each step, regardless of what the gradients are doing." This produces the regularization the researcher actually intended.
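A single-step scalar comparison isolates the difference (momentum and bias correction omitted; the second-moment value is made up to represent a parameter with large recent gradients):

```python
import math

theta, g = 1.0, 0.01          # one parameter and its gradient
eta, lam = 1e-3, 0.1          # learning rate and weight decay
v_hat = 4.0                   # pretend second-moment estimate (large gradients)

# Adam + L2: the decay term rides along with the gradient,
# so it gets divided by sqrt(v_hat) like everything else
decay_l2 = eta * (lam * theta) / (math.sqrt(v_hat) + 1e-8)

# AdamW: decay is applied outside the adaptive scaling, at full strength
decay_adamw = eta * lam * theta

# For this parameter, coupled L2 regularizes only half as hard
```

The larger a parameter's recent gradients, the more coupled L2 under-regularizes it — exactly backwards from what you usually want.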

AdamW is the default optimizer for virtually every transformer trained in 2026: GPT-series models, Llama, Gemma, Mistral, and the majority of vision transformers. The original paper demonstrated that decoupled weight decay consistently improves generalization, especially combined with warmup and cosine annealing.

python
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    betas=(0.9, 0.999),
    weight_decay=0.01
)

[Figure: how optimizers traverse a loss surface, from SGD's zigzag to Adam's smooth convergence]

Optimizer Comparison at a Glance

| Optimizer | Adaptive LR | Momentum | Weight Decay | Memory (per param) | Best For |
|---|---|---|---|---|---|
| SGD | No | No | L2 | 0 | Baselines |
| SGD + Momentum | No | Yes | L2 | 1 float | CNNs, image classifiers |
| Nesterov SGD | No | Lookahead | L2 | 1 float | Slight edge over momentum |
| Adagrad | Yes | No | L2 | 1 float | Sparse features, NLP |
| RMSprop | Yes | No | L2 | 1 float | RNNs, unstable gradients |
| Adam | Yes | Yes | L2 (coupled) | 2 floats | General purpose |
| AdamW | Yes | Yes | Decoupled | 2 floats | Transformers, LLMs, ViTs |

Pro Tip: Adam stores two state variables per parameter (first and second moments), doubling optimizer memory versus SGD. For a 7B-parameter model in float32, that is an extra 56 GB of optimizer state, which is why 8-bit Adam and memory-efficient optimizers matter at scale.
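The 56 GB figure is straightforward arithmetic:

```python
params = 7e9                   # 7B-parameter model
bytes_per_value = 4            # float32
moments = 2                    # Adam keeps m and v per parameter

optimizer_state_gb = params * moments * bytes_per_value / 1e9
print(optimizer_state_gb)      # 56.0 -- on top of 28 GB for the weights themselves
```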

Learning Rate Schedules Shape the Training Trajectory

The learning rate schedule often matters more than the optimizer itself. Even AdamW with perfect hyperparameters will underperform if the learning rate stays constant.

Step Decay

Multiply the learning rate by a factor (typically 0.1) every $N$ epochs. Common in older CNN training recipes (ResNet, VGG). It works, but the sharp drops can cause instability, and you need to manually pick when to drop.
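The rule itself is one line (PyTorch wraps it as `torch.optim.lr_scheduler.StepLR`); here is a plain-Python sketch of the classic drop-by-10-every-30-epochs recipe:

```python
def step_decay(lr0, epoch, drop_every=30, factor=0.1):
    # Multiply the base rate by `factor` once per `drop_every` epochs
    return lr0 * factor ** (epoch // drop_every)

# ResNet-style recipe: 0.1 until epoch 29, 0.01 until 59, 0.001 after
```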

Cosine Annealing

The learning rate follows a half-cosine curve from the initial value down to near zero:

$$\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\frac{t \cdot \pi}{T}\right)\right)$$

Where:

  • $\eta_{max}$ is the peak learning rate
  • $\eta_{min}$ is the minimum (often 0 or $\eta_{max} / 100$)
  • $t$ is the current step
  • $T$ is the total number of training steps

In Plain English: The learning rate starts high and gently curves down to near zero by the end of training. Unlike step decay, there are no abrupt drops. The cosine shape naturally spends more time at moderate learning rates and less time at the extremes.

Cosine annealing is the default schedule for most LLM pre-training runs in 2026.
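The formula translates directly into code, and the endpoints behave as advertised ($\eta_{max}$ at step 0, $\eta_{min}$ at step $T$):

```python
import math

def cosine_lr(t, T, lr_max, lr_min=0.0):
    # Half-cosine from lr_max at t=0 down to lr_min at t=T
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(t * math.pi / T))

# cosine_lr(0, 1000, 1e-3)    -> 1e-3  (peak)
# cosine_lr(500, 1000, 1e-3)  -> 5e-4  (halfway)
# cosine_lr(1000, 1000, 1e-3) -> ~0    (end)
```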

Linear Warmup

Start with a very small learning rate and linearly increase to the target over the first $N$ steps (typically 1-5% of total training). Warmup prevents the optimizer's zero-initialized adaptive estimates from causing wild initial updates.

python
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)

# Warmup for 1000 steps, then cosine decay for the remaining 49000
warmup = LinearLR(optimizer, start_factor=0.01, total_iters=1000)
cosine = CosineAnnealingLR(optimizer, T_max=49000, eta_min=1e-5)
scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine], milestones=[1000])

One-Cycle Policy

Smith's one-cycle policy (2018) ramps the learning rate up to a peak over the first half of training, then back down to near zero, enabling "super-convergence" at up to 10x faster training.

The high learning rate phase acts as implicit regularization, preventing the model from settling into sharp minima. The cool-down phase then fine-tunes into a broad minimum.

python
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, max_lr=1e-2, total_steps=50000, pct_start=0.3
)

[Figure: learning rate schedules compared — constant, step decay, cosine annealing, and one-cycle]

Pro Tip: For fine-tuning (BERT, ViT, Llama), cosine annealing with warmup is the safest default. For training from scratch on a deadline, the one-cycle policy often finds a good solution faster.

Modern Optimizers Beyond AdamW

AdamW remains dominant in 2026, but several newer optimizers have gained traction for specific use cases.

LION (EvoLved Sign Momentum)

LION (Chen et al., 2023) was discovered by Google Brain using program search. Instead of scaling by the second moment, LION uses only the sign of the momentum:

$$\theta_{t+1} = \theta_t - \eta \cdot (\text{sign}(\beta_1 m_t + (1-\beta_1) g_t) + \lambda \theta_t)$$

LION uses 50% less memory than Adam (no second moment), needs a 3-10x smaller learning rate, and produces uniform-magnitude updates. Results are competitive with AdamW on language models up to 7.5B parameters, though gains vary by architecture.
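A scalar sketch of the update clarifies the sign trick. The interpolate-then-track structure follows my reading of the LION paper's algorithm, and the default betas and decay below are assumptions, not values from this article:

```python
def sign(x):
    return (x > 0) - (x < 0)

def lion_step(theta, m, g, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.1):
    # The direction comes from the SIGN of an interpolated momentum,
    # so every parameter moves by exactly lr (plus the decay term)
    update = sign(beta1 * m + (1 - beta1) * g)
    theta = theta - lr * (update + wd * theta)
    m = beta2 * m + (1 - beta2) * g    # momentum itself is tracked with beta2
    return theta, m

theta, m = lion_step(theta=1.0, m=0.0, g=0.5)
# The step size is lr * (1 + wd*theta) regardless of the gradient's magnitude
```

The uniform step magnitude is why LION needs a much smaller learning rate than Adam: it never shrinks steps for noisy parameters.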

Sophia (Second-Order Clipped Optimization)

Sophia (Liu et al., 2023) approximates the diagonal of the Hessian to pre-condition gradients, giving it second-order information that Adam's second moment merely approximates. On GPT-2 scale experiments, Sophia showed up to 2x speedup in total compute to reach the same loss as AdamW.

Schedule-Free Optimizers

Schedule-Free AdamW (Defazio et al., NeurIPS 2024 Oral) replaces momentum with a combination of interpolation and averaging, eliminating the need to specify a learning rate schedule or the total number of training steps in advance. It won the MLCommons 2024 AlgoPerf Self-Tuning track and matches or outperforms cosine decay schedules without any schedule tuning. The schedulefree PyTorch package from Meta FAIR provides drop-in replacements.

Muon (Momentum Orthogonalized by Newton-Schulz)

Muon (Jordan, 2024) runs standard Nesterov momentum, then orthogonalizes the update matrix using 5 steps of Newton-Schulz iteration. The orthogonalization constrains updates to lie on the Stiefel manifold, which empirically improves training speed for transformers. Scaling law experiments show Muon achieves roughly 2x computational efficiency compared to AdamW with compute-optimal training. Muon was added to PyTorch core as torch.optim.Muon in PyTorch 2.9 and is gaining rapid adoption in the open-source LLM community, with variants like Turbo-Muon and AdaMuon already emerging.
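The orthogonalization step can be sketched with the classic cubic Newton-Schulz iteration. Muon itself uses a tuned quintic polynomial and fewer steps; the iteration, step count, and test matrix below are simplifications for illustration:

```python
import numpy as np

def newton_schulz_orthogonalize(G, steps=10):
    # Cubic Newton-Schulz: X <- 1.5*X - 0.5*X X^T X pushes the singular
    # values of X toward 1, i.e. toward the nearest orthogonal matrix
    X = G / (np.linalg.norm(G) + 1e-7)   # normalize so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X

rng = np.random.default_rng(0)
G = np.eye(4) + 0.1 * rng.standard_normal((4, 4))  # a well-conditioned "update"
Q = newton_schulz_orthogonalize(G)
# Q @ Q.T is now approximately the identity
```

Replacing a raw momentum update with its orthogonalized version equalizes the update's action across directions, which is the intuition behind Muon's speedups on transformer weight matrices.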

Choosing the Right Optimizer for Your Architecture

| Architecture | Recommended Optimizer | Learning Rate | Schedule | Weight Decay |
|---|---|---|---|---|
| Transformer / LLM | AdamW | 1e-4 to 3e-4 | Cosine + warmup | 0.01 - 0.1 |
| Vision Transformer (ViT) | AdamW | 1e-3 to 3e-4 | Cosine + warmup | 0.05 - 0.3 |
| CNN (ResNet, EfficientNet) | SGD + Momentum (0.9) | 0.01 - 0.1 | Step or cosine | 1e-4 |
| Fine-tuning pre-trained | AdamW | 1e-5 to 5e-5 | Linear warmup + decay | 0.01 |
| Small model / quick experiment | Adam | 1e-3 | None or cosine | 0 |
| Memory-constrained | LION | 1e-4 (3-10x smaller) | Cosine | 0.01 |

Key Insight: SGD with momentum still produces the best-generalizing models for image classification when you have the compute budget to tune the schedule. For anything involving attention mechanisms, AdamW is the clear winner because adaptive per-parameter rates handle the wildly different gradient magnitudes across attention heads, layer norms, and feed-forward layers far better than a single global rate.

When NOT to Use AdamW

AdamW is not always the right call. For simple CNNs on CIFAR-10, SGD with momentum will likely match or beat it with half the memory. For short training runs (under 10K steps), the adaptive estimates won't have time to stabilize, and SGD may converge faster. For reinforcement learning, vanilla Adam (without weight decay) is often preferred because RL loss surfaces differ fundamentally from supervised learning.

Hyperparameter Tuning Tips

Learning rate is the single most important hyperparameter. Start with the defaults from the table above and use a learning rate finder (sweep from 1e-7 to 1 on a log scale for 100 steps; pick the rate where loss decreases fastest).

Weight decay interacts with the learning rate. In AdamW the per-step shrink factor is $(1 - \eta\lambda)$, so the product $\eta \cdot \lambda$ is what actually sets the effective regularization strength; retuning one usually means retuning the other. Start with 0.01 and increase it if you see overfitting.

Beta values rarely need tuning. The defaults of $\beta_1 = 0.9$ and $\beta_2 = 0.999$ work across the vast majority of tasks. The one exception: for training transformers and LLMs, some practitioners lower $\beta_2$ to 0.95 to make the second moment estimate more responsive.

Epsilon almost never needs changing from $10^{-8}$. The one exception is mixed-precision training with float16, where $10^{-8}$ underflows to zero and a larger value like $10^{-5}$ keeps the denominator from collapsing.

For a deeper dive into systematic tuning strategies, see our guide to hyperparameter tuning.

Conclusion

The evolution from SGD to AdamW traces a clear arc: each generation solved a specific failure mode of its predecessor. Momentum eliminated oscillation. Adaptive rates handled heterogeneous gradient scales. Decoupled weight decay fixed the accidental coupling between regularization and adaptation in standard Adam.

For most practitioners in 2026, AdamW with cosine annealing and warmup is the starting point for any new project involving attention-based architectures. If you're working with CNNs and can afford the tuning time, SGD with Nesterov momentum and a carefully chosen schedule often produces better-generalizing models.

Understanding why each optimizer works is what separates engineers who debug training failures from those who blindly copy config files. If you're building networks from scratch, our guide to building neural networks in Python pairs well with this material, and our backpropagation deep dive covers the gradient computation that every optimizer depends on. For a complementary look at LLM inference parameters, that article is a natural companion.

The best optimizer is the one you understand deeply enough to debug at 2 AM when your training run diverges.

Interview Questions

Why does Adam need bias correction, and what happens if you skip it?

Adam's moment estimates are initialized at zero. Without bias correction, early updates divide by a near-zero denominator, causing enormous parameter jumps. The correction terms $\frac{1}{1-\beta^t}$ compensate for this and matter most in the first few hundred steps. The effect diminishes as $t$ grows.

Explain the difference between L2 regularization and decoupled weight decay.

L2 adds the weight penalty to the gradient before Adam's adaptive scaling, so the penalty is scaled differently per parameter based on gradient history. AdamW applies weight decay directly to the weights after the adaptive update, preserving the intended regularization strength. AdamW generalizes better on transformers as a result.

When would you prefer SGD over Adam?

SGD with momentum often generalizes better on CNNs, particularly ImageNet-class benchmarks. It finds flatter minima that transfer better. The tradeoff: SGD requires more careful scheduling and converges slower. If compute budget is unlimited and you need maximum test accuracy, SGD with a cosine schedule remains strong.

A transformer's training loss spikes at step 5000. What optimizer-related issues would you investigate?

Check if learning rate warmup ends around step 5000. Look at gradient norm statistics for explosion. Examine whether weight decay is too aggressive relative to the learning rate. Check for float16 numerical instability where $\epsilon$ may need increasing.

What does the second moment in Adam measure?

The second moment $v_t$ tracks the exponential moving average of squared gradients per parameter, approximating gradient variance. Parameters with highly variable gradients get a larger denominator, shrinking their effective rate. This per-parameter scaling is why Adam handles heterogeneous architectures better than a single global rate.

How does the one-cycle policy achieve super-convergence?

It ramps the learning rate up during the first phase, acting as implicit regularization that prevents settling into sharp minima. The cool-down phase then converges into a broad, flat minimum. This exploration-exploitation combination can train models up to 10x faster than constant learning rates.

How can you reduce optimizer memory for a 13B-parameter model without switching to SGD?

Use 8-bit AdamW (bitsandbytes), which quantizes moment states to 8-bit integers, cutting optimizer memory by 75%. LION stores only the first moment, cutting memory by 50%. Muon offers 2x compute efficiency over AdamW with reasonable memory.

Explore all career paths