Every time you train a neural network, backpropagation is doing the heavy lifting. It's the algorithm that answers the most important question in optimization: how should each weight change to reduce the loss? Without it, deep learning simply wouldn't exist. Since Rumelhart, Hinton, and Williams published their landmark 1986 paper in Nature, backpropagation has remained the backbone of neural network training, powering everything from image classifiers to the large language models behind modern AI assistants.
We'll work through backpropagation from the ground up using one consistent example: a tiny 2-layer network that predicts a single output from two inputs. You'll see the forward pass, loss computation, and backward pass with actual numbers you can verify on paper, then connect that manual math to the automatic differentiation engines that handle the work in practice.
The Chain Rule Powers Everything
Backpropagation is just the chain rule from calculus applied systematically to a computational graph. If a neural network is a sequence of composed functions, the chain rule tells you how to decompose the derivative of the entire composition into a product of local derivatives at each step.
Consider three functions composed together: $y = f(g(h(x)))$. The chain rule says:

$$\frac{dy}{dx} = f'(g(h(x))) \cdot g'(h(x)) \cdot h'(x)$$
Where:
- $f'(g(h(x)))$ is the derivative of the outermost function with respect to its input
- $g'(h(x))$ is the derivative of the middle function with respect to its input
- $h'(x)$ is the derivative of the innermost function with respect to the original input
In Plain English: Each layer in a neural network is one of these composed functions. Backpropagation multiplies the local gradients at each layer together, starting from the loss and working backward to the inputs. It's like asking "if I wiggle this weight by a tiny amount, how much does the final loss change?" and getting the answer by multiplying a chain of small effects.
This decomposition is what makes backpropagation efficient. Instead of computing each weight's derivative independently (requiring a separate forward pass each time), you compute all of them in a single backward pass by reusing intermediate results.
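A quick numeric sanity check of the chain rule (the three composed functions here are arbitrary toy choices, not from the network example):

```python
import math

# Toy composition y = f(g(h(x))) with f = square, g = sin, h = exp
def h(x): return math.exp(x)
def g(u): return math.sin(u)
def f(v): return v * v

x = 0.5
u, v = h(x), g(h(x))

# Local derivatives, each evaluated at its forward-pass value
dh = math.exp(x)   # h'(x)
dg = math.cos(u)   # g'(u)
df = 2 * v         # f'(v)

# Chain rule: multiply the local derivatives together
analytic = df * dg * dh

# Verify against a central finite difference
eps = 1e-6
numeric = (f(g(h(x + eps))) - f(g(h(x - eps)))) / (2 * eps)
assert abs(analytic - numeric) < 1e-5
```

The assertion passes because the product of local derivatives is exactly the derivative of the whole composition.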
[Figure: Computational graph showing forward pass values and backward pass gradients through a two-layer network]
A Concrete Network with Real Numbers
The best way to understand backpropagation is to do it by hand. Here's our running example: a 2-layer network with two inputs, one hidden layer (two neurons), and one output neuron. No activation function on the output, and sigmoid activation on the hidden layer.
Network setup:
- Inputs: $x_1 = 0.5$, $x_2 = 0.8$
- Hidden layer weights: $w_{11} = 0.3$, $w_{12} = -0.1$, $w_{21} = 0.5$, $w_{22} = 0.2$
- Hidden biases: $b_1 = 0.1$, $b_2 = -0.2$
- Output weights: $v_1 = 0.4$, $v_2 = -0.3$
- Output bias: $b_3 = 0.05$
- Target: $y_{true} = 1.0$
| Symbol | Value | Description |
|---|---|---|
| $x_1, x_2$ | 0.5, 0.8 | Input features |
| $w_{11}, w_{12}$ | 0.3, -0.1 | Weights to hidden neuron 1 |
| $w_{21}, w_{22}$ | 0.5, 0.2 | Weights to hidden neuron 2 |
| $v_1, v_2$ | 0.4, -0.3 | Weights from hidden to output |
| $b_1, b_2, b_3$ | 0.1, -0.2, 0.05 | Biases |
The Forward Pass Computes Predictions
The forward pass pushes data through the network layer by layer, computing each neuron's output. For our network, the hidden layer computes two values:

$$z_1 = w_{11} x_1 + w_{12} x_2 + b_1$$

Where:
- $z_1$ is the pre-activation value for hidden neuron 1
- $w_{11}, w_{12}$ are the weights connecting inputs to this neuron
- $x_1, x_2$ are the input values
- $b_1$ is the bias term
In Plain English: We're computing a weighted sum of the inputs plus a bias. For our numbers: $z_1 = 0.3 \times 0.5 + (-0.1) \times 0.8 + 0.1 = 0.17$.
Hidden neuron 1: $z_1 = 0.17$, $h_1 = \sigma(0.17) = 0.5424$

Hidden neuron 2: $z_2 = 0.5 \times 0.5 + 0.2 \times 0.8 - 0.2 = 0.21$, $h_2 = \sigma(0.21) = 0.5523$

Output neuron: $\hat{y} = v_1 h_1 + v_2 h_2 + b_3 = 0.4 \times 0.5424 + (-0.3) \times 0.5523 + 0.05 = 0.1013$
The network predicted 0.1013 when the target is 1.0. That's way off, which means the loss will be large and the gradients will push the weights hard.
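The whole forward pass fits in a few lines of plain Python (a sketch; the variable names mirror the symbols above and are our own):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Values from the network setup above
x1, x2 = 0.5, 0.8
z1 = 0.3 * x1 + (-0.1) * x2 + 0.1   # 0.17, hidden neuron 1 pre-activation
z2 = 0.5 * x1 + 0.2 * x2 + (-0.2)   # 0.21, hidden neuron 2 pre-activation
h1, h2 = sigmoid(z1), sigmoid(z2)   # ~0.5424, ~0.5523
y_hat = 0.4 * h1 + (-0.3) * h2 + 0.05

print(round(y_hat, 4))  # 0.1013
```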
Loss Computation Measures the Error
Loss functions quantify how wrong the prediction is. We'll use mean squared error (MSE) for this regression task:

$$L = \frac{1}{2}(\hat{y} - y_{true})^2$$

Where:
- $L$ is the loss (the scalar value we want to minimize)
- $\hat{y}$ is the network's prediction
- $y_{true}$ is the ground truth target
- The factor $\frac{1}{2}$ simplifies the derivative (the 2 from the power rule cancels it)
In Plain English: We square the gap between prediction and target. Our network predicted 0.1013 instead of 1.0, so $L = \frac{1}{2}(0.1013 - 1.0)^2 = 0.4038$. A loss of 0.4038 confirms the network needs significant weight updates.
Key Insight: The factor of $\frac{1}{2}$ is a mathematical convenience, not a requirement. It makes the gradient cleaner: $\hat{y} - y_{true}$ instead of $2(\hat{y} - y_{true})$. You'll see this convention throughout deep learning literature.
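The loss arithmetic is a one-liner to verify:

```python
# Prediction and target from the running example
y_hat, y_true = 0.1013, 1.0
loss = 0.5 * (y_hat - y_true) ** 2
print(round(loss, 4))  # 0.4038
```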
The Backward Pass Computes Gradients
The backward pass is where backpropagation happens. Starting from the loss, we compute the gradient of the loss with respect to every weight by applying the chain rule backward through the graph.
Step 1: Gradient of loss with respect to the prediction.

$$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y_{true} = 0.1013 - 1.0 = -0.8987$$
Step 2: Gradients with respect to output weights. Since $\hat{y} = v_1 h_1 + v_2 h_2 + b_3$:

$$\frac{\partial L}{\partial v_1} = \frac{\partial L}{\partial \hat{y}} \cdot h_1 = -0.8987 \times 0.5424 = -0.4875$$

$$\frac{\partial L}{\partial v_2} = \frac{\partial L}{\partial \hat{y}} \cdot h_2 = -0.8987 \times 0.5523 = -0.4963$$
Step 3: Gradients flowing back to the hidden layer. This is where the chain rule shines. We need $\frac{\partial L}{\partial h_1}$ and $\frac{\partial L}{\partial h_2}$:

$$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial \hat{y}} \cdot v_1 = -0.8987 \times 0.4 = -0.3595$$

$$\frac{\partial L}{\partial h_2} = \frac{\partial L}{\partial \hat{y}} \cdot v_2 = -0.8987 \times (-0.3) = 0.2696$$
Step 4: Through the sigmoid activation. The sigmoid derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

$$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial h_1} \cdot h_1 (1 - h_1) = -0.3595 \times 0.5424 \times 0.4576 = -0.0892$$

$$\frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial h_2} \cdot h_2 (1 - h_2) = 0.2696 \times 0.5523 \times 0.4477 = 0.0667$$
Step 5: Gradients with respect to input weights.

$$\frac{\partial L}{\partial w_{11}} = \frac{\partial L}{\partial z_1} \cdot x_1 = -0.0892 \times 0.5 = -0.0446$$

$$\frac{\partial L}{\partial w_{12}} = \frac{\partial L}{\partial z_1} \cdot x_2 = -0.0892 \times 0.8 = -0.0714$$

$$\frac{\partial L}{\partial w_{21}} = \frac{\partial L}{\partial z_2} \cdot x_1 = 0.0667 \times 0.5 = 0.0334$$

$$\frac{\partial L}{\partial w_{22}} = \frac{\partial L}{\partial z_2} \cdot x_2 = 0.0667 \times 0.8 = 0.0534$$
| Weight | Gradient | Direction of Update |
|---|---|---|
| $v_1$ | -0.4875 | Increase (gradient is negative) |
| $v_2$ | -0.4963 | Increase |
| $w_{21}$ | 0.0334 | Decrease |
| $w_{22}$ | 0.0534 | Decrease |
| $w_{11}$ | -0.0446 | Increase |
| $w_{12}$ | -0.0714 | Increase |
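The five steps can be checked end to end in plain Python, reusing the forward-pass values (variable names are our own shorthand for the symbols above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward-pass values from the running example
x1, x2 = 0.5, 0.8
h1, h2 = sigmoid(0.17), sigmoid(0.21)
y_hat = 0.4 * h1 - 0.3 * h2 + 0.05

# Step 1: dL/dy_hat
d_yhat = y_hat - 1.0                       # ~ -0.8987

# Step 2: output weights
d_v1, d_v2 = d_yhat * h1, d_yhat * h2      # ~ -0.4875, -0.4963

# Step 3: back to the hidden activations
d_h1, d_h2 = d_yhat * 0.4, d_yhat * -0.3   # ~ -0.3595, 0.2696

# Step 4: through the sigmoid, sigma'(z) = sigma(z) * (1 - sigma(z))
d_z1 = d_h1 * h1 * (1 - h1)                # ~ -0.0892
d_z2 = d_h2 * h2 * (1 - h2)                # ~ 0.0667

# Step 5: input weights
d_w11, d_w12 = d_z1 * x1, d_z1 * x2        # ~ -0.0446, -0.0714
d_w21, d_w22 = d_z2 * x1, d_z2 * x2        # ~ 0.0334, 0.0534
```

These values match the gradient table to within rounding.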
Pro Tip: Notice how the gradients for the output layer ($v_1$, $v_2$) are roughly 5 to 10 times larger than those for the input layer ($w_{11}$ through $w_{22}$). This isn't a coincidence. Each backward step through the sigmoid squishes the gradient by at most 0.25 (the maximum of $\sigma'(z)$). Stack enough layers and you've got the vanishing gradient problem.
[Figure: Chain rule decomposition showing gradient flow and multiplication through each layer of the network]
Computational Graphs Make Backpropagation Systematic
A computational graph represents every operation in the forward pass as a node, with edges showing data dependencies. Backpropagation traverses this graph in reverse topological order, accumulating gradients along every path from the loss to each parameter.
Each node stores its output value (from the forward pass) and its local gradient (for the backward pass). The backward pass applies a simple rule at each node: multiply the incoming gradient by the local gradient and pass the result to all parent nodes.
When a node feeds into multiple downstream operations, gradients from those paths are summed. This is the multivariate chain rule in action.
Common Pitfall: Forgetting to sum gradients when a variable is used in multiple places is the most common bug in manual backpropagation implementations. If $h_1$ feeds into three output neurons, you need $\frac{\partial L}{\partial h_1} = \sum_{k=1}^{3} \frac{\partial L}{\partial \hat{y}_k} \cdot \frac{\partial \hat{y}_k}{\partial h_1}$.
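PyTorch's autograd applies this summing automatically, which a two-path toy example makes visible:

```python
import torch

# A variable used in multiple places receives the SUM of gradients
# from every path (the multivariate chain rule)
h = torch.tensor(2.0, requires_grad=True)

a = 3.0 * h        # path 1: da/dh = 3
b = h * h          # path 2: db/dh = 2h = 4
loss = a + b       # dloss/dh = 3 + 4 = 7

loss.backward()
print(h.grad)      # tensor(7.)
```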
This perspective also explains why backpropagation is $O(n)$ in the number of operations, matching the forward pass complexity. Each edge is visited exactly once during the backward pass. Numerical differentiation, by comparison, requires a separate forward pass for each of the $n$ parameters. For a model with billions of parameters, that difference is everything.
Vanishing and Exploding Gradients
Vanishing and exploding gradients are failure modes where gradients become either too small or too large as they propagate backward through many layers.
Why Gradients Vanish
In a deep network with sigmoid layers, the gradient at each layer gets multiplied by a factor $\sigma'(z_l) \cdot w_l$, where $\sigma'(z_l) \leq 0.25$. After $n$ layers:

$$g_1 = g_n \cdot \prod_{l=1}^{n-1} \sigma'(z_l) \, w_l$$

Where:
- $g_1$ is the gradient at the first layer
- $\sigma'(z_l)$ is the sigmoid derivative at layer $l$, bounded by 0.25
- $w_l$ is the weight at layer $l$
- The product shrinks exponentially with depth
In Plain English: If each layer multiplies the gradient by 0.2, then after 10 layers the gradient is $0.2^{10} \approx 0.0000001$. The early layers learn at a glacial pace while the later layers have already converged. This is why deep sigmoid networks were nearly impossible to train before modern solutions emerged.
Why Gradients Explode
The opposite happens when weights are large. If $\lvert \sigma'(z_l) \, w_l \rvert > 1$ at each layer, the product grows exponentially. Training becomes unstable: weights oscillate wildly, loss spikes to infinity, and NaN values appear in your tensors.
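Both failure modes are just a repeated product, which a few lines make concrete (the per-layer factors 0.2 and 1.5 are illustrative, not from any real network):

```python
# Gradient magnitude after repeatedly multiplying by a per-layer factor
def grad_after(layers, factor):
    g = 1.0
    for _ in range(layers):
        g *= factor
    return g

print(grad_after(10, 0.2))  # ~1.024e-07 -> vanishing
print(grad_after(10, 1.5))  # ~57.67     -> exploding
```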
Solutions That Actually Work
| Problem | Solution | How It Helps |
|---|---|---|
| Vanishing gradients | ReLU activation | Derivative is 1 for positive inputs (no squishing) |
| Vanishing gradients | Residual connections | Gradient has a direct path that skips layers |
| Vanishing gradients | LSTM/GRU gates | Gating mechanism controls gradient flow in recurrent networks |
| Exploding gradients | Gradient clipping | Cap gradient norm to a threshold (typically 1.0) |
| Both | Proper initialization | Xavier or He initialization keeps variance stable |
| Both | Batch normalization | Normalizes activations, stabilizes gradient magnitudes |
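Gradient clipping from the table can be sketched with PyTorch's built-in utility (the tiny Linear model and random data here are placeholders):

```python
import torch

model = torch.nn.Linear(4, 1)
x, target = torch.randn(8, 4), torch.randn(8, 1)

loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()

# Cap the global gradient norm at 1.0 before the optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Total norm across all parameters is now at most 1.0
total = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
assert total <= 1.0 + 1e-4
```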
Key Insight: The shift from sigmoid to ReLU activations was arguably the single most impactful practical advance for training deep networks. ReLU doesn't squish gradients for positive inputs, so the vanishing problem largely disappears. But ReLU introduces its own issue: dying neurons, where a neuron's output becomes permanently zero.
[Figure: Gradient magnitude comparison across 10 layers showing vanishing gradients with sigmoid versus stable gradients with ReLU]
Manual Backpropagation vs. Automatic Differentiation
Nobody implements backpropagation by hand for production networks. Automatic differentiation (autodiff) computes exact gradients by recording the computational graph and applying the chain rule automatically. There are two modes:
Forward mode propagates derivatives from inputs to outputs, computing the Jacobian one column at a time. Efficient when you have few inputs and many outputs.
Reverse mode is what backpropagation uses. It propagates derivatives backward, computing the Jacobian one row at a time. Since neural networks have one scalar loss and millions of parameters, reverse mode is vastly more efficient: one backward pass gives you all gradients.
| Property | Forward Mode | Reverse Mode (Backprop) |
|---|---|---|
| Direction | Input to output | Output to input |
| Cost per pass | One column of Jacobian | One row of Jacobian |
| Best when | Few inputs, many outputs | Few outputs, many inputs |
| Memory | Low | Higher (stores activations) |
| Used in deep learning | Rarely | Always |
Pro Tip: Reverse mode autodiff trades memory for speed. It must store all intermediate activations from the forward pass to use during the backward pass. This is why GPU memory is the bottleneck during training. Techniques like gradient checkpointing sacrifice compute to reclaim memory by recomputing some activations instead of storing them.
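Gradient checkpointing can be sketched with PyTorch's `torch.utils.checkpoint` (the four-layer block is an arbitrary example of our own):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Activations inside this block are NOT stored during the forward pass;
# they are recomputed during backward, trading compute for memory
block = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
    torch.nn.Linear(64, 64), torch.nn.ReLU(),
)

x = torch.randn(8, 64, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)  # forward without saving intermediates
y.sum().backward()                             # block is re-run here to get gradients
assert x.grad is not None
```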
Modern Autograd: PyTorch, JAX, and torch.compile
Modern frameworks handle backpropagation through sophisticated autograd engines. Here's how the same gradient computation looks in practice.
PyTorch's Dynamic Computational Graph
PyTorch (version 2.10 as of March 2026) builds the computational graph on the fly as operations execute. This "define-by-run" approach means you can use standard Python control flow (if statements, loops) and still get correct gradients.
```python
import torch

# Same network as our running example
x = torch.tensor([0.5, 0.8])
w_hidden = torch.tensor([[0.3, 0.5], [-0.1, 0.2]], requires_grad=True)
b_hidden = torch.tensor([0.1, -0.2], requires_grad=True)
w_out = torch.tensor([0.4, -0.3], requires_grad=True)
b_out = torch.tensor([0.05], requires_grad=True)

# Forward pass
z = x @ w_hidden + b_hidden         # [0.17, 0.21]
h = torch.sigmoid(z)                # [0.5424, 0.5523]
y_hat = h @ w_out + b_out           # [0.1013]

# Loss
target = torch.tensor([1.0])
loss = 0.5 * (y_hat - target) ** 2  # 0.4038

# Backward pass — one call computes ALL gradients
loss.backward()

print(f"dL/dw_out: {w_out.grad}")   # tensor([-0.4875, -0.4963])
print(f"dL/dw_hidden:\n{w_hidden.grad}")
# tensor([[-0.0446,  0.0334],
#         [-0.0714,  0.0534]])
```
Those gradient values match our hand computation exactly. That's the beauty of PyTorch's autograd engine: it applies the same chain rule math, but without any manual derivatives.
JAX's Functional Approach
JAX (version 0.9.1 as of March 2026) takes a functional approach. You write a pure function and jax.grad returns a new function that computes its gradient:
```python
import jax
import jax.numpy as jnp

def forward(params, x, target):
    w_hidden, b_hidden, w_out, b_out = params
    z = x @ w_hidden + b_hidden
    h = jax.nn.sigmoid(z)
    y_hat = h @ w_out + b_out
    loss = 0.5 * (y_hat - target) ** 2
    return loss.squeeze()

# jax.grad returns a FUNCTION that computes gradients
grad_fn = jax.grad(forward)

params = (
    jnp.array([[0.3, 0.5], [-0.1, 0.2]]),  # w_hidden
    jnp.array([0.1, -0.2]),                # b_hidden
    jnp.array([0.4, -0.3]),                # w_out
    jnp.array([0.05]),                     # b_out
)

grads = grad_fn(params, jnp.array([0.5, 0.8]), jnp.array([1.0]))
# grads has the same structure as params — one gradient per parameter
```
Key Insight: JAX's grad is composable. You can take jax.grad(jax.grad(f)) to get second derivatives, or combine jax.grad with jax.vmap to compute per-example gradients in a batch, something that requires awkward workarounds in PyTorch.
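Both compositions fit in a short sketch (the cubic `f` is an arbitrary toy function of our own):

```python
import jax
import jax.numpy as jnp

f = lambda x: x ** 3

# Composable transforms: grad of grad gives the second derivative
d2f = jax.grad(jax.grad(f))
print(d2f(2.0))     # 12.0, since d2/dx2 of x^3 is 6x

# vmap + grad: per-example derivatives across a batch
batch = jnp.array([1.0, 2.0, 3.0])
per_example = jax.vmap(jax.grad(f))(batch)
print(per_example)  # [ 3. 12. 27.], i.e. 3x^2 at each point
```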
torch.compile Accelerates the Backward Pass
torch.compile (available since PyTorch 2.0) uses TorchDynamo to capture and optimize the entire computational graph, including the backward pass. AOT Autograd traces the backward pass at compile time, enabling kernel fusion:
```python
@torch.compile
def train_step(model, x, target):
    y_hat = model(x)
    loss = torch.nn.functional.mse_loss(y_hat, target)
    loss.backward()
    return loss

# Typical speedup: 1.3x-2x on GPU for medium-to-large models
```
In benchmarks, torch.compile reduces backward pass time by 30 to 50% on transformer architectures because gradient computations involve many small operations that benefit from kernel fusion.
Common Pitfalls and How to Avoid Them
Real-world backpropagation training runs into specific failure modes beyond vanishing gradients. Here's a quick reference:
| Pitfall | Symptom | Fix |
|---|---|---|
| Dying ReLU neurons | >10% of neurons output zero for all inputs | Use Leaky ReLU ($\alpha \approx 0.01$) or reduce learning rate |
| Sigmoid/tanh saturation | Near-zero gradients when $\lvert z \rvert$ is large | Switch to ReLU-family activations; normalize inputs |
| Softmax overflow | NaN from $e^{z_i}$ with large $z_i$ | Subtract $\max_i z_i$ before exponentiation; use F.cross_entropy |
| Gradient accumulation bug | Loss doesn't converge, gradients grow | Call optimizer.zero_grad() before each .backward() |
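The accumulation pitfall exists because PyTorch sums new gradients into existing `.grad` tensors. A minimal loop shows the correct pattern (the toy scalar parameter is our own):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)

for _ in range(3):
    opt.zero_grad()          # without this, .grad keeps the old sum
    loss = (w - 2.0) ** 2
    loss.backward()
    opt.step()

print(w.item())  # ~1.488, moving toward the optimum at 2.0
```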
Common Pitfall: Never implement cross-entropy loss by computing softmax and then taking the log separately. Use the fused log_softmax function (or F.cross_entropy in PyTorch), which is numerically stable and more efficient.
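A quick demonstration of why the fused version matters (the extreme logits are chosen deliberately to force underflow):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([1000.0, 0.0, -1000.0])

# Separate softmax-then-log: tiny probabilities underflow to exactly 0,
# so the log becomes -inf
naive = torch.log(torch.softmax(logits, dim=0))

# Fused log_softmax stays in log space and remains finite
stable = F.log_softmax(logits, dim=0)

print(naive)   # contains -inf entries
print(stable)  # tensor([    0., -1000., -2000.])
```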
When Backpropagation Works and When It Struggles
Backpropagation is the default training algorithm for any differentiable model, but it has clear boundaries.
Use backpropagation when:
- Your model is composed of differentiable operations
- You have a scalar loss function to minimize
- Gradient information is meaningful (smooth loss surface)
- You have enough memory to store activations for the backward pass
Backpropagation struggles when:
- The loss function is non-differentiable (use REINFORCE or straight-through estimators)
- The loss surface is extremely non-convex with many sharp local minima
- Memory is severely constrained (look into gradient checkpointing or reversible architectures)
- You need second-order information (consider L-BFGS or natural gradient methods)
For a deeper treatment of the optimizers that use backpropagation's gradients, see our guide on deep learning optimizers from SGD to AdamW.
Conclusion
Backpropagation is the chain rule applied systematically to a computational graph. That simple idea enables training networks with billions of parameters. The algorithm hasn't fundamentally changed since 1986; what's changed is our ability to execute it efficiently at massive scale.
Our running example produced the exact same gradients by hand and through PyTorch's autograd. Frameworks don't do anything magical. They automate the same math while handling numerical stability, memory management, and GPU parallelism that would be painful to manage manually.
If you're building neural networks from scratch, our guide on building a neural network from scratch in Python walks through forward pass, backpropagation, and weight updates end to end. To understand why different loss surfaces respond differently to gradient-based optimization, the bias-variance tradeoff explains the fundamental tension every model faces.
The best way to truly internalize backpropagation is to compute a few gradients by hand, verify them against autograd, and then trust the framework to handle the rest.
Interview Questions
Q: Walk me through backpropagation in your own words. What's the core idea?
Backpropagation computes the gradient of the loss with respect to every weight by applying the chain rule backward through the network's computational graph. Starting from the loss, each node multiplies the incoming gradient by its local derivative and passes the result upstream. The entire set of gradients is computed in a single backward pass with the same time complexity as the forward pass.
Q: Why does backpropagation use reverse mode autodiff instead of forward mode?
Neural networks typically have millions of parameters but only one scalar loss. Reverse mode computes one row of the Jacobian per pass, so a single backward pass gives gradients for all parameters. Forward mode computes one column per pass, meaning you'd need one pass per parameter, which is computationally infeasible for large models.
Q: What causes vanishing gradients, and how do you fix them?
Vanishing gradients occur when the product of local derivatives across many layers shrinks exponentially, often because activations like sigmoid have a maximum derivative of 0.25. The most effective fixes are using ReLU activations (derivative of 1 for positive inputs), adding residual connections (providing a gradient shortcut), and applying proper weight initialization (Xavier or He).
Q: You notice your model's loss suddenly jumps to NaN during training. What's your debugging process?
Check gradient norms before the NaN appears; if they're growing exponentially, apply gradient clipping. Then look for numerical instability: log(0), division by zero, or raw softmax without the max-subtraction trick. Check for corrupted data (missing values that become NaN in tensors), and try reducing the learning rate.
Q: Why does training use more memory than inference?
Training stores all intermediate activations from the forward pass because the backward pass needs them for gradient computation. Inference only keeps the current layer's activations. Gradient checkpointing trades compute for memory by discarding some activations and recomputing them during the backward pass.
Q: What happens if you forget optimizer.zero_grad() in PyTorch?
Gradients accumulate across backward passes because PyTorch adds new gradients to existing .grad tensors rather than replacing them. Without zeroing, each step uses the sum of current and all previous gradients, producing incorrect updates. This behavior is intentional for gradient accumulation strategies that simulate larger batch sizes.
Q: How would you verify that your custom backward pass implementation is correct?
Use numerical gradient checking: for each weight, compute $\frac{f(w + \epsilon) - f(w - \epsilon)}{2\epsilon}$ with a small $\epsilon$ (around $10^{-5}$) and compare it to the analytical gradient. The relative difference should be below $10^{-5}$. Both PyTorch (torch.autograd.gradcheck) and JAX provide built-in utilities for this verification. Always test with double precision to avoid floating-point noise.
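A minimal `torch.autograd.gradcheck` usage sketch (the sigmoid-sum function is an arbitrary example):

```python
import torch

# gradcheck compares analytical gradients to finite differences.
# It requires double-precision inputs to avoid floating-point noise.
x = torch.randn(3, dtype=torch.double, requires_grad=True)

def f(x):
    return torch.sigmoid(x).sum()

# Returns True on success, raises on mismatch
assert torch.autograd.gradcheck(f, (x,), eps=1e-6, atol=1e-4)
```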
Q: A colleague proposes using sigmoid activations throughout a 50-layer network. What's your response?
Virtually untrainable. Each sigmoid layer multiplies the gradient by at most 0.25, so after 50 layers it's attenuated by $0.25^{50} \approx 10^{-30}$. The first layers receive essentially zero gradient signal. Use ReLU-family activations with residual connections, which is the standard pattern for very deep networks.