
Backpropagation: The Engine of Deep Learning

LDS Team
Let's Data Science
16 min

Every time you train a neural network, backpropagation is doing the heavy lifting. It's the algorithm that answers the most important question in optimization: how should each weight change to reduce the loss? Without it, deep learning simply wouldn't exist. Since Rumelhart, Hinton, and Williams published their landmark 1986 paper in Nature, backpropagation has remained the backbone of neural network training, powering everything from image classifiers to the large language models behind modern AI assistants.

We'll work through backpropagation from the ground up using one consistent example: a tiny 2-layer network that predicts a single output from two inputs. You'll see the forward pass, loss computation, and backward pass with actual numbers you can verify on paper, then connect that manual math to the automatic differentiation engines that handle the work in practice.

The Chain Rule Powers Everything

Backpropagation is just the chain rule from calculus applied systematically to a computational graph. If a neural network is a sequence of composed functions, the chain rule tells you how to decompose the derivative of the entire composition into a product of local derivatives at each step.

Consider three functions composed together: $y = f(g(h(x)))$. The chain rule says:

$$\frac{dy}{dx} = \frac{df}{dg} \cdot \frac{dg}{dh} \cdot \frac{dh}{dx}$$

Where:

  • $\frac{df}{dg}$ is the derivative of the outermost function with respect to its input
  • $\frac{dg}{dh}$ is the derivative of the middle function with respect to its input
  • $\frac{dh}{dx}$ is the derivative of the innermost function with respect to the original input $x$

In Plain English: Each layer in a neural network is one of these composed functions. Backpropagation multiplies the local gradients at each layer together, starting from the loss and working backward to the inputs. It's like asking "if I wiggle this weight by a tiny amount, how much does the final loss change?" and getting the answer by multiplying a chain of small effects.

This decomposition is what makes backpropagation efficient. Instead of computing each weight's derivative independently (requiring a separate forward pass each time), you compute all of them in a single backward pass by reusing intermediate results.
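The three-function composition above can be checked numerically in a few lines of plain Python. The particular $h$, $g$, $f$ below are arbitrary choices for illustration:

```python
# Composition y = f(g(h(x))) with h(x) = x + 1, g(u) = u**2, f(v) = 3*v,
# so y = 3*(x + 1)**2 and dy/dx = 6*(x + 1) analytically.
x = 2.0

# Forward pass: store each intermediate value
h_out = x + 1.0          # h(x) = 3.0
g_out = h_out ** 2       # g(h) = 9.0
y = 3.0 * g_out          # f(g) = 27.0

# Local derivatives, evaluated at the values from the forward pass
df_dg = 3.0              # d(3v)/dv
dg_dh = 2.0 * h_out      # d(u^2)/du = 2u
dh_dx = 1.0              # d(x+1)/dx

# Chain rule: multiply the local derivatives together
dy_dx = df_dg * dg_dh * dh_dx

print(dy_dx)             # 18.0, matching the analytic 6*(x + 1) = 18
```

Reusing `h_out` and `g_out` from the forward pass is exactly the "reusing intermediate results" that makes the backward pass cheap.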

Figure: Computational graph showing forward pass values and backward pass gradients through a two-layer network.

A Concrete Network with Real Numbers

The best way to understand backpropagation is to do it by hand. Here's our running example: a 2-layer network with two inputs, one hidden layer (two neurons), and one output neuron. No activation function on the output, and sigmoid activation on the hidden layer.

Network setup:

  • Inputs: $x_1 = 0.5$, $x_2 = 0.8$
  • Hidden layer weights: $w_1 = 0.3$, $w_2 = -0.1$, $w_3 = 0.5$, $w_4 = 0.2$
  • Hidden biases: $b_1 = 0.1$, $b_2 = -0.2$
  • Output weights: $w_5 = 0.4$, $w_6 = -0.3$
  • Output bias: $b_3 = 0.05$
  • Target: $y_{true} = 1.0$

| Symbol | Value | Description |
| --- | --- | --- |
| $x_1, x_2$ | 0.5, 0.8 | Input features |
| $w_1, w_2$ | 0.3, -0.1 | Weights to hidden neuron 1 |
| $w_3, w_4$ | 0.5, 0.2 | Weights to hidden neuron 2 |
| $w_5, w_6$ | 0.4, -0.3 | Weights from hidden to output |
| $b_1, b_2, b_3$ | 0.1, -0.2, 0.05 | Biases |

The Forward Pass Computes Predictions

The forward pass pushes data through the network layer by layer, computing each neuron's output. For our network, the hidden layer computes two values:

$z_1 = w_1 \cdot x_1 + w_2 \cdot x_2 + b_1$

Where:

  • $z_1$ is the pre-activation value for hidden neuron 1
  • $w_1, w_2$ are the weights connecting inputs to this neuron
  • $x_1, x_2$ are the input values
  • $b_1$ is the bias term

In Plain English: We're computing a weighted sum of the inputs plus a bias. For our numbers: $z_1 = 0.3 \times 0.5 + (-0.1) \times 0.8 + 0.1 = 0.17$.

Hidden neuron 1:

  • $z_1 = 0.3(0.5) + (-0.1)(0.8) + 0.1 = 0.15 - 0.08 + 0.1 = 0.17$
  • $h_1 = \sigma(z_1) = \sigma(0.17) = \frac{1}{1 + e^{-0.17}} = 0.5424$

Hidden neuron 2:

  • $z_2 = 0.5(0.5) + 0.2(0.8) + (-0.2) = 0.25 + 0.16 - 0.2 = 0.21$
  • $h_2 = \sigma(z_2) = \sigma(0.21) = \frac{1}{1 + e^{-0.21}} = 0.5523$

Output neuron:

  • $\hat{y} = w_5 \cdot h_1 + w_6 \cdot h_2 + b_3 = 0.4(0.5424) + (-0.3)(0.5523) + 0.05$
  • $\hat{y} = 0.2170 - 0.1657 + 0.05 = 0.1013$

The network predicted 0.1013 when the target is 1.0. That's way off, which means the loss will be large and the gradients will push the weights hard.
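The whole forward pass fits in a few lines of plain Python, which is a handy way to double-check the arithmetic above:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Network setup from the running example
x1, x2 = 0.5, 0.8
w1, w2, w3, w4 = 0.3, -0.1, 0.5, 0.2
b1, b2 = 0.1, -0.2
w5, w6, b3 = 0.4, -0.3, 0.05

# Hidden layer: weighted sums, then sigmoid
z1 = w1 * x1 + w2 * x2 + b1          # 0.17
z2 = w3 * x1 + w4 * x2 + b2          # 0.21
h1, h2 = sigmoid(z1), sigmoid(z2)    # 0.5424, 0.5523

# Linear output neuron (no activation)
y_hat = w5 * h1 + w6 * h2 + b3       # 0.1013

print(round(h1, 4), round(h2, 4), round(y_hat, 4))
```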

Loss Computation Measures the Error

Loss functions quantify how wrong the prediction is. We'll use mean squared error (MSE) for this regression task:

$L = \frac{1}{2}(\hat{y} - y_{true})^2$

Where:

  • $L$ is the loss (scalar value we want to minimize)
  • $\hat{y}$ is the network's prediction
  • $y_{true}$ is the ground truth target
  • The $\frac{1}{2}$ factor simplifies the derivative (the 2 from the power rule cancels it)

In Plain English: We square the gap between prediction and target. Our network predicted 0.1013 instead of 1.0, so $L = \frac{1}{2}(0.1013 - 1.0)^2 = \frac{1}{2}(-0.8987)^2 = 0.4038$. A loss of 0.4038 confirms the network needs significant weight updates.

Key Insight: The factor of $\frac{1}{2}$ is a mathematical convenience, not a requirement. It makes the gradient cleaner: $\frac{\partial L}{\partial \hat{y}} = \hat{y} - y_{true}$ instead of $2(\hat{y} - y_{true})$. You'll see this convention throughout deep learning literature.

The Backward Pass Computes Gradients

The backward pass is where backpropagation happens. Starting from the loss, we compute the gradient of the loss with respect to every weight by applying the chain rule backward through the graph.

Step 1: Gradient of loss with respect to the prediction.

$\frac{\partial L}{\partial \hat{y}} = \hat{y} - y_{true} = 0.1013 - 1.0 = -0.8987$

Step 2: Gradients with respect to output weights. Since $\hat{y} = w_5 h_1 + w_6 h_2 + b_3$:

$\frac{\partial L}{\partial w_5} = \frac{\partial L}{\partial \hat{y}} \cdot h_1 = (-0.8987)(0.5424) = -0.4875$

$\frac{\partial L}{\partial w_6} = \frac{\partial L}{\partial \hat{y}} \cdot h_2 = (-0.8987)(0.5523) = -0.4963$

$\frac{\partial L}{\partial b_3} = \frac{\partial L}{\partial \hat{y}} \cdot 1 = -0.8987$

Step 3: Gradients flowing back to the hidden layer. This is where the chain rule shines. We need $\frac{\partial L}{\partial h_1}$ and $\frac{\partial L}{\partial h_2}$:

$\frac{\partial L}{\partial h_1} = \frac{\partial L}{\partial \hat{y}} \cdot w_5 = (-0.8987)(0.4) = -0.3595$

$\frac{\partial L}{\partial h_2} = \frac{\partial L}{\partial \hat{y}} \cdot w_6 = (-0.8987)(-0.3) = 0.2696$

Step 4: Through the sigmoid activation. The sigmoid derivative is $\sigma'(z) = \sigma(z)(1 - \sigma(z))$:

$\frac{\partial L}{\partial z_1} = \frac{\partial L}{\partial h_1} \cdot \sigma'(z_1) = (-0.3595)(0.5424)(1 - 0.5424) = (-0.3595)(0.2482) = -0.0892$

$\frac{\partial L}{\partial z_2} = \frac{\partial L}{\partial h_2} \cdot \sigma'(z_2) = (0.2696)(0.5523)(1 - 0.5523) = (0.2696)(0.2473) = 0.0667$

Step 5: Gradients with respect to input weights.

$\frac{\partial L}{\partial w_1} = \frac{\partial L}{\partial z_1} \cdot x_1 = (-0.0892)(0.5) = -0.0446$

$\frac{\partial L}{\partial w_2} = \frac{\partial L}{\partial z_1} \cdot x_2 = (-0.0892)(0.8) = -0.0714$

$\frac{\partial L}{\partial w_3} = \frac{\partial L}{\partial z_2} \cdot x_1 = (0.0667)(0.5) = 0.0334$

$\frac{\partial L}{\partial w_4} = \frac{\partial L}{\partial z_2} \cdot x_2 = (0.0667)(0.8) = 0.0534$

| Weight | Gradient | Direction of Update |
| --- | --- | --- |
| $w_1$ | -0.0446 | Increase (gradient is negative) |
| $w_2$ | -0.0714 | Increase |
| $w_3$ | +0.0334 | Decrease |
| $w_4$ | +0.0534 | Decrease |
| $w_5$ | -0.4875 | Increase |
| $w_6$ | -0.4963 | Increase |

Pro Tip: Notice how the gradients for the output layer ($w_5$, $w_6$) are roughly 5 to 10 times larger than those for the input layer ($w_1$ through $w_4$). This isn't a coincidence. Each backward step through the sigmoid squishes the gradient by at most 0.25 (the maximum of $\sigma'(z)$). Stack enough layers and you've got the vanishing gradient problem.
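Steps 1 through 5 can be verified end to end in plain Python, continuing from the forward-pass values of our running example. Full-precision results may differ from the hand arithmetic in the last decimal place, because the hand computation rounds at each step:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass (same numbers as the running example)
x1, x2 = 0.5, 0.8
w1, w2, w3, w4 = 0.3, -0.1, 0.5, 0.2
b1, b2 = 0.1, -0.2
w5, w6, b3 = 0.4, -0.3, 0.05
y_true = 1.0

z1 = w1 * x1 + w2 * x2 + b1
z2 = w3 * x1 + w4 * x2 + b2
h1, h2 = sigmoid(z1), sigmoid(z2)
y_hat = w5 * h1 + w6 * h2 + b3

# Backward pass: chain rule, one step at a time
dL_dyhat = y_hat - y_true             # Step 1: about -0.8987

dL_dw5 = dL_dyhat * h1                # Step 2: about -0.4875
dL_dw6 = dL_dyhat * h2                #         about -0.4964
dL_db3 = dL_dyhat * 1.0               #         about -0.8987

dL_dh1 = dL_dyhat * w5                # Step 3: about -0.3595
dL_dh2 = dL_dyhat * w6                #         about  0.2696

dL_dz1 = dL_dh1 * h1 * (1.0 - h1)     # Step 4: about -0.0892
dL_dz2 = dL_dh2 * h2 * (1.0 - h2)     #         about  0.0667

dL_dw1 = dL_dz1 * x1                  # Step 5: about -0.0446
dL_dw2 = dL_dz1 * x2                  #         about -0.0714
dL_dw3 = dL_dz2 * x1                  #         about  0.0333
dL_dw4 = dL_dz2 * x2                  #         about  0.0533
```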

Figure: Chain rule decomposition showing gradient flow and multiplication through each layer of the network.

Computational Graphs Make Backpropagation Systematic

A computational graph represents every operation in the forward pass as a node, with edges showing data dependencies. Backpropagation traverses this graph in reverse topological order, accumulating gradients along every path from the loss to each parameter.

Each node stores its output value (from the forward pass) and its local gradient (for the backward pass). The backward pass applies a simple rule at each node: multiply the incoming gradient by the local gradient and pass the result to all parent nodes.

When a node feeds into multiple downstream operations, gradients from those paths are summed. This is the multivariate chain rule in action.

Common Pitfall: Forgetting to sum gradients when a variable is used in multiple places is the most common bug in manual backpropagation implementations. If $h_1$ feeds into three output neurons, you need $\frac{\partial L}{\partial h_1} = \sum_{k} \frac{\partial L}{\partial o_k} \cdot \frac{\partial o_k}{\partial h_1}$.
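A minimal sketch of that summing rule, with one shared value $h$ feeding two downstream nodes (the functions here are arbitrary illustrations):

```python
# h feeds two downstream nodes: o1 = 2*h and o2 = h**2, with L = o1 + o2.
# Analytically dL/dh = 2 + 2*h, so the two path gradients must be SUMMED.
h = 3.0

# Forward pass
o1 = 2.0 * h       # 6.0
o2 = h ** 2        # 9.0
L = o1 + o2        # 15.0

# Backward pass: one gradient contribution per path that uses h
dL_do1, dL_do2 = 1.0, 1.0
grad_path1 = dL_do1 * 2.0        # through o1
grad_path2 = dL_do2 * 2.0 * h    # through o2

dL_dh = grad_path1 + grad_path2  # sum, not overwrite: 2 + 6 = 8

print(dL_dh)  # 8.0, matching the analytic 2 + 2*h
```

Overwriting instead of summing (`dL_dh = grad_path2`) would silently drop the first path's contribution, which is exactly the bug the pitfall describes.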

This perspective also explains why backpropagation is $O(n)$ in the number of operations, matching the forward pass complexity. Each edge is visited exactly once during the backward pass. Numerical differentiation, by comparison, requires $O(p)$ forward passes for $p$ parameters. For a model with billions of parameters, that difference is everything.

Vanishing and Exploding Gradients

Vanishing and exploding gradients are failure modes where gradients become either too small or too large as they propagate backward through many layers.

Why Gradients Vanish

In a deep network with $L$ sigmoid layers, the gradient at each layer gets multiplied by $\sigma'(z) \cdot w$, where $\sigma'(z) \leq 0.25$. After $L$ layers:

$$\frac{\partial L}{\partial w^{(1)}} \propto \prod_{l=1}^{L} \sigma'(z^{(l)}) \cdot w^{(l)}$$

Where:

  • $\frac{\partial L}{\partial w^{(1)}}$ is the gradient at the first layer
  • $\sigma'(z^{(l)})$ is the sigmoid derivative at layer $l$, bounded by 0.25
  • $w^{(l)}$ is the weight at layer $l$
  • The product shrinks exponentially with depth

In Plain English: If each layer multiplies the gradient by 0.2, then after 10 layers the gradient is $0.2^{10} \approx 0.0000001$. The early layers learn at a glacial pace while the later layers have already converged. This is why deep sigmoid networks were nearly impossible to train before modern solutions emerged.
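That exponential shrinkage is easy to see directly:

```python
# Simulate a gradient flowing backward through 10 layers,
# each multiplying it by a local factor of 0.2 (under sigmoid's 0.25 maximum).
grad = 1.0
for layer in range(10):
    grad *= 0.2

print(grad)  # about 1e-7 -- the early layers see almost no signal
```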

Why Gradients Explode

The opposite happens when weights are large. If $|\sigma'(z) \cdot w| > 1$ at each layer, the product grows exponentially. Training becomes unstable: weights oscillate wildly, loss spikes to infinity, and NaN values appear in your tensors.

Solutions That Actually Work

| Problem | Solution | How It Helps |
| --- | --- | --- |
| Vanishing gradients | ReLU activation | Derivative is 1 for positive inputs (no squishing) |
| Vanishing gradients | Residual connections | Gradient has a direct path that skips layers |
| Vanishing gradients | LSTM/GRU gates | Gating mechanism controls gradient flow in recurrent networks |
| Exploding gradients | Gradient clipping | Cap gradient norm to a threshold (typically 1.0) |
| Both | Proper initialization | Xavier or He initialization keeps variance stable |
| Both | Batch normalization | Normalizes activations, stabilizes gradient magnitudes |

Key Insight: The shift from sigmoid to ReLU activations was arguably the single most impactful practical advance for training deep networks. ReLU doesn't squish gradients for positive inputs, so the vanishing problem largely disappears. But ReLU introduces its own issue: dying neurons, where a neuron's output becomes permanently zero.
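Gradient clipping from the table above is simple enough to sketch by hand. This mirrors the idea behind `torch.nn.utils.clip_grad_norm_`, shown here as a plain-Python sketch over a flat list of gradients:

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients down so their global L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads, total_norm

# An "exploded" gradient vector with norm 5.0
grads = [3.0, 4.0]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)

print(norm_before)   # 5.0
print(clipped)       # about [0.6, 0.8] -- direction preserved, norm now 1.0
```

Note that clipping rescales the whole vector uniformly, so the update direction is unchanged; only its magnitude is capped.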

Figure: Gradient magnitude comparison across 10 layers showing vanishing gradients with sigmoid versus stable gradients with ReLU.

Manual Backpropagation vs. Automatic Differentiation

Nobody implements backpropagation by hand for production networks. Automatic differentiation (autodiff) computes exact gradients by recording the computational graph and applying the chain rule automatically. There are two modes:

Forward mode propagates derivatives from inputs to outputs, computing the Jacobian one column at a time. Efficient when you have few inputs and many outputs.

Reverse mode is what backpropagation uses. It propagates derivatives backward, computing the Jacobian one row at a time. Since neural networks have one scalar loss and millions of parameters, reverse mode is vastly more efficient: one backward pass gives you all gradients.

| Property | Forward Mode | Reverse Mode (Backprop) |
| --- | --- | --- |
| Direction | Input to output | Output to input |
| Cost per pass | One column of Jacobian | One row of Jacobian |
| Best when | Few inputs, many outputs | Few outputs, many inputs |
| Memory | Low | Higher (stores activations) |
| Used in deep learning | Rarely | Always |

Pro Tip: Reverse mode autodiff trades memory for speed. It must store all intermediate activations from the forward pass to use during the backward pass. This is why GPU memory is the bottleneck during training. Techniques like gradient checkpointing sacrifice compute to reclaim memory by recomputing some activations instead of storing them.
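The memory/compute trade the Pro Tip describes can be sketched in miniature. In this toy illustration (not how real frameworks implement it), each "layer" just squares its input, checkpoints are stored every `segment` layers, and the backward pass recomputes the missing activations from the nearest checkpoint:

```python
def layer(a):
    return a * a            # each "layer" squares its input; local grad is 2*a

def forward_with_checkpoints(x, n_layers, segment):
    """Run forward, storing only every segment-th activation (plus the input)."""
    checkpoints = {0: x}
    a = x
    for i in range(1, n_layers + 1):
        a = layer(a)
        if i % segment == 0:
            checkpoints[i] = a
    return a, checkpoints

def backward_with_recompute(n_layers, segment, checkpoints):
    """Walk backward; recompute non-stored activations from the last checkpoint."""
    grad = 1.0                               # dL/dy with L = y
    for i in range(n_layers, 0, -1):
        # the activation that fed layer i is the output of layer i-1
        k = (i - 1) - ((i - 1) % segment)    # nearest checkpoint at or before i-1
        a = checkpoints[k]
        for j in range(k + 1, i):            # extra compute: replay the forward pass
            a = layer(a)
        grad *= 2.0 * a                      # local gradient of squaring
    return grad

x = 1.1
y, ckpts = forward_with_checkpoints(x, n_layers=4, segment=2)
grad = backward_with_recompute(4, 2, ckpts)

print(grad)   # matches the analytic dy/dx = 16 * x**15 (since y = x**16)
```

Only 3 of 5 values are ever stored; the rest are recomputed on demand, which is the checkpointing trade-off in its smallest form.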

Modern Autograd: PyTorch, JAX, and torch.compile

Modern frameworks handle backpropagation through sophisticated autograd engines. Here's how the same gradient computation looks in practice.

PyTorch's Dynamic Computational Graph

PyTorch (version 2.10 as of March 2026) builds the computational graph on the fly as operations execute. This "define-by-run" approach means you can use standard Python control flow (if statements, loops) and still get correct gradients.

```python
import torch

# Same network as our running example
x = torch.tensor([0.5, 0.8])
w_hidden = torch.tensor([[0.3, 0.5], [-0.1, 0.2]], requires_grad=True)
b_hidden = torch.tensor([0.1, -0.2], requires_grad=True)
w_out = torch.tensor([0.4, -0.3], requires_grad=True)
b_out = torch.tensor([0.05], requires_grad=True)

# Forward pass
z = x @ w_hidden + b_hidden        # [0.17, 0.21]
h = torch.sigmoid(z)               # [0.5424, 0.5523]
y_hat = h @ w_out + b_out          # [0.1013]

# Loss
target = torch.tensor([1.0])
loss = 0.5 * (y_hat - target) ** 2  # 0.4038

# Backward pass — one call computes ALL gradients
loss.backward()

print(f"dL/dw_out: {w_out.grad}")    # tensor([-0.4875, -0.4963])
print(f"dL/dw_hidden:\n{w_hidden.grad}")
# tensor([[-0.0446,  0.0334],
#         [-0.0714,  0.0534]])
```

Those gradient values match our hand computation exactly. That's the beauty of PyTorch's autograd engine: it applies the same chain rule math, but without any manual derivatives.

JAX's Functional Approach

JAX (version 0.9.1 as of March 2026) takes a functional approach. You write a pure function and jax.grad returns a new function that computes its gradient:

```python
import jax
import jax.numpy as jnp

def forward(params, x, target):
    w_hidden, b_hidden, w_out, b_out = params
    z = x @ w_hidden + b_hidden
    h = jax.nn.sigmoid(z)
    y_hat = h @ w_out + b_out
    loss = 0.5 * (y_hat - target) ** 2
    return loss.squeeze()

# jax.grad returns a FUNCTION that computes gradients
grad_fn = jax.grad(forward)

params = (
    jnp.array([[0.3, 0.5], [-0.1, 0.2]]),  # w_hidden
    jnp.array([0.1, -0.2]),                # b_hidden
    jnp.array([0.4, -0.3]),                # w_out
    jnp.array([0.05]),                     # b_out
)

grads = grad_fn(params, jnp.array([0.5, 0.8]), jnp.array([1.0]))
# grads has the same structure as params — one gradient per parameter
```

Key Insight: JAX's grad is composable. You can take jax.grad(jax.grad(f)) to get second derivatives, or combine jax.grad with jax.vmap to compute per-example gradients in a batch, something that requires awkward workarounds in PyTorch.

torch.compile Accelerates the Backward Pass

torch.compile (available since PyTorch 2.0) uses TorchDynamo to capture and optimize the entire computational graph, including the backward pass. AOT Autograd traces the backward pass at compile time, enabling kernel fusion:

```python
@torch.compile
def train_step(model, x, target):
    y_hat = model(x)
    loss = torch.nn.functional.mse_loss(y_hat, target)
    loss.backward()
    return loss

# Typical speedup: 1.3x-2x on GPU for medium-to-large models
```

In benchmarks, torch.compile reduces backward pass time by 30 to 50% on transformer architectures because gradient computations involve many small operations that benefit from kernel fusion.

Common Pitfalls and How to Avoid Them

Real-world backpropagation training runs into specific failure modes beyond vanishing gradients. Here's a quick reference:

| Pitfall | Symptom | Fix |
| --- | --- | --- |
| Dying ReLU neurons | >10% of neurons output zero for all inputs | Use Leaky ReLU ($\alpha = 0.01$) or reduce learning rate |
| Sigmoid/tanh saturation | Near-zero gradients when $\lvert z \rvert$ is large | Switch to ReLU-family activations; use proper initialization and input scaling |
| Softmax overflow | NaN from $e^{z_i}$ with large $z_i$ | Subtract $\max(z)$ before exponentiation; use F.cross_entropy |
| Gradient accumulation bug | Loss doesn't converge, gradients grow | Call optimizer.zero_grad() before each .backward() |

Common Pitfall: Never implement cross-entropy loss by computing softmax and then taking the log separately. Use the fused log_softmax function (or F.cross_entropy in PyTorch), which is numerically stable and more efficient.
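The numerical issue behind that pitfall is easy to demonstrate in plain Python. A naive softmax-then-log overflows on large logits, while the standard max-subtraction trick (the same idea `log_softmax` uses internally) stays stable:

```python
import math

def log_softmax_stable(z):
    """log(softmax(z)) computed with the max-subtraction trick."""
    m = max(z)                                   # shift so the largest logit is 0
    log_sum = math.log(sum(math.exp(v - m) for v in z))
    return [v - m - log_sum for v in z]

z = [1000.0, 1001.0]                             # large logits

# Naive route overflows: math.exp(1000.0) raises OverflowError
try:
    naive = [math.log(math.exp(v) / sum(math.exp(u) for u in z)) for v in z]
except OverflowError:
    naive = None                                 # blew up, as expected

stable = log_softmax_stable(z)
print(naive)   # None
print(stable)  # roughly [-1.3133, -0.3133]
```

The shift changes nothing mathematically (softmax is invariant to adding a constant to every logit) but keeps every exponent at or below zero, so nothing can overflow.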

When Backpropagation Works and When It Struggles

Backpropagation is the default training algorithm for any differentiable model, but it has clear boundaries.

Use backpropagation when:

  • Your model is composed of differentiable operations
  • You have a scalar loss function to minimize
  • Gradient information is meaningful (smooth loss surface)
  • You have enough memory to store activations for the backward pass

Backpropagation struggles when:

  • The loss function is non-differentiable (use REINFORCE or straight-through estimators)
  • The loss surface is extremely non-convex with many sharp local minima
  • Memory is severely constrained (look into gradient checkpointing or reversible architectures)
  • You need second-order information (consider L-BFGS or natural gradient methods)

For a deeper treatment of the optimizers that use backpropagation's gradients, see our guide on deep learning optimizers from SGD to AdamW.

Conclusion

Backpropagation is the chain rule applied systematically to a computational graph. That simple idea enables training networks with billions of parameters. The algorithm hasn't fundamentally changed since 1986; what's changed is our ability to execute it efficiently at massive scale.

Our running example produced the exact same gradients by hand and through PyTorch's autograd. Frameworks don't do anything magical. They automate the same math while handling numerical stability, memory management, and GPU parallelism that would be painful to manage manually.

If you're building neural networks from scratch, our guide on building a neural network from scratch in Python walks through forward pass, backpropagation, and weight updates end to end. To understand why different loss surfaces respond differently to gradient-based optimization, the bias-variance tradeoff explains the fundamental tension every model faces.

The best way to truly internalize backpropagation is to compute a few gradients by hand, verify them against autograd, and then trust the framework to handle the rest.

Interview Questions

Q: Walk me through backpropagation in your own words. What's the core idea?

Backpropagation computes the gradient of the loss with respect to every weight by applying the chain rule backward through the network's computational graph. Starting from the loss, each node multiplies the incoming gradient by its local derivative and passes the result upstream. The entire set of gradients is computed in a single backward pass with the same time complexity as the forward pass.

Q: Why does backpropagation use reverse mode autodiff instead of forward mode?

Neural networks typically have millions of parameters but only one scalar loss. Reverse mode computes one row of the Jacobian per pass, so a single backward pass gives gradients for all parameters. Forward mode computes one column per pass, meaning you'd need one pass per parameter, which is computationally infeasible for large models.

Q: What causes vanishing gradients, and how do you fix them?

Vanishing gradients occur when the product of local derivatives across many layers shrinks exponentially, often because activations like sigmoid have a maximum derivative of 0.25. The most effective fixes are using ReLU activations (derivative of 1 for positive inputs), adding residual connections (providing a gradient shortcut), and applying proper weight initialization (Xavier or He).

Q: You notice your model's loss suddenly jumps to NaN during training. What's your debugging process?

Check gradient norms before the NaN appears; if they're growing exponentially, apply gradient clipping. Then look for numerical instability: log(0), division by zero, or raw softmax without the max-subtraction trick. Check for corrupted data (missing values that become NaN in tensors), and try reducing the learning rate.

Q: Why does training use more memory than inference?

Training stores all intermediate activations from the forward pass because the backward pass needs them for gradient computation. Inference only keeps the current layer's activations. Gradient checkpointing trades compute for memory by discarding some activations and recomputing them during the backward pass.

Q: What happens if you forget optimizer.zero_grad() in PyTorch?

Gradients accumulate across backward passes because PyTorch adds new gradients to existing .grad tensors rather than replacing them. Without zeroing, each step uses the sum of current and all previous gradients, producing incorrect updates. This behavior is intentional for gradient accumulation strategies that simulate larger batch sizes.

Q: How would you verify that your custom backward pass implementation is correct?

Use numerical gradient checking: for each weight, compute $\frac{L(w + \epsilon) - L(w - \epsilon)}{2\epsilon}$ with a small $\epsilon$ (around $10^{-5}$) and compare it to the analytical gradient. The relative difference should be below $10^{-5}$. Both PyTorch (torch.autograd.gradcheck) and JAX provide built-in utilities for this verification. Always test with double precision to avoid floating-point noise.
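Applied to our running example, a central-difference check on $w_5$ confirms the analytic gradient from the backward pass:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w5):
    """Loss of the running-example network as a function of w5 alone."""
    h1 = sigmoid(0.3 * 0.5 + (-0.1) * 0.8 + 0.1)     # 0.5424
    h2 = sigmoid(0.5 * 0.5 + 0.2 * 0.8 + (-0.2))     # 0.5523
    y_hat = w5 * h1 + (-0.3) * h2 + 0.05
    return 0.5 * (y_hat - 1.0) ** 2

# Central difference: (L(w + eps) - L(w - eps)) / (2 * eps)
w5, eps = 0.4, 1e-5
numeric = (loss(w5 + eps) - loss(w5 - eps)) / (2 * eps)

# Analytic gradient from the backward pass: (y_hat - y_true) * h1
h1 = sigmoid(0.17)
y_hat = 0.4 * h1 + (-0.3) * sigmoid(0.21) + 0.05
analytic = (y_hat - 1.0) * h1

print(round(numeric, 4), round(analytic, 4))   # both about -0.4875
```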

Q: A colleague proposes using sigmoid activations throughout a 50-layer network. What's your response?

Virtually untrainable. Each sigmoid layer multiplies the gradient by at most 0.25, so after 50 layers it's attenuated by $0.25^{50} \approx 10^{-30}$. The first layers receive essentially zero gradient signal. Use ReLU-family activations with residual connections, which is the standard pattern for very deep networks.
