Every time a neural network gets better at something — recognizing faces, predicting the next word, classifying a tumor — it is because of one algorithm running quietly in the background: backpropagation.

It is, without exaggeration, the algorithm that made the current era of deep learning possible. And yet, at its core, it is just calculus applied with care. Understanding it properly changes how you think about neural networks — not as black boxes, but as living systems of cause and effect.

The Setup

A neural network is a composition of functions. You feed in an input, it passes through a series of layers, and you get an output. Each layer applies a linear transformation followed by a nonlinearity:

$$h = \sigma(Wx + b)$$

At the end, you compute a loss $\mathcal{L}$, a scalar that measures how wrong the output is. The goal of training is to adjust the weights $W$ and biases $b$ across all layers to minimize $\mathcal{L}$.
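To make the setup concrete, here is a minimal sketch of one layer and a loss in plain Python. The layer sizes, the numbers, and the helper names (`dense_forward`, `squared_error`) are made up for illustration, and squared error is just one possible choice of loss.

```python
import math

def sigmoid(z):
    """Logistic nonlinearity applied to a single number."""
    return 1.0 / (1.0 + math.exp(-z))

def dense_forward(W, b, x):
    """Compute h = sigma(W x + b), with W given as a list of rows."""
    h = []
    for row, bias in zip(W, b):
        z = sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + bias
        h.append(sigmoid(z))
    return h

def squared_error(output, target):
    """Scalar loss: sum of squared differences (an assumed choice)."""
    return sum((o - t) ** 2 for o, t in zip(output, target))

# Hypothetical layer with 2 inputs and 2 units; all values are made up.
W = [[0.5, -0.3], [0.8, 0.2]]
b = [0.1, -0.1]
x = [1.0, 2.0]

h = dense_forward(W, b, x)
loss = squared_error(h, [1.0, 0.0])
print(h, loss)
```

Stacking several calls to `dense_forward`, each feeding the next, gives the composition of functions described above.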

The question is: how does each weight in the network contribute to the final loss?

That’s the question backpropagation answers.

Gradients Are What We’re After

To minimize $\mathcal{L}$, we use gradient descent: nudge each weight in the direction that reduces the loss. The direction and relative size of the nudge come from the partial derivative of $\mathcal{L}$ with respect to that weight, $\frac{\partial \mathcal{L}}{\partial w}$, scaled by a small step size (the learning rate).
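As a rough illustration of the update rule, here is gradient descent on a toy one-parameter loss. The loss `(w - 3)^2`, the learning rate, and the step count are all assumptions picked to keep the example tiny; a real network has many weights and a far more complicated loss.

```python
def toy_loss(w):
    """Toy loss with its minimum at w = 3 (made up for illustration)."""
    return (w - 3.0) ** 2

def toy_grad(w):
    """Derivative of toy_loss with respect to w."""
    return 2.0 * (w - 3.0)

def gradient_descent(w, learning_rate=0.1, steps=50):
    """Repeatedly step against the gradient."""
    for _ in range(steps):
        w = w - learning_rate * toy_grad(w)
    return w

w_final = gradient_descent(w=0.0)
print(w_final, toy_loss(w_final))  # w_final ends up close to 3
```

Everything that follows is about obtaining the equivalent of `toy_grad` for every weight in a deep network.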

For a single weight in the final layer, this is straightforward. But for a weight buried in layer two of a ten-layer network? It affects the loss only indirectly, through every subsequent computation. Computing that derivative requires the chain rule.

The Chain Rule, Revisited

You probably saw the chain rule in calculus. If $z = f(y)$ and $y = g(x)$, then:

$$\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}$$

The derivative “chains” through intermediate variables. In a neural network, the loss passes through many intermediate computations — each layer is one link in the chain.
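If you want to convince yourself the chain rule works, a quick numerical check helps. This sketch picks `g(x) = sin(x)` and `f(y) = y^2` purely as an example, then compares the chain-rule derivative with a central finite difference.

```python
import math

def g(x):
    return math.sin(x)

def f(y):
    return y ** 2

def chain_rule_derivative(x):
    """dz/dx = (dz/dy) * (dy/dx) for z = f(g(x))."""
    y = g(x)
    dz_dy = 2.0 * y        # derivative of y^2
    dy_dx = math.cos(x)    # derivative of sin(x)
    return dz_dy * dy_dx

def numerical_derivative(x, eps=1e-6):
    """Central finite difference of the composed function f(g(x))."""
    return (f(g(x + eps)) - f(g(x - eps))) / (2.0 * eps)

x0 = 0.7
print(chain_rule_derivative(x0), numerical_derivative(x0))
```

The two printed values should agree to several decimal places.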

For a network with layers computing $a_1 \to a_2 \to \cdots \to a_n \to \mathcal{L}$, the gradient with respect to an early layer’s activation is:

$$\frac{\partial \mathcal{L}}{\partial a_1} = \frac{\partial \mathcal{L}}{\partial a_n} \cdot \frac{\partial a_n}{\partial a_{n-1}} \cdots \frac{\partial a_2}{\partial a_1}$$

Each factor is a local gradient: how much that layer’s output changes with respect to its input. (When activations are vectors, each factor is a Jacobian matrix and the product is a matrix product.) Backpropagation is the algorithm that evaluates this product efficiently, layer by layer, starting from the output and working backward.

The Forward Pass and the Backward Pass

Training a neural network has two phases:

Forward pass. Feed the input through the network, compute the output, compute the loss. Critically, save the intermediate activations at every layer. You’ll need them.

Backward pass. Starting from the loss, propagate gradients backward through the network. At each layer, use the saved activations and the local gradient of that layer’s operation to compute how the loss changes with respect to that layer’s weights and inputs.

The gradient arriving at a layer is called the upstream gradient — it comes from everything ahead of it in the network. The layer multiplies it by the local gradient (its own derivative) to produce the downstream gradient — which it passes further back:

$$\delta_{\text{downstream}} = \delta_{\text{upstream}} \times \frac{\partial f}{\partial x}$$

This is the core of backpropagation. Nothing more.
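Here is a minimal sketch of the two passes for a chain of scalar layers. It assumes each layer computes `tanh(w * a)` with a single weight and that the loss is squared error; the weights, input, and target are made up. The forward pass saves every activation, and the backward pass multiplies the upstream gradient by each layer's local gradient.

```python
import math

def forward(weights, x):
    """Run the chain and save every activation, including the input."""
    activations = [x]
    for w in weights:
        activations.append(math.tanh(w * activations[-1]))
    return activations

def backward(weights, activations, dloss_doutput):
    """Walk the layers in reverse, multiplying upstream by local gradients."""
    delta = dloss_doutput                  # upstream gradient at the output
    weight_grads = [0.0] * len(weights)
    for k in reversed(range(len(weights))):
        a_in, a_out = activations[k], activations[k + 1]
        dtanh = 1.0 - a_out ** 2           # tanh'(u), from the saved output
        weight_grads[k] = delta * dtanh * a_in   # gradient for this weight
        delta = delta * dtanh * weights[k]       # downstream gradient
    return weight_grads, delta

# Hypothetical three-layer chain.
weights = [0.5, -1.2, 0.8]
target = 0.3
acts = forward(weights, 1.5)
dloss = 2.0 * (acts[-1] - target)          # derivative of (a_n - target)^2
grads, dloss_dx = backward(weights, acts, dloss)
print(grads, dloss_dx)
```

Note that `backward` never recomputes a `tanh`: it only reads the activations that `forward` stored.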

A Concrete Example

Consider a single neuron with input $x$, weight $w$, and bias $b$:

$$z = wx + b$$

$$a = \sigma(z) \quad \text{(sigmoid activation)}$$

$$\mathcal{L} = (a - y)^2 \quad \text{(squared error loss)}$$

Backward pass, step by step:

$$\frac{\partial \mathcal{L}}{\partial a} = 2(a - y)$$

$$\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial a} \cdot \sigma'(z) = 2(a - y) \cdot \sigma(z)(1 - \sigma(z))$$

$$\frac{\partial \mathcal{L}}{\partial w} = \frac{\partial \mathcal{L}}{\partial z} \cdot x$$

$$\frac{\partial \mathcal{L}}{\partial b} = \frac{\partial \mathcal{L}}{\partial z} \cdot 1$$

Each step applies the chain rule. The gradient $\frac{\partial \mathcal{L}}{\partial w}$ tells us: if we increase $w$ slightly, by how much does the loss increase? We then subtract a small multiple of this gradient from $w$. That’s a gradient descent step.
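The same four steps can be written out as code. The input values below are made up and chosen so the arithmetic is easy to follow by hand (they give a pre-activation of 0 and an activation of 0.5); the learning rate of 0.1 is likewise just an assumption.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def neuron_forward_backward(x, w, b, y):
    # Forward pass, keeping the intermediates the backward pass needs.
    z = w * x + b
    a = sigmoid(z)
    loss = (a - y) ** 2

    # Backward pass, one chain-rule step per line.
    dL_da = 2.0 * (a - y)
    dL_dz = dL_da * a * (1.0 - a)   # sigma'(z) = sigma(z) * (1 - sigma(z))
    dL_dw = dL_dz * x
    dL_db = dL_dz * 1.0
    return loss, dL_dw, dL_db

# Made-up numbers: z = 0.5 * 2 - 1 = 0, so a = 0.5.
loss, dL_dw, dL_db = neuron_forward_backward(x=2.0, w=0.5, b=-1.0, y=1.0)
print(loss, dL_dw, dL_db)  # 0.25 -0.5 -0.25

# One gradient descent step with a hypothetical learning rate of 0.1.
learning_rate = 0.1
w_new = 0.5 - learning_rate * dL_dw
b_new = -1.0 - learning_rate * dL_db
new_loss, _, _ = neuron_forward_backward(2.0, w_new, b_new, 1.0)
print(new_loss)  # smaller than the loss before the step
```

Both gradients are negative here, so the update increases the weight and the bias, which pushes the activation toward the target of 1.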

In a deep network, this process repeats at every layer, the gradients flowing backward like water finding downhill.

Why It’s Efficient

A naive approach to computing gradients would be to perturb each weight slightly and measure how much the loss changes, which gives a numerical derivative. That takes at least one extra forward pass per parameter, so with millions of parameters a single gradient would cost millions of forward passes.

Backpropagation is efficient because it reuses computation. The upstream gradient $\delta$ arriving at a layer is shared across all of that layer’s weights. By computing it once and multiplying, we avoid redundant work. The total cost of a backward pass is only a small constant multiple of the forward pass.
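To see the difference in cost, this sketch counts loss evaluations for a deliberately tiny model: a single linear unit with squared error. The model, the `CountingLoss` helper, and the numbers are all assumptions for illustration. The finite-difference gradient needs one loss evaluation per weight plus a baseline, while the analytic gradient computes the shared upstream term once.

```python
def predict(weights, x):
    return sum(w * xi for w, xi in zip(weights, x))

class CountingLoss:
    """Squared-error loss that counts how many times it is evaluated."""
    def __init__(self, x, y):
        self.x, self.y, self.calls = x, y, 0

    def __call__(self, weights):
        self.calls += 1
        return (predict(weights, self.x) - self.y) ** 2

def finite_difference_grad(loss_fn, weights, eps=1e-6):
    """Perturb one weight at a time: one extra loss evaluation per weight."""
    base = loss_fn(weights)
    grads = []
    for i in range(len(weights)):
        bumped = list(weights)
        bumped[i] += eps
        grads.append((loss_fn(bumped) - base) / eps)
    return grads

def backprop_grad(weights, x, y):
    """One forward computation; the upstream term is reused for every weight."""
    upstream = 2.0 * (predict(weights, x) - y)
    return [upstream * xi for xi in x]

n = 100
x = [0.01 * i for i in range(n)]
weights = [0.1] * n
loss_fn = CountingLoss(x, y=1.0)

numeric = finite_difference_grad(loss_fn, weights)
analytic = backprop_grad(weights, x, y=1.0)
print(loss_fn.calls)  # 101 loss evaluations for 100 weights
```

The two gradients agree closely, but the numerical one scales with the number of weights, which is the problem at a million parameters.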

This efficiency — made possible by the structure of the chain rule — is what makes training large neural networks practical at all.

What Can Go Wrong

Backpropagation is not without its failure modes.

Vanishing gradients. In deep networks, gradients can shrink exponentially as they travel backward. By the time they reach early layers, they’re too small to cause meaningful updates. The sigmoid and tanh activations are especially prone to this: their derivatives approach zero at extreme inputs, and the sigmoid’s derivative never exceeds 0.25. ReLU, whose derivative is exactly 1 for positive inputs, greatly reduces the problem in modern architectures.
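A back-of-the-envelope sketch shows how quickly this happens. It makes a strong simplifying assumption: every layer sees the same pre-activation and every weight equals 1, so the gradient reaching the first layer is just the activation's derivative multiplied by itself once per layer.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def gradient_through_depth(local_grad, z, depth):
    """Product of `depth` identical local gradients (simplified model)."""
    grad = 1.0
    for _ in range(depth):
        grad *= local_grad(z)
    return grad

# Sigmoid at its best case (z = 0, derivative 0.25) over 10 layers.
print(gradient_through_depth(sigmoid_grad, 0.0, 10))  # 9.5367431640625e-07
# ReLU with a positive pre-activation over 10 layers.
print(gradient_through_depth(relu_grad, 1.0, 10))     # 1.0
```

Even in the sigmoid's most favorable case, ten layers shrink the gradient by a factor of about a million.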

Exploding gradients. The opposite problem: gradients grow exponentially and destabilize training. Gradient clipping, which caps the gradient norm at a threshold $\tau$, is the standard fix:

$$g \leftarrow g \cdot \min\left(1, \frac{\tau}{\|g\|}\right)$$
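The clipping rule translates almost directly into code. This is a minimal sketch for a gradient stored as a flat list of floats; the function name `clip_by_norm` and the zero-norm guard are additions for illustration.

```python
import math

def clip_by_norm(grad, tau):
    """Rescale grad so its Euclidean norm is at most tau."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return list(grad)          # nothing to scale, and avoids dividing by zero
    scale = min(1.0, tau / norm)
    return [g * scale for g in grad]

print(clip_by_norm([3.0, 4.0], tau=1.0))   # norm 5, scaled down to norm 1
print(clip_by_norm([0.3, 0.4], tau=1.0))   # norm 0.5, left unchanged
```

Clipping changes the length of the gradient but not its direction, so the update still points downhill.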

Dead neurons. With ReLU, neurons that always receive negative input always output zero. Their gradient is also zero, so they never update. They’re dead. Careful initialization and leaky ReLU variants help here.
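A small sketch makes the dead-neuron problem visible. The neuron below has made-up weights that give a negative pre-activation for every positive input, and the leaky slope of 0.01 is an assumed, commonly used value rather than anything prescribed here.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, slope=0.01):
    return 1.0 if z > 0 else slope

def weight_grad(upstream, x, z, act_grad):
    """dL/dw = upstream * activation'(z) * x for a single neuron."""
    return upstream * act_grad(z) * x

# Hypothetical neuron whose pre-activation is negative for all these inputs.
w, b = -2.0, -1.0
inputs = [0.5, 1.0, 3.0]
upstream = 1.0  # pretend gradient arriving from later layers

relu_grads = [weight_grad(upstream, x, w * x + b, relu_grad) for x in inputs]
leaky_grads = [weight_grad(upstream, x, w * x + b, leaky_relu_grad) for x in inputs]
print(relu_grads)   # [0.0, 0.0, 0.0]
print(leaky_grads)  # small but nonzero
```

With plain ReLU the weight receives no signal at all, so it can never move back into the active region; the leaky variant keeps a small gradient flowing.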

The Bigger Picture

There’s something philosophically interesting about backpropagation. The network doesn’t know, in any meaningful sense, why it’s wrong. It only knows the gradient — the direction to move in weight space to be slightly less wrong. And yet, repeated over millions of examples, this local signal produces global understanding.

The loss function defines what “wrong” means. Backpropagation computes how much each weight contributed to that wrongness. Gradient descent uses that information to make small corrections. Together, they are the feedback loop that turns a randomly initialized tangle of weights into something that can read, see, and reason.

It is, in the end, a very precise way of learning from mistakes — one derivative at a time.