A neural network needs a way to measure how wrong its predictions are. This measurement is called the loss (or cost). The training process then adjusts the network's weights to reduce this loss, step by step.
What is loss?
Loss quantifies the gap between what the network predicts and the true answer. A perfect prediction has zero loss. The larger the gap, the larger the loss value.
Concrete example: Imagine a network trained to predict a patient's blood sugar level (on a scale of 0 to 1, where 0 is very low and 1 is very high). The actual reading is 0.85. The network predicts 0.62. That gap of 0.23 (or a function of it, such as its square) is the loss — the signal the network uses to improve.
Interactive: drag the prediction and see the loss change (actual 0.85, predicted 0.62, gap = loss 0.23).
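The blood-sugar example can be sketched in a few lines of Python. The two loss functions below are illustrative choices, not ones the article prescribes: the absolute gap matches the demo above, and squared error is a common variant that penalizes large gaps more heavily.

```python
# Toy loss for a single prediction vs. a single true value.
def absolute_loss(predicted, actual):
    # The raw gap used in the demo above.
    return abs(actual - predicted)

def squared_loss(predicted, actual):
    # A common alternative: large gaps are penalized more heavily.
    return (actual - predicted) ** 2

actual = 0.85
predicted = 0.62

print(round(absolute_loss(predicted, actual), 4))  # the 0.23 gap
print(round(squared_loss(predicted, actual), 4))   # its square, 0.0529
```

A perfect prediction (predicted equal to actual) gives zero under both definitions, matching the "zero loss" description above.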
The loss landscape
Every neural network has thousands — sometimes billions — of knobs called weights. Changing these weights changes the predictions, which changes the loss. If we could plot the loss against every possible combination of weights, we would get a surface called the loss landscape. Training is the process of finding the lowest point on this surface.
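With only two weights, the landscape can be tabulated directly. The sketch below uses a made-up bowl-shaped loss (a simple quadratic, not one from the article) and finds the lowest point on a grid — a brute-force version of what training does cleverly.

```python
import numpy as np

# Toy "loss landscape": loss as a function of just two weights (w1, w2).
# This quadratic bowl is an illustrative choice with its minimum at (1, -0.5).
def loss(w1, w2):
    return (w1 - 1.0) ** 2 + (w2 + 0.5) ** 2

# Evaluate the loss over a grid of weight combinations.
w1s = np.linspace(-2, 2, 81)
w2s = np.linspace(-2, 2, 81)
W1, W2 = np.meshgrid(w1s, w2s)
L = loss(W1, W2)

# The lowest point on the tabulated surface sits at the true minimum.
i, j = np.unravel_index(np.argmin(L), L.shape)
print(W1[i, j], W2[i, j])  # close to (1.0, -0.5)
```

A grid search like this is only feasible in two dimensions; with billions of weights, the grid would be astronomically large, which is why training uses gradients instead.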
Below is a 3D view of what such a surface looks like. Drag to rotate, scroll to zoom. Notice how different the global minimum, local minimum, and saddle point look as you rotate the view.
Key features
Global minimum: the lowest loss point — the ideal destination for training.
Local minimum: a valley that isn't the lowest — gradient descent can get stuck here.
Saddle point: flat in one direction, curving in another.
Real loss landscapes are not 2D or even 3D — they exist in a space with as many dimensions as there are weights in the network, which can be billions. In very high dimensions, local minima are rare — most critical points are saddle points, which gradient descent can usually escape.
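The saddle-point escape can be seen in a tiny simulation. The function below is a standard textbook saddle, L(x, y) = x² − y², chosen for illustration: its gradient is zero at the origin, yet a start that is even slightly off the flat direction slides away.

```python
# L(x, y) = x^2 - y^2 has a saddle at (0, 0): zero gradient,
# but the loss still decreases along the y direction.
def grad(x, y):
    return 2 * x, -2 * y  # (dL/dx, dL/dy)

eta = 0.1
x, y = 0.5, 1e-6  # near the saddle, nudged slightly off the flat axis
for _ in range(200):
    gx, gy = grad(x, y)
    x -= eta * gx
    y -= eta * gy

print(x, y)  # x shrinks toward 0; y grows — descent slides off the saddle
```

This is why saddle points are usually escapable: any small perturbation along the descending direction gets amplified step after step, whereas a true local minimum curves upward in every direction and traps the ball.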
Gradient descent
Gradient descent is the algorithm that navigates the loss landscape to find the minimum. Think of it like a ball rolling downhill — it always moves in the direction that reduces the loss the fastest.
How one gradient descent step works
1. Compute the loss. Run the forward pass. Compare predictions to targets. Calculate L — a single number measuring how wrong the network is right now.
2. Find the slope. For each weight, ask: "if I nudge this weight slightly, does the loss go up or down — and by how much?" This slope is called the gradient. It points in the direction that increases the loss most steeply.
3. Step downhill. Move each weight a small step in the opposite direction of the gradient — downhill, not uphill. With a small enough step, this reduces the loss a little. The size of each step is controlled by the learning rate η.
4. Repeat until the bottom. Repeat steps 1–3 for many iterations. Each step nudges the weights slightly closer to a minimum. The steps get smaller as the slope flattens near the bottom — the network converges.
Mathematicians write those four steps in one line:
w ← w − η · (∂L / ∂w)
w is the weight, η (eta) is the step size (learning rate), and ∂L/∂w is the slope (gradient) of the loss at the current weight value.
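That one-line update, iterated, is the whole algorithm. Below is a minimal sketch on the same kind of toy quadratic (an illustrative loss with minimum at w = 3, and an illustrative learning rate — neither is prescribed by the article).

```python
# w <- w - eta * dL/dw, repeated until the weight settles at the minimum.
def grad(w):
    return 2 * (w - 3.0)  # analytic dL/dw for L(w) = (w - 3)^2

eta = 0.18  # learning rate (step size)
w = 0.0     # starting weight
for step in range(50):
    w -= eta * grad(w)  # step downhill: opposite the gradient

print(w)  # converges toward the minimum at w = 3
```

Each iteration shrinks the distance to the minimum by a constant factor here, which is why the steps get smaller near the bottom: the gradient itself shrinks as the slope flattens.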
Interactive: watch gradient descent roll downhill (learning rate η = 0.18; the readout shows the step count, the current weight w, and the loss L(w)).
The learning rate
Choosing the right learning rate is critical. Too small and training takes forever; too large and it overshoots the minimum — or diverges entirely; just right and it converges smoothly.
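The three regimes are easy to demonstrate on the toy quadratic from before (an illustrative loss with minimum at w = 3; the three rates below are arbitrary examples of each regime).

```python
# Gradient of the toy loss L(w) = (w - 3)^2, minimum at w = 3.
def grad(w):
    return 2 * (w - 3.0)

def descend(eta, steps=30, w=0.0):
    for _ in range(steps):
        w -= eta * grad(w)
    return w

print(descend(0.001))  # too small: after 30 steps, barely moved toward 3
print(descend(0.3))    # just right: essentially at 3
print(descend(1.1))    # too large: each step overshoots further — divergence
```

For this quadratic, each step multiplies the distance to the minimum by |1 − 2η|: less than 1 means convergence, greater than 1 means the overshoot grows every step. Real landscapes have no such clean formula, which is why the learning rate is usually tuned empirically.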