A neural network needs a way to measure how wrong its predictions are. This measurement is called the loss (or cost). The training process then adjusts the network's weights to reduce this loss, step by step.
What is loss?
Loss quantifies the gap between what the network predicts and the true answer. A perfect prediction has zero loss. The larger the gap, the larger the loss value.
Concrete example: Imagine a network trained to predict a patient's blood sugar level (on a scale of 0 to 1, where 0 is very low and 1 is very high). The actual reading is 0.85. The network predicts 0.62. That gap of 0.23 (or a function of it, such as its square) is the loss — the signal the network uses to improve.
Interactive: drag the prediction and see the loss change (actual 0.85, predicted 0.62, gap = loss 0.23).
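The blood-sugar example can be sketched in a few lines of Python. The two loss functions below are illustrative choices, not ones the article prescribes: the absolute gap matches the demo above, and squared error is a common variant that penalizes large gaps more heavily.

```python
# Toy loss for a single prediction vs. a single true value.
def absolute_loss(predicted, actual):
    # The raw gap used in the demo above.
    return abs(actual - predicted)

def squared_loss(predicted, actual):
    # A common alternative: large gaps are penalized more heavily.
    return (actual - predicted) ** 2

actual = 0.85
predicted = 0.62

print(round(absolute_loss(predicted, actual), 4))  # the 0.23 gap
print(round(squared_loss(predicted, actual), 4))   # its square, 0.0529
```

A perfect prediction (predicted equal to actual) gives zero under both definitions, matching the "zero loss" description above.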
The loss landscape
Every neural network has thousands — sometimes billions — of knobs called weights. Changing these weights changes the predictions, which changes the loss. If we could plot the loss against every possible combination of weights, we would get a surface called the loss landscape. Training is the process of finding the lowest point on this surface.
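With only two weights, the landscape can be tabulated directly. The sketch below uses a made-up bowl-shaped loss (a simple quadratic, not one from the article) and finds the lowest point on a grid — a brute-force version of what training does cleverly.

```python
import numpy as np

# Toy "loss landscape": loss as a function of just two weights (w1, w2).
# This quadratic bowl is an illustrative choice with its minimum at (1, -0.5).
def loss(w1, w2):
    return (w1 - 1.0) ** 2 + (w2 + 0.5) ** 2

# Evaluate the loss over a grid of weight combinations.
w1s = np.linspace(-2, 2, 81)
w2s = np.linspace(-2, 2, 81)
W1, W2 = np.meshgrid(w1s, w2s)
L = loss(W1, W2)

# The lowest point on the tabulated surface sits at the true minimum.
i, j = np.unravel_index(np.argmin(L), L.shape)
print(W1[i, j], W2[i, j])  # close to (1.0, -0.5)
```

A grid search like this is only feasible in two dimensions; with billions of weights, the grid would be astronomically large, which is why training uses gradients instead.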
Below is a 3D view of what such a surface looks like. Drag to rotate, scroll to zoom. Notice how different the global minimum, local minimum, and saddle point look as you rotate the view.
Key features
Global minimum: the lowest loss point — the ideal destination for training.
Local minimum: a valley that isn't the lowest — gradient descent can get stuck here.
Saddle point: flat in one direction, curving in another.
Real loss landscapes are not 2D or even 3D — they exist in a space with as many dimensions as there are weights in the network, which can be billions. In very high dimensions, local minima are rare — most critical points are saddle points, which gradient descent can usually escape.
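The saddle-point escape can be seen in a tiny simulation. The function below is a standard textbook saddle, L(x, y) = x² − y², chosen for illustration: its gradient is zero at the origin, yet a start that is even slightly off the flat direction slides away.

```python
# L(x, y) = x^2 - y^2 has a saddle at (0, 0): zero gradient,
# but the loss still decreases along the y direction.
def grad(x, y):
    return 2 * x, -2 * y  # (dL/dx, dL/dy)

eta = 0.1
x, y = 0.5, 1e-6  # near the saddle, nudged slightly off the flat axis
for _ in range(200):
    gx, gy = grad(x, y)
    x -= eta * gx
    y -= eta * gy

print(x, y)  # x shrinks toward 0; y grows — descent slides off the saddle
```

This is why saddle points are usually escapable: any small perturbation along the descending direction gets amplified step after step, whereas a true local minimum curves upward in every direction and traps the ball.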
Gradient descent
Gradient descent is the algorithm that navigates the loss landscape to find the minimum. Think of it like a ball rolling downhill — it always moves in the direction that reduces the loss the fastest.
How one gradient descent step works
1. Compute the loss. Run the forward pass. Compare predictions to targets. Calculate L — a single number measuring how wrong the network is right now.
2. Find the slope. For each weight, ask: "if I nudge this weight slightly, does the loss go up or down — and by how much?" This slope is called the gradient. It points in the direction that increases the loss most steeply.
3. Step downhill. Move each weight a small step in the opposite direction of the gradient — downhill, not uphill. With a small enough step, this reduces the loss a little. The size of each step is controlled by the learning rate η.
4. Repeat until the bottom. Repeat steps 1–3 for many iterations. Each step nudges the weights slightly closer to a minimum. The steps get smaller as the slope flattens near the bottom — the network converges.
Mathematicians write those four steps in one line:
w ← w − η · (∂L / ∂w)
w is the weight, η (eta) is the step size (learning rate), and ∂L/∂w is the slope (gradient) of the loss at the current weight value.
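That one-line update, iterated, is the whole algorithm. Below is a minimal sketch on the same kind of toy quadratic (an illustrative loss with minimum at w = 3, and an illustrative learning rate — neither is prescribed by the article).

```python
# w <- w - eta * dL/dw, repeated until the weight settles at the minimum.
def grad(w):
    return 2 * (w - 3.0)  # analytic dL/dw for L(w) = (w - 3)^2

eta = 0.18  # learning rate (step size)
w = 0.0     # starting weight
for step in range(50):
    w -= eta * grad(w)  # step downhill: opposite the gradient

print(w)  # converges toward the minimum at w = 3
```

Each iteration shrinks the distance to the minimum by a constant factor here, which is why the steps get smaller near the bottom: the gradient itself shrinks as the slope flattens.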
Interactive: watch gradient descent roll downhill (learning rate η = 0.18; the readout shows the step count, the current weight w, and the loss L(w)).
The learning rate
Choosing the right learning rate is critical. Too small and training takes forever; too large and it overshoots the minimum — or diverges entirely; just right and it converges smoothly.
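The three regimes are easy to demonstrate on the toy quadratic from before (an illustrative loss with minimum at w = 3; the three rates below are arbitrary examples of each regime).

```python
# Gradient of the toy loss L(w) = (w - 3)^2, minimum at w = 3.
def grad(w):
    return 2 * (w - 3.0)

def descend(eta, steps=30, w=0.0):
    for _ in range(steps):
        w -= eta * grad(w)
    return w

print(descend(0.001))  # too small: after 30 steps, barely moved toward 3
print(descend(0.3))    # just right: essentially at 3
print(descend(1.1))    # too large: each step overshoots further — divergence
```

For this quadratic, each step multiplies the distance to the minimum by |1 − 2η|: less than 1 means convergence, greater than 1 means the overshoot grows every step. Real landscapes have no such clean formula, which is why the learning rate is usually tuned empirically.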