Imagine teaching a child by never telling them when they are wrong. They would never improve. A neural network is the same — it needs a way to measure how wrong its predictions are. That measurement is called the loss. This explainer builds that intuition from scratch, then shows you which loss function to use for different tasks.
What is loss?
Phase A — The intuition
You are training a neural network to predict house prices. It looks at a house and outputs a prediction. Reality then tells it the actual sale price. The gap between the prediction and reality is the loss.
Interactive: drag the prediction to see the gap (e.g. the network predicts £300k).
When the prediction exactly matches reality, the loss is zero — perfect. The further off the prediction, the larger the loss. The network uses this loss signal to adjust itself and improve. No loss signal = no learning.
Phase B — Introducing the notation
In machine learning we write the true answer as y and the model's prediction as ŷ (said "y-hat"). The loss L measures the gap between them. A loss function is just a formula that converts that gap into a single number.
y: the true answer (reality)
ŷ: the prediction (the model's output)
L: the loss, a formula that turns the gap into a single number
Worked example: with true value y = 0.70 and prediction ŷ = 0.30, the gap is y − ŷ = 0.40, so the raw error is |y − ŷ| = 0.40.
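In code, that worked example is one line of arithmetic. A minimal sketch (later sections replace the raw |gap| with task-specific formulas):

```python
y = 0.70       # true answer (reality)
y_hat = 0.30   # the model's prediction

gap = y - y_hat    # signed gap
error = abs(gap)   # raw error |y − ŷ|
print(gap, error)  # both ≈ 0.40 (up to floating-point rounding)
```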
Three roles a loss function plays
1. Measuring instrument. Quantifies how wrong the current predictions are. Without a loss, the network has no signal to learn from.
2. Training objective. The network's entire job during training is to reduce this number. Gradient descent moves the weights in whatever direction makes the loss smaller (see the sketch after this list).
3. Your design choice. You choose which formula to use. Different tasks need different formulas. Choosing the wrong one produces a network that optimises the wrong thing, even if it appears to train correctly.
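For role 2, here is a minimal sketch of a single gradient-descent step, assuming a toy one-parameter model ŷ = w·x trained with squared error (every value here is illustrative):

```python
# One gradient-descent step on L(w) = (y - w*x)^2 for a single example.
x, y = 2.0, 3.0   # input and true answer
w = 0.5           # current weight
lr = 0.1          # learning rate

y_hat = w * x                 # prediction: 1.0
loss = (y - y_hat) ** 2       # loss: 4.0
grad = -2 * (y - y_hat) * x   # dL/dw: -8.0
w -= lr * grad                # step against the gradient: w becomes 1.3

print(w, (y - w * x) ** 2)    # new loss 0.16, smaller than before
```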
Why different formulas? Why can't we just use the gap?
If loss just measures the gap between y and ŷ, why do we need multiple formulas? Why not always use L = |y − ŷ| and be done with it?
📐 Reason 1 — Different tasks have different definitions of "wrong"
Predicting a house price is different from predicting whether an email is spam. A price can be slightly off; a spam classification is either correct or wrong. The loss formula needs to reflect what "wrong" means for your specific task.
🎯 Reason 2 — Outliers need different treatment
If one data point has a huge error (an outlier), should it dominate training or be treated like any other error? Different formulas give different answers — and the right answer depends on your data.
📉 Reason 3 — The gradient needs the right shape
Gradient descent follows the slope of the loss curve. Different formulas produce different slopes — and the wrong slope can make training impossibly slow or push the network in the wrong direction entirely.
Here is a concrete example of why it matters: if you use MSE (mean squared error) on a sigmoid output for a classification task, the gradient is scaled by the sigmoid's derivative, which is nearly zero whenever the output saturates near 0 or 1. The network can stop learning even while it is still confidently wrong. Cross-entropy cancels that derivative, so the gradient stays proportional to the error. Same task, different loss, completely different training behaviour.
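You can see the difference in a few lines of Python. A minimal sketch comparing the gradient with respect to the pre-activation z for a single sigmoid output (names and values here are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    # L = (sigmoid(z) - y)^2, so dL/dz = 2*(ŷ - y) * ŷ*(1 - ŷ).
    # The trailing ŷ*(1 - ŷ) factor is the sigmoid's derivative.
    y_hat = sigmoid(z)
    return 2 * (y_hat - y) * y_hat * (1 - y_hat)

def grad_bce(z, y):
    # For cross-entropy the sigmoid's derivative cancels: dL/dz = ŷ - y.
    return sigmoid(z) - y

# A confidently wrong prediction: true class 1, but z = -4 (ŷ ≈ 0.018).
z, y = -4.0, 1.0
print(grad_mse(z, y))   # ≈ -0.035  (tiny gradient: learning stalls)
print(grad_bce(z, y))   # ≈ -0.982  (large gradient: strong correction)
```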
Task-to-loss mapping
The most important question when choosing a loss function is: what kind of output is my model producing? Every task type has a natural loss function, and using it is not mere convention: each corresponds to maximum-likelihood estimation for that kind of output (MSE assumes Gaussian noise; cross-entropy assumes a categorical distribution).
Regression: predicting a continuous number (house price · temperature · stock return · age)
Classification: predicting a class or probability (cat or dog · spam or not · which digit · sentiment)
The four loss functions — interactive deep dive
Select a loss function to see its equation, its shape, and what it penalises differently from the others.
Interactive: drag the slider to see where your error falls on the curve (e.g. an input error of y − ŷ = 0.50 gives an MSE loss of 0.50² = 0.25).
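As a reference for the curves, here is a minimal sketch of each loss as a pointwise function (the Huber delta of 1.0 is an assumption; it is a common default):

```python
import math

def mse(e):
    # Squared error: small errors shrink, large errors explode.
    return e ** 2

def mae(e):
    # Absolute error: every unit of error costs the same.
    return abs(e)

def huber(e, delta=1.0):
    # Quadratic near zero, linear in the tails.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

def binary_cross_entropy(y, y_hat):
    # y ∈ {0, 1}; y_hat is the predicted probability of class 1.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(mse(0.5), mae(0.5), huber(0.5))   # 0.25 0.5 0.125
print(binary_cross_entropy(1, 0.5))     # ≈ 0.693
```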
Geometric interpretation — what the loss looks like on real data
MSE is literally the average area of the shaded squares — one square per data point, with side length equal to the error. Larger errors grow the square area quadratically. Drag the slider to move the regression line and watch all the squares update live.
Interactive: drag the regression-line offset slider (e.g. offset 0.35) and the widget recomputes the total loss, the individual residuals, and the calculation as a mean over the 8 squared errors.
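In code, that calculation is just the mean of the squared residuals. A minimal sketch with a made-up 8-point dataset (the points, the line y = 2x + 1, and the offset values are all illustrative assumptions, not the widget's data):

```python
# Hypothetical data: 8 (x, y) points lying roughly on y = 2x + 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8, 7.2, 7.9]

def mse_for_offset(offset):
    # Shift the line y = 2x + 1 by `offset`, then average the squared
    # residuals: each residual is the side length of one shaded square.
    residuals = [y - (2 * x + 1 + offset) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / len(residuals)

print(mse_for_offset(0.0))   # ≈ 0.021: near-optimal line, small squares
print(mse_for_offset(0.35))  # ≈ 0.135: shifted line, every square grows
```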
MSE vs MAE vs Huber — the outlier problem
The biggest practical difference between these three loss functions is how they react to outliers — data points with unusually large errors.
In the interactive below, the dataset has 6 error values — 5 small base errors [0.1, −0.2, 0.15, −0.1, 0.05] plus one outlier whose magnitude you control with the slider. The displayed losses are the mean across all 6 errors. Drag the outlier slider to see how each loss function reacts.
Interactive: drag the outlier-magnitude slider (e.g. 1.0) and the widget recomputes the calculation (the mean over all 6 errors) and a loss summary for MSE, MAE, and Huber.
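That reaction is easy to reproduce. A sketch using the five base errors quoted above, with the Huber delta set to 1.0 (an assumed default; the widget's actual delta isn't stated):

```python
base_errors = [0.1, -0.2, 0.15, -0.1, 0.05]

def huber(e, delta=1.0):
    # Quadratic for small errors, linear once |e| exceeds delta.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

def summarise(outlier):
    errors = base_errors + [outlier]
    n = len(errors)
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    hub = sum(huber(e) for e in errors) / n
    print(f"outlier={outlier:4.1f}  MSE={mse:.3f}  MAE={mae:.3f}  Huber={hub:.3f}")

summarise(1.0)  # modest outlier: the three losses stay comparable
summarise(5.0)  # large outlier: MSE explodes (25/6 from one point),
                # MAE grows only linearly, Huber sits in between
```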
Decision guide — which loss to use
A practical reference for choosing the right loss function for your task.
Loss function · Formula · Task · Outlier sensitivity · Use when
MSE · (y − ŷ)² · Regression · High — squares errors · Outliers are genuine signal, not noise. Clean data.
MAE · |y − ŷ| · Regression · Low — linear penalty · Data has real outliers you want to ignore. Median behaviour.
Huber · ½e² if |e| ≤ δ, else δ(|e| − ½δ), where e = y − ŷ · Regression · Medium — best of both · Uncertain about outlier prevalence. Robust regression.
Binary CE · −[y·log(ŷ) + (1 − y)·log(1 − ŷ)] · Classification · High for confident wrong · Two-class problems. Output is a probability via sigmoid.
Categorical CE · −Σ yᵢ·log(pᵢ) · Classification · High for confident wrong · Multi-class problems. Output is softmax probabilities.
The golden rule: if your output is a continuous number, use MSE / MAE / Huber. If your output is a class label or probability, use cross-entropy. Never use MSE for classification: paired with a sigmoid or softmax output it produces vanishing gradients, so training can stall even while the model is still confidently wrong.
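To make the two cross-entropy rows concrete, here is a minimal sketch of both formulas in plain Python (the labels and probabilities are made up):

```python
import math

def binary_cross_entropy(y, y_hat):
    # y ∈ {0, 1}; y_hat is the predicted probability of class 1 (sigmoid output).
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, probs):
    # y_onehot is a one-hot label; probs is a softmax distribution.
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs))

# Confident and correct: tiny loss.
print(binary_cross_entropy(1, 0.95))   # ≈ 0.051
# Confident and wrong: huge loss (the "high for confident wrong" rows).
print(binary_cross_entropy(1, 0.05))   # ≈ 3.0
# Three-class example: true class is index 2.
print(categorical_cross_entropy([0, 0, 1], [0.1, 0.2, 0.7]))  # ≈ 0.357
```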