Imagine teaching a child by never telling them when they are wrong. They would never improve. A neural network is the same — it needs a way to measure how wrong its predictions are. That measurement is called the loss. This explainer builds that intuition from scratch, then shows you which loss function to use for different tasks.
What is loss?
Phase A — The intuition
You are training a neural network to predict house prices. It looks at a house and outputs a prediction. Reality then tells it the actual sale price. The gap between the prediction and reality is the loss.
Interactive: drag the prediction to see the gap (e.g. the network predicts £300k).
When the prediction exactly matches reality, the loss is zero — perfect. The further off the prediction, the larger the loss. The network uses this loss signal to adjust itself and improve. No loss signal = no learning.
Phase B — Introducing the notation
In machine learning we write the true answer as y and the model's prediction as ŷ (said "y-hat"). The loss L measures the gap between them. A loss function is just a formula that converts that gap into a single number.
y: the true answer (reality)
ŷ: the prediction (the model's output)
L: the loss, a formula that turns the gap into a single number
Worked example: with true value y = 0.70 and prediction ŷ = 0.30, the gap is y − ŷ = 0.40, so the raw error is |y − ŷ| = 0.40.
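In code, that worked example is one line of arithmetic. A minimal sketch (later sections replace the raw |gap| with task-specific formulas):

```python
y = 0.70       # true answer (reality)
y_hat = 0.30   # the model's prediction

gap = y - y_hat    # signed gap
error = abs(gap)   # raw error |y − ŷ|
print(gap, error)  # both ≈ 0.40 (up to floating-point rounding)
```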
Three roles a loss function plays
1. Measuring instrument. Quantifies how wrong the current predictions are. Without a loss, the network has no signal to learn from.
2. Training objective. The network's entire job during training is to reduce this number. Gradient descent moves the weights in whatever direction makes the loss smaller (see the sketch after this list).
3. Your design choice. You choose which formula to use. Different tasks need different formulas. Choosing the wrong one produces a network that optimises the wrong thing, even if it appears to train correctly.
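For role 2, here is a minimal sketch of a single gradient-descent step, assuming a toy one-parameter model ŷ = w·x trained with squared error (every value here is illustrative):

```python
# One gradient-descent step on L(w) = (y - w*x)^2 for a single example.
x, y = 2.0, 3.0   # input and true answer
w = 0.5           # current weight
lr = 0.1          # learning rate

y_hat = w * x                 # prediction: 1.0
loss = (y - y_hat) ** 2       # loss: 4.0
grad = -2 * (y - y_hat) * x   # dL/dw: -8.0
w -= lr * grad                # step against the gradient: w becomes 1.3

print(w, (y - w * x) ** 2)    # new loss 0.16, smaller than before
```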
Why different formulas? Why can't we just use the gap?
If loss just measures the gap between y and ŷ, why do we need multiple formulas? Why not always use L = |y − ŷ| and be done with it?
📐 Reason 1 — Different tasks have different definitions of "wrong"
Predicting a house price is different from predicting whether an email is spam. A price can be slightly off; a spam classification is either correct or wrong. The loss formula needs to reflect what "wrong" means for your specific task.
🎯 Reason 2 — Outliers need different treatment
If one data point has a huge error (an outlier), should it dominate training or be treated like any other error? Different formulas give different answers — and the right answer depends on your data.
📉 Reason 3 — The gradient needs the right shape
Gradient descent follows the slope of the loss curve. Different formulas produce different slopes — and the wrong slope can make training impossibly slow or push the network in the wrong direction entirely.
Here is a concrete example of why it matters: if you use MSE (mean squared error) on a sigmoid output for a classification task, the gradient is scaled by the sigmoid's derivative, which is nearly zero whenever the output saturates near 0 or 1. The network can stop learning even while it is still confidently wrong. Cross-entropy cancels that derivative, so the gradient stays proportional to the error. Same task, different loss, completely different training behaviour.
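You can see the difference in a few lines of Python. A minimal sketch comparing the gradient with respect to the pre-activation z for a single sigmoid output (names and values here are illustrative):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_mse(z, y):
    # L = (sigmoid(z) - y)^2, so dL/dz = 2*(ŷ - y) * ŷ*(1 - ŷ).
    # The trailing ŷ*(1 - ŷ) factor is the sigmoid's derivative.
    y_hat = sigmoid(z)
    return 2 * (y_hat - y) * y_hat * (1 - y_hat)

def grad_bce(z, y):
    # For cross-entropy the sigmoid's derivative cancels: dL/dz = ŷ - y.
    return sigmoid(z) - y

# A confidently wrong prediction: true class 1, but z = -4 (ŷ ≈ 0.018).
z, y = -4.0, 1.0
print(grad_mse(z, y))   # ≈ -0.035  (tiny gradient: learning stalls)
print(grad_bce(z, y))   # ≈ -0.982  (large gradient: strong correction)
```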
Task-to-loss mapping
The most important question when choosing a loss function is: what kind of output is my model producing? Every task type has a natural loss function, and using it is not mere convention: each corresponds to maximum-likelihood estimation for that kind of output (MSE assumes Gaussian noise; cross-entropy assumes a categorical distribution).
Regression: predicting a continuous number (house price · temperature · stock return · age)
Classification: predicting a class or probability (cat or dog · spam or not · which digit · sentiment)
The four loss functions — interactive deep dive
Select a loss function to see its equation, its shape, and what it penalises differently from the others.
Interactive: drag the slider to see where your error falls on the curve (e.g. an input error of y − ŷ = 0.50 gives an MSE loss of 0.50² = 0.25).
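As a reference for the curves, here is a minimal sketch of each loss as a pointwise function (the Huber delta of 1.0 is an assumption; it is a common default):

```python
import math

def mse(e):
    # Squared error: small errors shrink, large errors explode.
    return e ** 2

def mae(e):
    # Absolute error: every unit of error costs the same.
    return abs(e)

def huber(e, delta=1.0):
    # Quadratic near zero, linear in the tails.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

def binary_cross_entropy(y, y_hat):
    # y ∈ {0, 1}; y_hat is the predicted probability of class 1.
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

print(mse(0.5), mae(0.5), huber(0.5))   # 0.25 0.5 0.125
print(binary_cross_entropy(1, 0.5))     # ≈ 0.693
```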
Geometric interpretation — what the loss looks like on real data
MSE is literally the average area of the shaded squares — one square per data point, with side length equal to the error. Larger errors grow the square area quadratically. Drag the slider to move the regression line and watch all the squares update live.
Interactive: drag the regression-line offset slider (e.g. offset 0.35) and the widget recomputes the total loss, the individual residuals, and the calculation as a mean over the 8 squared errors.
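In code, that calculation is just the mean of the squared residuals. A minimal sketch with a made-up 8-point dataset (the points, the line y = 2x + 1, and the offset values are all illustrative assumptions, not the widget's data):

```python
# Hypothetical data: 8 (x, y) points lying roughly on y = 2x + 1.
xs = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5]
ys = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8, 7.2, 7.9]

def mse_for_offset(offset):
    # Shift the line y = 2x + 1 by `offset`, then average the squared
    # residuals: each residual is the side length of one shaded square.
    residuals = [y - (2 * x + 1 + offset) for x, y in zip(xs, ys)]
    return sum(r ** 2 for r in residuals) / len(residuals)

print(mse_for_offset(0.0))   # ≈ 0.021: near-optimal line, small squares
print(mse_for_offset(0.35))  # ≈ 0.135: shifted line, every square grows
```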
MSE vs MAE vs Huber — the outlier problem
The biggest practical difference between these three loss functions is how they react to outliers — data points with unusually large errors.
In the interactive below, the dataset has 6 error values — 5 small base errors [0.1, −0.2, 0.15, −0.1, 0.05] plus one outlier whose magnitude you control with the slider. The displayed losses are the mean across all 6 errors. Drag the outlier slider to see how each loss function reacts.
Interactive: drag the outlier-magnitude slider (e.g. 1.0) and the widget recomputes the calculation (the mean over all 6 errors) and a loss summary for MSE, MAE, and Huber.
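That reaction is easy to reproduce. A sketch using the five base errors quoted above, with the Huber delta set to 1.0 (an assumed default; the widget's actual delta isn't stated):

```python
base_errors = [0.1, -0.2, 0.15, -0.1, 0.05]

def huber(e, delta=1.0):
    # Quadratic for small errors, linear once |e| exceeds delta.
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * (abs(e) - 0.5 * delta)

def summarise(outlier):
    errors = base_errors + [outlier]
    n = len(errors)
    mse = sum(e ** 2 for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    hub = sum(huber(e) for e in errors) / n
    print(f"outlier={outlier:4.1f}  MSE={mse:.3f}  MAE={mae:.3f}  Huber={hub:.3f}")

summarise(1.0)  # modest outlier: the three losses stay comparable
summarise(5.0)  # large outlier: MSE explodes (25/6 from one point),
                # MAE grows only linearly, Huber sits in between
```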
Decision guide — which loss to use
A practical reference for choosing the right loss function for your task.
Loss function · Formula · Task · Outlier sensitivity · Use when
MSE · (y − ŷ)² · Regression · High — squares errors · Outliers are genuine signal, not noise. Clean data.
MAE · |y − ŷ| · Regression · Low — linear penalty · Data has real outliers you want to ignore. Median behaviour.
Huber · ½e² if |e| ≤ δ, else δ(|e| − ½δ), where e = y − ŷ · Regression · Medium — best of both · Uncertain about outlier prevalence. Robust regression.
Binary CE · −[y·log(ŷ) + (1 − y)·log(1 − ŷ)] · Classification · High for confident wrong · Two-class problems. Output is a probability via sigmoid.
Categorical CE · −Σ yᵢ·log(pᵢ) · Classification · High for confident wrong · Multi-class problems. Output is softmax probabilities.
The golden rule: if your output is a continuous number, use MSE / MAE / Huber. If your output is a class label or probability, use cross-entropy. Never use MSE for classification: paired with a sigmoid or softmax output it produces vanishing gradients, so training can stall even while the model is still confidently wrong.
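To make the two cross-entropy rows concrete, here is a minimal sketch of both formulas in plain Python (the labels and probabilities are made up):

```python
import math

def binary_cross_entropy(y, y_hat):
    # y ∈ {0, 1}; y_hat is the predicted probability of class 1 (sigmoid output).
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def categorical_cross_entropy(y_onehot, probs):
    # y_onehot is a one-hot label; probs is a softmax distribution.
    return -sum(y * math.log(p) for y, p in zip(y_onehot, probs))

# Confident and correct: tiny loss.
print(binary_cross_entropy(1, 0.95))   # ≈ 0.051
# Confident and wrong: huge loss (the "high for confident wrong" rows).
print(binary_cross_entropy(1, 0.05))   # ≈ 3.0
# Three-class example: true class is index 2.
print(categorical_cross_entropy([0, 0, 1], [0.1, 0.2, 0.7]))  # ≈ 0.357
```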