Activation Functions Explainer

by Ancil Cleetus

If a neuron were just a weighted sum, stacking layers would be pointless — the composition of linear functions is still linear. Activation functions introduce nonlinearity, giving neural networks the power to learn complex patterns.

Why we need nonlinearity

Without activation functions, any depth of network collapses into a single linear transformation — no matter how many layers you add.

Why does this matter? A linear function can only draw straight lines (or flat planes). But real-world patterns — recognising a face, understanding a sentence, detecting a tumour — are curved and complex. Without nonlinearity, no matter how many layers you stack, the network can only fit a straight line through data. Activation functions are what let it bend.
Without activation functions, the stack Linear → Linear → Linear collapses to a single linear map:

z₂ = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂)

→ just one linear function, no matter how many layers.

With activation functions, the stack Linear → f(·) → Linear → f(·) produces a powerful nonlinear mapping.

Each activation layer adds expressive power — the network can approximate any continuous function.

This is the Universal Approximation Theorem: a neural network with at least one hidden layer and a nonlinear activation function can approximate any continuous function to arbitrary precision, given enough neurons.
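To make the collapse concrete, here is a minimal NumPy sketch (weights, shapes, and names chosen arbitrarily for illustration): composing two linear layers reproduces exactly one linear layer, while inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative "layers": weights and biases with arbitrary values
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Stacking two linear layers...
stacked = W2 @ (W1 @ x + b1) + b2
# ...is exactly one linear layer with W = W2·W1 and b = W2·b1 + b2
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
assert np.allclose(stacked, collapsed)

# A ReLU between the layers breaks the collapse: the map is no longer linear
relu = lambda z: np.maximum(0.0, z)
nonlinear = W2 @ relu(W1 @ x + b1) + b2
```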

Interactive curve explorer

Select an activation function, drag the slider to set an input z, and watch the output and gradient update live on the curve.

What is z? z is the weighted sum that a neuron computes — the raw number produced by multiplying inputs by weights and adding a bias. The activation function then transforms this raw number into the neuron's final output.

Sigmoid

Sigmoid converts any number into a value between 0 and 1 — think of it as turning a raw score into a probability. Feed it 0 and it returns 0.5. Feed it a large positive number and it returns close to 1. Feed it a large negative number and it returns close to 0. It was historically popular but suffers from the vanishing gradient problem for very positive or negative inputs.

σ(z) = 1 / (1 + e⁻ᶻ)
Derivative: σ′(z) = σ(z) · (1 − σ(z)) — maximum gradient of 0.25 at z = 0.
Properties
Output range: (0, 1)
Zero-centred: No
Max gradient: 0.25 at z = 0
Differentiable: Everywhere
Problem: For large |z|, the gradient approaches zero — learning becomes extremely slow. This is called the vanishing gradient problem. Think of it like trying to steer a car when the steering wheel barely responds — technically connected, but effectively useless. In deep networks, this prevents gradients from flowing back through many layers to reach the early layers.
Best use: Output layer of binary classification networks, where the output should represent a probability between 0 and 1.
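As a small sketch (plain NumPy, with function names of my own choosing), the formula and its derivative translate directly into code, and evaluating a few points shows where the gradient vanishes:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)): squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z)): peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, 0.0, 10.0):
    print(f"z={z:+.0f}  output={sigmoid(z):.5f}  gradient={sigmoid_grad(z):.5f}")
# At z = +/-10 the gradient is roughly 0.00005: learning has effectively stalled.
```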

Tanh (hyperbolic tangent)

The tanh function is similar to sigmoid but outputs values in the range (−1, 1). Since its output is zero-centred, it often works better than sigmoid in hidden layers.

tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)
Relationship to sigmoid: tanh(z) = 2σ(2z) − 1 — essentially a rescaled, shifted sigmoid.
Tanh vs sigmoid — key differences
Output range: (−1, 1)
Zero-centred: Yes
Max gradient: 1.0 at z = 0
Steeper near origin: Yes — 4× sigmoid
Zero-centred means the output is balanced around zero — some values are positive, some negative. This matters because if a neuron always outputs positive numbers (as sigmoid does), the gradients for all the weights in the next layer share the same sign on each update, forcing an inefficient zig-zag path during training. Tanh avoids this by outputting values between −1 and +1.
Problem: Like sigmoid, tanh still saturates for large |z|, causing vanishing gradients. It is not the answer for very deep networks.
Best use: Sequence models like RNNs and LSTMs (networks that process text, speech, or time-series data one step at a time), where zero-centred activations improve gradient flow across long sequences.
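A similar sketch for tanh (using NumPy's built-in np.tanh), checking the rescaled-sigmoid identity and the larger peak gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh_grad(z):
    """d/dz tanh(z) = 1 - tanh(z)^2: peaks at 1.0 when z = 0."""
    return 1.0 - np.tanh(z) ** 2

z = np.linspace(-3.0, 3.0, 13)
# tanh is a rescaled, shifted sigmoid: tanh(z) = 2*sigmoid(2z) - 1
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

print(tanh_grad(0.0))  # 1.0, four times sigmoid's maximum gradient of 0.25
```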

The ReLU family

ReLU and its variants are the most widely used activation functions in modern deep learning. They are simple, fast, and largely avoid the vanishing gradient problem for positive inputs.

Three members of the family:
ReLU: f(z) = max(0, z)
Leaky ReLU: f(z) = max(0.01z, z)
GELU: f(z) = z · Φ(z)
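All three definitions are one-liners in code. A hedged sketch, using SciPy's scipy.stats.norm.cdf for the Gaussian CDF Φ (function names are mine):

```python
import numpy as np
from scipy.stats import norm  # norm.cdf is the standard Gaussian CDF, Phi

def relu(z):
    """max(0, z): passes positives unchanged, blocks negatives entirely."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """max(alpha*z, z): negatives leak through with a small slope alpha."""
    return np.where(z > 0, z, alpha * z)

def gelu(z):
    """z * Phi(z): gates each input by how large it is under a standard normal."""
    return z * norm.cdf(z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), leaky_relu(z), gelu(z), sep="\n")
```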

GELU deep dive

The Gaussian Error Linear Unit is the activation function of choice in modern transformers — GPT, BERT, and their successors all use GELU.

GELU(z) = z · Φ(z), where Φ(z) is the standard Gaussian CDF.
Tanh approximation: GELU(z) ≈ 0.5z(1 + tanh[√(2/π)(z + 0.044715z³)])
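To see how closely the tanh approximation tracks the exact definition, a small comparison sketch (exact Φ again via scipy.stats.norm.cdf):

```python
import numpy as np
from scipy.stats import norm

def gelu_exact(z):
    """GELU(z) = z * Phi(z), with the exact Gaussian CDF."""
    return z * norm.cdf(z)

def gelu_tanh(z):
    """The tanh-based approximation quoted above."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.linspace(-4.0, 4.0, 801)
print(np.max(np.abs(gelu_exact(z) - gelu_tanh(z))))  # small: well below 0.01
```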
Why GELU beats ReLU in transformers
Smooth gating, not hard cut-off

ReLU makes a hard decision at zero — below zero, the signal is completely blocked. GELU smoothly gates inputs by their magnitude: larger inputs are more likely to be passed through, while smaller or negative inputs are gradually suppressed. Having a smooth gradient everywhere improves training stability.

Stochastic regularization

GELU was motivated as the expected value of a stochastic regularizer: imagine multiplying each input by a random gate that passes the signal with probability Φ(z), so larger inputs are more likely to get through. Taking the expectation of that random gating gives the deterministic GELU, which behaves like a smooth, input-dependent form of dropout built into the activation itself.

No dead neurons

A ReLU neuron whose pre-activation stays negative for every training example receives zero gradient and can never recover (the "dying ReLU" problem). GELU's gradient is non-zero for any finite input, though it shrinks for very negative z, so a gated-off neuron can still receive some learning signal.
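A small sketch of that gradient comparison (derivatives written out by hand, SciPy's norm.cdf and norm.pdf standing in for Φ and φ): ReLU's derivative is exactly zero for negative inputs, while GELU's derivative Φ(z) + z·φ(z) stays non-zero for finite z.

```python
import numpy as np
from scipy.stats import norm

def relu_grad(z):
    """ReLU'(z): 1 for positive inputs, 0 for negative inputs."""
    return (z > 0).astype(float)

def gelu_grad(z):
    """d/dz [z * Phi(z)] = Phi(z) + z * phi(z), non-zero for any finite z."""
    return norm.cdf(z) + z * norm.pdf(z)

z = np.array([-3.0, -1.0, -0.1])
print(relu_grad(z))   # [0. 0. 0.]  no learning signal at all
print(gelu_grad(z))   # small but non-zero, e.g. about -0.083 at z = -1
```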

Interactive: ReLU vs GELU at the boundary

Drag the slider near zero to see how differently ReLU and GELU behave in the critical region where hard vs soft gating matters most.

Example at z = −0.50: ReLU outputs 0.0000 (hard block), while GELU outputs −0.1543 (smooth suppression).

Activation function summary

A quick reference covering all five functions — range, zero-centring, gradient issues, and where each is used today.

Function    | Range        | Zero-centred | Gradient issues | Modern use
Sigmoid     | (0, 1)       | No           | Vanishing       | Output layer (binary)
Tanh        | (−1, 1)      | Yes          | Vanishing       | RNNs, LSTMs
ReLU        | [0, ∞)       | No           | Dead neurons    | CNNs, general use
Leaky ReLU  | (−∞, ∞)      | No           | Minimal         | GANs, deep networks
GELU        | ≈(−0.17, ∞)  | No           | Minimal         | Transformers (GPT, BERT)
There is no single best activation function — the right choice depends on the architecture and task. ReLU remains the default for hidden layers in CNNs. GELU is the default for transformers. Sigmoid and tanh remain relevant in specific output and recurrent contexts.