Activation Functions Explainer

by Ancil Cleetus

If a neuron were just a weighted sum, stacking layers would be pointless — the composition of linear functions is still linear. Activation functions introduce nonlinearity, giving neural networks the power to learn complex patterns.

Why we need nonlinearity

Without activation functions, any depth of network collapses into a single linear transformation — no matter how many layers you add.

Why does this matter? A linear function can only draw straight lines (or flat planes). But real-world patterns — recognising a face, understanding a sentence, detecting a tumour — are curved and complex. Without nonlinearity, no matter how many layers you stack, the network can only fit a straight line through data. Activation functions are what let it bend.
Without activation functions, the stack Linear → Linear → Linear collapses to a single linear map:

z₂ = W₂(W₁x + b₁) + b₂ = (W₂W₁)x + (W₂b₁ + b₂)

→ just one linear function, no matter how many layers.

With activation functions, the stack Linear → f(·) → Linear → f(·) produces a powerful nonlinear mapping.

Each activation layer adds expressive power — the network can approximate any continuous function.

This is the Universal Approximation Theorem: a neural network with at least one hidden layer and a nonlinear activation function can approximate any continuous function to arbitrary precision, given enough neurons.
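To make the collapse concrete, here is a minimal NumPy sketch (weights, shapes, and names chosen arbitrarily for illustration): composing two linear layers reproduces exactly one linear layer, while inserting a ReLU between them breaks the equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two illustrative "layers": weights and biases with arbitrary values
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Stacking two linear layers...
stacked = W2 @ (W1 @ x + b1) + b2
# ...is exactly one linear layer with W = W2·W1 and b = W2·b1 + b2
collapsed = (W2 @ W1) @ x + (W2 @ b1 + b2)
assert np.allclose(stacked, collapsed)

# A ReLU between the layers breaks the collapse: the map is no longer linear
relu = lambda z: np.maximum(0.0, z)
nonlinear = W2 @ relu(W1 @ x + b1) + b2
```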

Interactive curve explorer

Select an activation function, drag the slider to set an input z, and watch the output and gradient update live on the curve.

What is z? z is the weighted sum that a neuron computes — the raw number produced by multiplying inputs by weights and adding a bias. The activation function then transforms this raw number into the neuron's final output.

Sigmoid

Sigmoid converts any number into a value between 0 and 1 — think of it as turning a raw score into a probability. Feed it 0 and it returns 0.5. Feed it a large positive number and it returns close to 1. Feed it a large negative number and it returns close to 0. It was historically popular but suffers from the vanishing gradient problem for very positive or negative inputs.

σ(z) = 1 / (1 + e⁻ᶻ)
Derivative: σ′(z) = σ(z) · (1 − σ(z)) — maximum gradient of 0.25 at z = 0.
Properties
Output range: (0, 1)
Zero-centred: No
Max gradient: 0.25 at z = 0
Differentiable: Everywhere
Problem: For large |z|, the gradient approaches zero — learning becomes extremely slow. This is called the vanishing gradient problem. Think of it like trying to steer a car when the steering wheel barely responds — technically connected, but effectively useless. In deep networks, this prevents gradients from flowing back through many layers to reach the early layers.
Best use: Output layer of binary classification networks, where the output should represent a probability between 0 and 1.
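As a small sketch (plain NumPy, with function names of my own choosing), the formula and its derivative translate directly into code, and evaluating a few points shows where the gradient vanishes:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)): squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z)): peaks at 0.25 when z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (-10.0, 0.0, 10.0):
    print(f"z={z:+.0f}  output={sigmoid(z):.5f}  gradient={sigmoid_grad(z):.5f}")
# At z = +/-10 the gradient is roughly 0.00005: learning has effectively stalled.
```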

Tanh (hyperbolic tangent)

The tanh function is similar to sigmoid but outputs values in the range (−1, 1). Since its output is zero-centred, it often works better than sigmoid in hidden layers.

tanh(z) = (eᶻ − e⁻ᶻ) / (eᶻ + e⁻ᶻ)
Relationship to sigmoid: tanh(z) = 2σ(2z) − 1 — essentially a rescaled, shifted sigmoid.
Tanh vs sigmoid — key differences
Output range: (−1, 1)
Zero-centred: Yes
Max gradient: 1.0 at z = 0
Steeper near origin: Yes — 4× sigmoid
Zero-centred means the output is balanced around zero — some values are positive, some negative. This matters because if a neuron always outputs positive numbers (as sigmoid does), the gradients for all the weights in the next layer share the same sign on each update, forcing an inefficient zig-zag path during training. Tanh avoids this by outputting values between −1 and +1.
Problem: Like sigmoid, tanh still saturates for large |z|, causing vanishing gradients. It is not the answer for very deep networks.
Best use: Sequence models like RNNs and LSTMs (networks that process text, speech, or time-series data one step at a time), where zero-centred activations improve gradient flow across long sequences.
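A similar sketch for tanh (using NumPy's built-in np.tanh), checking the rescaled-sigmoid identity and the larger peak gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tanh_grad(z):
    """d/dz tanh(z) = 1 - tanh(z)^2: peaks at 1.0 when z = 0."""
    return 1.0 - np.tanh(z) ** 2

z = np.linspace(-3.0, 3.0, 13)
# tanh is a rescaled, shifted sigmoid: tanh(z) = 2*sigmoid(2z) - 1
assert np.allclose(np.tanh(z), 2.0 * sigmoid(2.0 * z) - 1.0)

print(tanh_grad(0.0))  # 1.0, four times sigmoid's maximum gradient of 0.25
```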

The ReLU family

ReLU and its variants are the most widely used activation functions in modern deep learning. They are simple, fast, and largely avoid the vanishing gradient problem for positive inputs.

Three members of the family:
ReLU: f(z) = max(0, z)
Leaky ReLU: f(z) = max(0.01z, z)
GELU: f(z) = z · Φ(z)
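All three definitions are one-liners in code. A hedged sketch, using SciPy's scipy.stats.norm.cdf for the Gaussian CDF Φ (function names are mine):

```python
import numpy as np
from scipy.stats import norm  # norm.cdf is the standard Gaussian CDF, Phi

def relu(z):
    """max(0, z): passes positives unchanged, blocks negatives entirely."""
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    """max(alpha*z, z): negatives leak through with a small slope alpha."""
    return np.where(z > 0, z, alpha * z)

def gelu(z):
    """z * Phi(z): gates each input by how large it is under a standard normal."""
    return z * norm.cdf(z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z), leaky_relu(z), gelu(z), sep="\n")
```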

GELU deep dive

The Gaussian Error Linear Unit is the activation function of choice in modern transformers — GPT, BERT, and their successors all use GELU.

GELU(z) = z · Φ(z), where Φ(z) is the standard Gaussian CDF.
Tanh approximation: GELU(z) ≈ 0.5z(1 + tanh[√(2/π)(z + 0.044715z³)])
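To see how closely the tanh approximation tracks the exact definition, a small comparison sketch (exact Φ again via scipy.stats.norm.cdf):

```python
import numpy as np
from scipy.stats import norm

def gelu_exact(z):
    """GELU(z) = z * Phi(z), with the exact Gaussian CDF."""
    return z * norm.cdf(z)

def gelu_tanh(z):
    """The tanh-based approximation quoted above."""
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

z = np.linspace(-4.0, 4.0, 801)
print(np.max(np.abs(gelu_exact(z) - gelu_tanh(z))))  # small: well below 0.01
```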
Why GELU beats ReLU in transformers
Smooth gating, not hard cut-off

ReLU makes a hard decision at zero — below zero, the signal is completely blocked. GELU smoothly gates inputs by their magnitude: larger inputs are more likely to be passed through, while smaller or negative inputs are gradually suppressed. Having a smooth gradient everywhere improves training stability.

Stochastic regularization

GELU was motivated as the expected value of a stochastic regularizer: imagine multiplying each input by a random gate that passes the signal with probability Φ(z), so larger inputs are more likely to get through. Taking the expectation of that random gating gives the deterministic GELU, which behaves like a smooth, input-dependent form of dropout built into the activation itself.

No dead neurons

A ReLU neuron whose pre-activation stays negative for every training example receives zero gradient and can never recover (the "dying ReLU" problem). GELU's gradient is non-zero for any finite input, though it shrinks for very negative z, so a gated-off neuron can still receive some learning signal.
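A small sketch of that gradient comparison (derivatives written out by hand, SciPy's norm.cdf and norm.pdf standing in for Φ and φ): ReLU's derivative is exactly zero for negative inputs, while GELU's derivative Φ(z) + z·φ(z) stays non-zero for finite z.

```python
import numpy as np
from scipy.stats import norm

def relu_grad(z):
    """ReLU'(z): 1 for positive inputs, 0 for negative inputs."""
    return (z > 0).astype(float)

def gelu_grad(z):
    """d/dz [z * Phi(z)] = Phi(z) + z * phi(z), non-zero for any finite z."""
    return norm.cdf(z) + z * norm.pdf(z)

z = np.array([-3.0, -1.0, -0.1])
print(relu_grad(z))   # [0. 0. 0.]  no learning signal at all
print(gelu_grad(z))   # small but non-zero, e.g. about -0.083 at z = -1
```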

Interactive: ReLU vs GELU at the boundary

Drag the slider near zero to see how differently ReLU and GELU behave in the critical region where hard vs soft gating matters most.

Example at z = −0.50: ReLU outputs 0.0000 (hard block), while GELU outputs −0.1543 (smooth suppression).

Activation function summary

A quick reference covering all five functions — range, zero-centring, gradient issues, and where each is used today.

Function    | Range        | Zero-centred | Gradient issues | Modern use
Sigmoid     | (0, 1)       | No           | Vanishing       | Output layer (binary)
Tanh        | (−1, 1)      | Yes          | Vanishing       | RNNs, LSTMs
ReLU        | [0, ∞)       | No           | Dead neurons    | CNNs, general use
Leaky ReLU  | (−∞, ∞)      | No           | Minimal         | GANs, deep networks
GELU        | ≈(−0.17, ∞)  | No           | Minimal         | Transformers (GPT, BERT)
There is no single best activation function — the right choice depends on the architecture and task. ReLU remains the default for hidden layers in CNNs. GELU is the default for transformers. Sigmoid and tanh remain relevant in specific output and recurrent contexts.