Mastering Activation Functions: From Sigmoid to Swish and When to Use Them

This article explains the role of activation functions in neural networks, compares five classic functions with formulas, performance trade‑offs, and gradient behavior, and provides a Python visualization demo plus several practical insights and real‑world examples.

Qborfy AI

Activation functions act as the "intelligent switches" of neural networks: each neuron computes the linear combination z = wx + b, and the activation function maps z to a non‑linear output.

What are activation functions?

They introduce non‑linearity so a network can approximate arbitrarily complex functions (otherwise a deep stack of layers collapses to a single linear model).

Mathematical form: output = f(Σᵢ wᵢxᵢ + b), where the wᵢ are the weights, the xᵢ the inputs, and b the bias.

Filter features (e.g., ReLU discards negative values) and regulate gradient magnitude to avoid vanishing or exploding updates.

Like a biological neuron that fires only when its membrane potential exceeds a threshold, an activation function decides whether a signal propagates to the next layer.
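
To make the non‑linearity point concrete, here is a minimal NumPy sketch (the weights and inputs are made‑up illustration values, not from any real model) showing that two stacked linear layers collapse into a single linear map, while inserting ReLU between them does not.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))                      # 4 samples, 3 features (illustrative values)
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

# Two linear layers with no activation...
no_act = (x @ W1 + b1) @ W2 + b2
# ...are exactly one linear layer with combined weights W1 @ W2 and bias b1 @ W2 + b2.
combined = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(no_act, combined))             # True: the stack collapsed to one linear model

# With ReLU in between, no single linear layer reproduces the output.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(no_act, with_relu))            # False: the non-linearity changes the function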

Classic activation functions

Sigmoid : f(x) = 1/(1+exp(-x)). Output range (0, 1). Common in binary‑classification output layers. Drawback: severe gradient vanishing in saturated regions.

Tanh : f(x) = (exp(x)-exp(-x))/(exp(x)+exp(-x)). Output range (‑1, 1). Frequently used in RNN/LSTM hidden layers. Still suffers from gradient vanishing.

ReLU : f(x) = max(0, x). Output range [0, ∞). Default for CNN and Transformer hidden layers. Eliminates vanishing gradients but can produce "dead" neurons that never activate.

Leaky ReLU : f(x) = max(0.01·x, x). Output range (−∞, ∞). Keeps a small gradient for negative inputs, mitigating dead ReLU at the cost of a hyper‑parameter (leak factor) that can be sensitive.

Swish : f(x) = x·σ(βx), where σ is the sigmoid and β is a constant or learnable parameter. Proposed by Google Brain; outperforms ReLU in models such as MobileNetV3 (which uses the hard‑swish variant) at a modest increase in compute cost.
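
As a small illustration of the Swish formula, the sketch below (plain NumPy, with β chosen by hand rather than learned) shows how β interpolates between a roughly linear function and a ReLU‑like shape.

import numpy as np

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x); beta = 1 is also known as SiLU
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
for beta in (0.1, 1.0, 10.0):
    print(beta, np.round(swish(x, beta), 3))
# Small beta pushes Swish toward a scaled linear function (~x/2);
# large beta approaches ReLU's max(0, x).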

Performance comparison

Sigmoid – gradient vanishing: severe; compute efficiency: ★★☆; zero‑centered output: no; SOTA accuracy ≈ 60 %; main issue: vanishing gradients.

Tanh – gradient vanishing: moderate; compute efficiency: ★★☆; zero‑centered output: yes; SOTA accuracy ≈ 75 %; main issue: vanishing gradients.

ReLU – gradient vanishing: none; compute efficiency: ★★★★★; zero‑centered output: no; SOTA accuracy ≈ 90 %; main issue: dead neurons (dead ReLU).

Leaky ReLU – gradient vanishing: none; compute efficiency: ★★★★☆; zero‑centered output: no; SOTA accuracy ≈ 92 %; main issue: sensitivity to the leak factor.

Swish – gradient vanishing: none; compute efficiency: ★★★☆; zero‑centered output: no; SOTA accuracy ≈ 95 %; main issue: slightly higher computational cost.
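
The compute‑efficiency column can be checked informally with a quick micro‑benchmark. The sketch below times plain NumPy implementations on a large array; absolute numbers depend entirely on hardware and library versions, so treat the relative ranking, not the values, as the takeaway.

import numpy as np
import timeit

x = np.random.randn(1_000_000)

candidates = {
    'Sigmoid': lambda x: 1 / (1 + np.exp(-x)),
    'Tanh': np.tanh,
    'ReLU': lambda x: np.maximum(0, x),
    'Leaky ReLU': lambda x: np.where(x > 0, x, 0.01 * x),
    'Swish': lambda x: x / (1 + np.exp(-x)),
}

for name, f in candidates.items():
    # Average wall-clock time per call over 50 repetitions
    t = timeit.timeit(lambda: f(x), number=50) / 50
    print(f'{name:>10s}: {t * 1e3:.2f} ms per call')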

Understanding gradients

Gradient = direction and magnitude of parameter updates during back‑propagation.

When gradients become very small (e.g., in the saturation zones of Sigmoid or Tanh), learning slows dramatically or stalls, preventing convergence.
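
A quick numerical check makes this visible. The minimal sketch below uses the well‑known derivative σ′(x) = σ(x)(1 − σ(x)) to show how small the sigmoid's gradient becomes in its saturation zone, and why ReLU (derivative 1 for positive inputs) avoids the problem.

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative of the sigmoid: sigma(x) * (1 - sigma(x)), at most 0.25
    s = sigmoid(x)
    return s * (1 - s)

for x in (0.0, 2.0, 5.0, 10.0):
    print(f'x = {x:5.1f}  sigmoid gradient = {sigmoid_grad(x):.6f}')
# At x = 0 the gradient is 0.25; at x = 10 it is ~0.000045.
# Chained through 10 saturated layers the update shrinks roughly like 0.25**10 ~ 1e-6,
# whereas ReLU's derivative is exactly 1 for any positive input, so the signal survives.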

Hands‑on experiment

# Activation function visualization tool
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(-5, 5, 100)

# The article's five classic activations (Swish shown here with beta = 1)
functions = {
    'Sigmoid': lambda x: 1 / (1 + np.exp(-x)),
    'Tanh': np.tanh,
    'ReLU': lambda x: np.maximum(0, x),
    'Leaky ReLU': lambda x: np.where(x > 0, x, 0.01 * x),
    'Swish': lambda x: x / (1 + np.exp(-x)),
}

plt.figure(figsize=(10, 6))
for name, func in functions.items():
    plt.plot(x, func(x), label=name, lw=3)
plt.title('Classic activation functions')
plt.xlabel('x')
plt.ylabel('f(x)')
plt.grid(alpha=0.3)
plt.legend()
plt.show()
Observation focus: the flat tails (saturation zones) of Sigmoid and Tanh reveal the root cause of gradient vanishing, while ReLU's hard zero for all negative inputs shows how neurons can end up permanently inactive (dead ReLU).

Additional observations

Neuron activation‑rate experiment: Sigmoid activates only 3‑5 % of neurons, whereas ReLU activates roughly 50 %, leading to more efficient use of network capacity (a rough way to measure this is sketched after this list).

Biochemical inspiration: Swish’s smooth curve is modeled after ion‑channel dynamics in biological synapses.

Automated search: Google used reinforcement learning over 100 k candidate functions to discover Swish, surpassing manually designed functions.

Large‑scale scientific use: CERN employs the GELU (Gaussian Error Linear Unit) activation in particle‑collision analysis pipelines, reportedly reducing error by 38 %.
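
A back‑of‑the‑envelope version of that activation‑rate experiment can be run in a few lines. This is only a sketch with made‑up Gaussian pre‑activations, not the original study's setup: for ReLU, "active" simply means a non‑zero output, while for Sigmoid any notion of "active" depends on an arbitrary threshold.

import numpy as np

rng = np.random.default_rng(42)
pre_activations = rng.normal(size=100_000)   # hypothetical zero-mean pre-activations

relu_out = np.maximum(0, pre_activations)
sigmoid_out = 1 / (1 + np.exp(-pre_activations))

# ReLU: a neuron is "active" when its output is non-zero (~50% for zero-mean inputs)
print('ReLU active rate:  ', np.mean(relu_out > 0))

# Sigmoid: every output is non-zero, so "active" needs a threshold (0.9 chosen arbitrarily)
print('Sigmoid above 0.9: ', np.mean(sigmoid_out > 0.9))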

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Tags: deep learning, neural networks, activation functions, ReLU, Swish
Written by Qborfy AI

A knowledge base that logs daily experiences and learning journeys, sharing them with you to grow together.