How Activation Functions Work in Deep Learning
This article explains the role of activation functions in deep learning, covering their definition, why they are needed, the main categories—including linear, binary step, and various non‑linear functions such as Sigmoid, TanH, ReLU, Leaky ReLU, ELU, Softmax and Swish—along with each function's mathematical form, advantages, disadvantages, and practical usage recommendations.
Definition of Activation Function
In an artificial neural network, each neuron forms a weighted sum of its inputs plus a bias; the resulting scalar is passed through an activation function, which produces the neuron's output.
Purpose of Activation Functions
Activation functions decide whether a neuron should be activated and introduce non‑linearity, enabling the network to learn complex mappings from inputs to outputs.
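As a minimal sketch (the function name, weights, and example numbers below are illustrative, not from the original), a single neuron's forward pass is just "weighted sum, then activation":

```python
import numpy as np

def neuron_forward(inputs, weights, bias, activation):
    # 1. Weighted sum of the inputs plus a bias -> a single scalar.
    z = np.dot(weights, inputs) + bias
    # 2. The activation function decides the neuron's output.
    return activation(z)

relu = lambda z: max(0.0, z)
print(neuron_forward(np.array([0.5, -1.0, 2.0]),
                     np.array([0.8, 0.3, -0.5]),
                     bias=0.1,
                     activation=relu))   # weighted sum is -0.8, ReLU outputs 0.0
```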
Types of Activation Functions
Linear activation function
Binary step activation function
Non‑linear activation functions
Linear (Identity) Activation Function
The linear function is proportional to its input and ranges from –∞ to ∞. It simply returns the weighted sum unchanged.
Mathematical expression: f(x) = x (or, more generally, f(x) = a·x for a constant slope a).
Pros and cons:
Produces a range of values rather than a binary output; several such neurons can be connected and the maximum (or softmax) of their outputs used for a decision.
The derivative is a constant, so the gradient conveys no information about the input during back‑propagation.
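A minimal NumPy sketch (with the slope a as an illustrative parameter) showing the linear activation and its constant derivative:

```python
import numpy as np

def linear(x, a=1.0):
    # Identity/linear activation: output is proportional to the input.
    return a * x

def linear_grad(x, a=1.0):
    # The derivative is the constant a, regardless of the input value.
    return np.full_like(x, a, dtype=float)

x = np.array([-2.0, 0.0, 3.0])
print(linear(x))       # [-2.  0.  3.]
print(linear_grad(x))  # [1. 1. 1.]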
Binary Step Activation Function
The threshold determines whether the neuron fires. The function compares the input with the threshold; if the input exceeds the threshold the neuron activates, otherwise it remains inactive.
Mathematical expression: f(x) = 1 if x ≥ θ (the threshold), otherwise 0; the common form uses θ = 0.
Pros and cons:
Cannot produce multiple output values; unsuitable for multi‑class classification.
Its gradient is zero everywhere (and undefined at the threshold), which makes gradient‑based back‑propagation impossible.
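A minimal sketch of the binary step function (threshold value and inputs are illustrative):

```python
import numpy as np

def binary_step(x, threshold=0.0):
    # Fires (outputs 1) when the input meets or exceeds the threshold, else 0.
    return np.where(x >= threshold, 1.0, 0.0)

x = np.array([-1.5, 0.0, 0.7])
print(binary_step(x))  # [0. 1. 1.]
# The gradient is 0 everywhere (undefined at the threshold),
# which is why back-propagation cannot train this function.
```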
Non‑linear Activation Functions
Non‑linear functions are the most commonly used. They allow stacked layers to represent arbitrary functions as compositions of non‑linear transformations.
Sigmoid
Maps a numeric input to a value between 0 and 1. It is continuously differentiable, monotonic, and bounded.
Primarily used for binary classification, providing the probability of a specific class.
Mathematical expression: f(x) = 1 / (1 + e^(-x)).
Pros and cons:
Non‑linear and smooth, providing useful gradients for classification.
Output confined to (0, 1), defining a clear range.
Can cause gradient vanishing due to saturation regions.
Output is not zero‑centered, leading to inefficient gradient updates.
Training can be slow or stall.
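A minimal NumPy sketch of Sigmoid and its derivative, illustrating the saturation regions mentioned above (example inputs are illustrative):

```python
import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # Derivative: sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0.
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, 0.0, 10.0])
print(sigmoid(x))       # [~0.00005  0.5  ~0.99995]
print(sigmoid_grad(x))  # tiny gradients at both saturated ends, 0.25 at x = 0
```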
TanH (Hyperbolic Tangent)
Compresses real values to the range [‑1, 1]. Its output is zero‑centered, allowing negative inputs to map to negative outputs.
Mathematical expression: f(x) = (e^x - e^(-x)) / (e^x + e^(-x)).
Pros and cons:
Suffers from gradient vanishing, but its gradient is steeper than Sigmoid’s.
Zero‑centered output reduces bias in gradient direction.
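A minimal NumPy sketch of TanH and its derivative (equivalent to np.tanh; the explicit form is written out to match the expression above):

```python
import numpy as np

def tanh(x):
    # Zero-centered squashing into [-1, 1].
    return (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))

def tanh_grad(x):
    # Derivative: 1 - tanh(x)^2; maximum is 1 at x = 0 (steeper than Sigmoid's 0.25).
    return 1.0 - np.tanh(x) ** 2

x = np.array([-2.0, 0.0, 2.0])
print(tanh(x))       # [-0.964  0.     0.964]
print(tanh_grad(x))  # [ 0.071  1.     0.071]
```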
ReLU (Rectified Linear Unit)
Most widely used. For positive inputs the slope is exactly 1, so it does not saturate in the positive direction and mitigates gradient vanishing; for negative inputs the output and gradient are zero. Output range is [0, ∞).
Mathematical expression: f(x) = max(0, x).
Pros and cons:
Computationally cheap (a simple comparison with zero), and only a subset of neurons are active at any time, giving sparse activations.
Linear, non‑saturating nature speeds up convergence of gradient descent.
Intended for hidden layers only; it is not suitable as an output activation.
Some gradients can become fragile during training.
For inputs x < 0 the gradient is zero, so the affected neurons stop learning (the “Dying ReLU” problem).
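A minimal NumPy sketch of ReLU and its gradient, making the zero-gradient region visible (example inputs are illustrative):

```python
import numpy as np

def relu(x):
    # max(0, x): passes positive inputs through, zeroes out negatives.
    return np.maximum(0.0, x)

def relu_grad(x):
    # Gradient is 1 for x > 0 and 0 for x <= 0 (the "Dying ReLU" region).
    return np.where(x > 0, 1.0, 0.0)

x = np.array([-3.0, 0.0, 2.5])
print(relu(x))       # [0.  0.  2.5]
print(relu_grad(x))  # [0.  0.  1. ]
```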
Leaky ReLU
Adds a small positive slope in the negative region to address the Dying ReLU issue.
Mathematical expression: f(x) = x if x > 0, otherwise α·x, where α is a small constant such as 0.01.
Pros and cons:
Retains ReLU’s advantages while allowing back‑propagation for negative inputs.
Negative inputs produce non‑zero gradients, preventing dead neurons.
Predictions for negative inputs may be less consistent, because the small negative slope is a fixed value rather than something learned from the data.
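A minimal NumPy sketch of Leaky ReLU with the commonly used α = 0.01 (the default value here is illustrative):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    # Small positive slope alpha in the negative region avoids dead neurons.
    return np.where(x > 0, x, alpha * x)

def leaky_relu_grad(x, alpha=0.01):
    # Gradient is 1 for positive inputs and alpha (non-zero) otherwise.
    return np.where(x > 0, 1.0, alpha)

x = np.array([-3.0, 0.0, 2.5])
print(leaky_relu(x))       # [-0.03  0.    2.5 ]
print(leaky_relu_grad(x))  # [0.01  0.01  1.  ]
```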
ELU (Exponential Linear Unit)
Introduces an α parameter multiplied with an exponential term for negative inputs, also mitigating the Dying ReLU problem.
Mathematical expression: f(x) = x if x > 0, otherwise α·(e^x - 1).
Pros and cons:
Can produce negative outputs, unlike ReLU.
Exponential computation adds slight overhead.
The α value is a fixed hyperparameter rather than being learned, and there remains some risk of exploding gradients.
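A minimal NumPy sketch of ELU with α = 1.0 (a common but illustrative default), showing the smooth negative branch and its non-zero gradient:

```python
import numpy as np

def elu(x, alpha=1.0):
    # Identity for x > 0; alpha * (exp(x) - 1) otherwise, giving smooth negative outputs.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def elu_grad(x, alpha=1.0):
    # Gradient is 1 for x > 0 and alpha * exp(x) otherwise (never exactly zero).
    return np.where(x > 0, 1.0, alpha * np.exp(x))

x = np.array([-3.0, 0.0, 2.5])
print(elu(x))       # [-0.95  0.    2.5 ]
print(elu_grad(x))  # [ 0.05  1.    1.  ]
```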
Softmax
Converts raw scores into relative probabilities for each class; standard for the final layer in multi‑class classification.
Mathematical expression: softmax(x_i) = e^(x_i) / Σ_j e^(x_j), computed over all class scores x_j.
Pros and cons:
Produces a probability distribution that mimics a one‑hot encoded label more usefully than raw scores do.
Preserves the relative magnitudes of all the scores, information that would be lost if only the largest value were kept.
Standard for multi‑class (single‑label) classification; for multi‑label problems, per‑class Sigmoid outputs are usually used instead.
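A minimal NumPy sketch of Softmax; the max-subtraction is a standard numerical-stability trick and the logits are illustrative:

```python
import numpy as np

def softmax(scores):
    # Subtracting the max does not change the result but avoids overflow in exp().
    shifted = scores - np.max(scores)
    exp_scores = np.exp(shifted)
    return exp_scores / np.sum(exp_scores)

logits = np.array([2.0, 1.0, 0.1])   # raw class scores from the final layer
probs = softmax(logits)
print(probs)        # [0.659 0.242 0.099]  -- relative class probabilities
print(probs.sum())  # 1.0
```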
Swish
Self‑gated activation proposed by Google researchers; instead of zeroing all negative inputs it lets small negative values pass through. Its smooth, non‑monotonic shape helps very deep networks (roughly 40+ layers) learn more effectively.
Mathematical expression: f(x) = x · sigmoid(x) = x / (1 + e^(-x)).
Pros and cons:
Provides a smooth transition around zero, unlike ReLU’s abrupt change.
Retains negative values that can be informative for pattern detection.
Non‑monotonicity enhances learning of input‑weight relationships.
Computational cost is slightly higher, potentially slowing training.
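A minimal NumPy sketch of Swish; the β parameter shown here generalizes the formula above (β = 1 gives the standard Swish, also known as SiLU), and the inputs are illustrative:

```python
import numpy as np

def swish(x, beta=1.0):
    # Self-gated: x * sigmoid(beta * x).
    return x / (1.0 + np.exp(-beta * x))

x = np.array([-3.0, -0.5, 0.0, 2.0])
print(swish(x))
# [-0.142 -0.189  0.     1.762]
# Small negative inputs produce small negative outputs instead of being zeroed,
# and the curve transitions smoothly (and non-monotonically) around zero.
```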
Practical Considerations
Gradient vanishing is common during training, especially with functions that saturate into a small output range (e.g., (0, 1) for Sigmoid): in the saturated regions the derivatives are tiny, and multiplying them layer by layer during back‑propagation shrinks the overall gradient toward zero, so such functions are generally suitable only for shallow networks.
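A toy illustration of this effect (assuming, for simplicity, one Sigmoid derivative factor per layer and ignoring the weight terms):

```python
import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)

# The Sigmoid derivative never exceeds 0.25, so multiplying one factor
# per layer (as back-propagation does) shrinks the signal geometrically.
per_layer = sigmoid_grad(0.0)          # 0.25, the best case
for depth in (2, 5, 10, 20):
    print(depth, per_layer ** depth)   # 0.0625, ~0.001, ~9.5e-07, ~9.1e-13
```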
Gradient explosion occurs when gradients accumulate and become excessively large, causing weight updates to overflow (eventually to NaN) and destabilizing training.
Summary
Hidden layers typically share the same activation function; ReLU is recommended for hidden layers.
Sigmoid and TanH should be avoided in hidden layers because of gradient‑vanishing issues.
Swish is advantageous for networks deeper than about 40 layers.
Linear activation is suited for regression problems.
Sigmoid/Logistic is appropriate for binary classification.
Softmax is used for multi‑class classification.
Convolutional Neural Networks (CNN) commonly use ReLU.
Recurrent Neural Networks (RNN) often use TanH or Sigmoid.