How LSTM Achieves Long‑Term Memory: Gates, Activations & Variants Explained
This article explains how LSTM networks overcome RNN limitations by using input, forget, and output gates with sigmoid and tanh activations, describes the core update equations, discusses alternative activation functions and hard‑gate variants, and provides references for deeper study.
Scene Description
Memory‑capable networks are a key research area in deep learning. Traditional RNNs suffer from vanishing and exploding gradients and struggle to retain long‑term dependencies, limiting their practical performance.
LSTM (Long Short‑Term Memory) is the most successful RNN extension; it can store valuable information for long periods and selectively forget irrelevant data, leading to breakthroughs in speech recognition, machine translation, image captioning, named‑entity recognition, and more.
Problem Description
How does LSTM implement long‑term memory? Which activation functions are used in each of its modules, and can alternative functions be employed?
Answer and Analysis
1. How LSTM Implements Long‑Term Memory
Understanding LSTM requires familiarity with its architecture: a chain of repeating cells, each containing a cell state and three gates that regulate it.
Compared with a vanilla RNN, LSTM still computes the hidden state \(h_t\) from the current input \(x_t\) and the previous hidden state \(h_{t-1}\), but it introduces a cell state \(c_t\) and three gated mechanisms: input gate \(i_t\), forget gate \(f_t\), and output gate \(o_t\). These gates regulate the flow of information into, out of, and within the cell.
\(i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)\)
\(f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)\)
\(o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)\)
\(\tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)\)
\(c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t\)
\(h_t = o_t \odot \tanh(c_t)\)
Here, \(\sigma\) denotes the sigmoid function and \(\odot\) element‑wise multiplication. The input gate decides how much new information to write to the cell, the forget gate determines how much of the previous cell state to retain, and the output gate controls how much of the cell state influences the hidden output.
When a trained LSTM processes a sequence without important signals, the forget gate approaches 1 and the input gate approaches 0, preserving past memory. Conversely, when a salient token appears, the input gate rises toward 1 to store it, and the forget gate may drop toward 0 to discard outdated memory, thereby achieving long‑term dependency learning.
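As a concrete sketch, the update equations above can be implemented as a single forward step in NumPy. The parameter layout (per‑gate weight matrices `W`, `U` and biases `b` stored in dicts) and the toy dimensions are illustrative assumptions, not from the source:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM forward step following the gate equations above."""
    W, U, b = params["W"], params["U"], params["b"]  # dicts keyed by gate name
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde  # element-wise cell update
    h_t = o_t * np.tanh(c_t)            # gated hidden output
    return h_t, c_t

# Toy usage with random parameters (input dim 3, hidden dim 4 are arbitrary)
rng = np.random.default_rng(0)
params = {
    "W": {g: rng.normal(size=(4, 3)) * 0.1 for g in "ifoc"},
    "U": {g: rng.normal(size=(4, 4)) * 0.1 for g in "ifoc"},
    "b": {g: np.zeros(4) for g in "ifoc"},
}
h, c = lstm_step(rng.normal(size=3), np.zeros(4), np.zeros(4), params)
```

Because \(o_t \in (0,1)\) and \(\tanh(c_t) \in (-1,1)\), each component of the hidden output is bounded in magnitude by 1, while the cell state itself is unbounded and can accumulate information across many steps.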
2. Activation Functions Used in LSTM Modules
The three gates (input, forget, output) employ the sigmoid activation because its output range (0–1) naturally models a gating mechanism. The candidate cell state \(\tilde{c}_t\) uses the hyperbolic tangent (tanh) activation, yielding values in \([-1, 1]\), which aligns with the typically zero‑centered distribution of hidden features and provides stronger gradients near zero.
Early LSTM variants used modified sigmoid functions such as \(h(x)=2·σ(x)-1\) and \(g(x)=4·σ(x)-2\), with ranges \([-1,1]\) and \([-2,2]\) respectively, and originally omitted a forget gate. Subsequent research demonstrated that adding a forget gate and using standard sigmoid/tanh improves performance, leading to the modern formulation.
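As a sanity check on those ranges, note that the shifted sigmoids are just rescaled tanh curves; in particular \(2\sigma(x)-1 = \tanh(x/2)\) holds identically. A minimal NumPy verification:

```python
import numpy as np

# Early LSTM activations, reconstructed from the formulas given above:
h_act = lambda x: 2.0 / (1.0 + np.exp(-x)) - 1.0  # 2*sigmoid(x) - 1, range (-1, 1)
g_act = lambda x: 4.0 / (1.0 + np.exp(-x)) - 2.0  # 4*sigmoid(x) - 2, range (-2, 2)

x = np.linspace(-10.0, 10.0, 201)
# 2*sigmoid(x) - 1 is algebraically identical to tanh(x/2)
identity_holds = np.allclose(h_act(x), np.tanh(x / 2.0))
```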
In resource‑constrained scenarios (e.g., wearable devices), the exponential cost of sigmoid can be avoided by employing a "hard gate" that outputs binary 0 or 1 based on a threshold, reducing computation while maintaining acceptable accuracy.
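The source does not give a specific hard‑gate formula; a minimal sketch of the idea, replacing the sigmoid's exponential with a single threshold comparison (the threshold value here is an illustrative assumption), might look like:

```python
import numpy as np

def soft_gate(z):
    # Standard sigmoid gate: requires an exponential per element.
    return 1.0 / (1.0 + np.exp(-z))

def hard_gate(z, threshold=0.0):
    # Binary 0/1 gate: a cheap comparison replaces the exponential.
    # `threshold` is a hypothetical parameter, not from the source.
    return (z > threshold).astype(float)

z = np.array([-2.0, -0.1, 0.3, 5.0])
gates = hard_gate(z)  # each pre-activation is either fully closed or fully open
```

The hard gate can be viewed as the limit of a sigmoid with very steep slope; it trades the smooth, differentiable interpolation of the soft gate for lower inference cost.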
Conclusion
Over two decades, LSTM’s core idea has remained consistent while its components have evolved. Understanding the gating mechanisms, activation choices, and possible variants enables practitioners to select or design the most suitable LSTM configuration for their tasks and to perform confidently in technical interviews.
References
Hochreiter, S., & Schmidhuber, J. (1997). Long short‑term memory. Neural Computation, 9(8), 1735‑1780.
Chung, J., et al. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.
Gers, F. A., & Schmidhuber, J. (2000). Recurrent nets that time and count. IJCNN.
Gers, F. A., Schmidhuber, J., & Cummins, F. (1999). Learning to forget: Continual prediction with LSTM. ICANN.