Mastering LSTM: Architecture, Forward/Backward Computation, and Implementation

This article provides a comprehensive guide to Long Short-Term Memory networks, covering their motivation, detailed forward and backward equations, gate mechanisms, training algorithm, gradient checking, and a full C++ implementation, while also introducing the simpler GRU variant.

dbaplus Community

Jan 19, 2017

After introducing the limitations of vanilla recurrent neural networks (RNNs) in handling long‑range dependencies, the article presents Long Short‑Term Memory (LSTM) networks as the most popular solution, widely used in speech recognition, image captioning, and natural language processing.

Background and Motivation

RNNs suffer from vanishing gradients: the gradient of the loss with respect to the weight matrix W is a sum of contributions from every time step, and the contributions from early steps shrink toward zero as the error is propagated back through many steps, so training effectively ignores earlier states. Hochreiter and Schmidhuber introduced LSTM to address this by adding a cell state c that can preserve information over long periods.
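The shrinking effect can be illustrated numerically. The following is a minimal sketch in plain Python, assuming a scalar RNN h_t = tanh(w*h_{t-1} + u*x_t); the function name `rnn_state_gradients` and the values of w, u, and the inputs are illustrative choices, not taken from the original article.

```python
import math

def rnn_state_gradients(w, u, xs):
    """Return d h_T / d h_k for k = 0..T in a scalar RNN h_t = tanh(w*h_{t-1} + u*x_t)."""
    hs = [0.0]
    for x in xs:
        hs.append(math.tanh(w * hs[-1] + u * x))
    T = len(xs)
    grads = [0.0] * (T + 1)
    grads[T] = 1.0
    for t in range(T, 0, -1):
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2)
        grads[t - 1] = grads[t] * w * (1.0 - hs[t] ** 2)
    return grads

# Illustrative values (assumptions): a recurrent weight below 1 and 30 time steps.
grads = rnn_state_gradients(w=0.5, u=1.0, xs=[1.0] * 30)
print(f"one step back: {abs(grads[29]):.3e}, thirty steps back: {abs(grads[0]):.3e}")
```

Each backward step multiplies the gradient by w*(1 - h_t^2), so with |w| < 1 the influence of the earliest state decays geometrically with the number of steps.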

LSTM Architecture

The cell state c is controlled by three gates—forget, input, and output—each implemented as a fully‑connected layer followed by a sigmoid activation. The article includes diagrams of the gate structures and the full forward‑pass equations (six formulas).

Forward Computation

For each time step, the gates are computed as:

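Forget gate: f_t = sigmoid(W_f·[h_{t-1}, x_t] + b_f)

Input gate: i_t = sigmoid(W_i·[h_{t-1}, x_t] + b_i)

Candidate cell: \tilde{c}_t = tanh(W_c·[h_{t-1}, x_t] + b_c)

Cell update: c_t = f_t * c_{t-1} + i_t * \tilde{c}_t

Output gate: o_t = sigmoid(W_o·[h_{t-1}, x_t] + b_o)

Hidden state: h_t = o_t * tanh(c_t)

Here [h_{t-1}, x_t] is the concatenation of the previous hidden state and the current input, and * denotes element-wise multiplication. The three gates and the candidate cell share the same structure (an affine map of [h_{t-1}, x_t] followed by an activation), differing only in their parameters and activation function.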
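The six equations above can be sketched as a single time-step function. This is a minimal illustration in plain Python lists (standard library only), not the article's own implementation; the names `lstm_step`, `affine`, and the `params` dictionary layout are hypothetical, and the weight values are arbitrary.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W·v + b for a matrix W (list of rows), bias b, and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def lstm_step(params, h_prev, c_prev, x):
    """One LSTM forward step; returns (h_t, c_t)."""
    z = h_prev + x  # list concatenation: [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], z)]
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], z)]
    c_tilde = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], z)]
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], z)]
    # Cell update and hidden state are element-wise.
    c = [f_k * c_k + i_k * g_k for f_k, c_k, i_k, g_k in zip(f, c_prev, i, c_tilde)]
    h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
    return h, c

# Hypothetical parameters: state width 1, input width 1, so each W is 1x2.
params = {
    "W_f": [[0.1, -0.2]], "b_f": [0.0],
    "W_i": [[0.3, 0.4]], "b_i": [0.0],
    "W_c": [[-0.5, 0.6]], "b_c": [0.0],
    "W_o": [[0.2, 0.1]], "b_o": [0.0],
}
h, c = [0.0], [0.0]
for x in ([1.0], [0.5], [-1.0]):
    h, c = lstm_step(params, h, c, x)
```

Because h_t = o_t * tanh(c_t) with o_t in (0, 1), every component of the hidden state stays strictly between -1 and 1, while the cell state c_t is not bounded in the same way.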

Training (Backward Propagation)

The training uses back‑propagation through time (BPTT). The article derives the gradients for each gate, showing how the error term is propagated backward across time steps (formulas 7‑12). It then explains how to accumulate gradients for weights and biases over all time steps, and provides the final expressions for weight and bias updates.
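To make the backward pass concrete, here is a hedged sketch of BPTT for a scalar LSTM (state width 1, input width 1) in plain Python. It is not the article's derivation or code: the loss is assumed to be simply L = h_T, "g" stands for the candidate cell, and the names `forward`, `backward`, and the parameter layout `p[k] = (w_h, w_x, b)` are hypothetical.

```python
import math

GATES = ("f", "i", "g", "o")  # "g" is the candidate cell state

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(p, xs):
    """Run a scalar LSTM over xs; p[k] = (w_h, w_x, b) for each k in GATES."""
    h, c, cache = 0.0, 0.0, []
    for x in xs:
        pre = {k: p[k][0] * h + p[k][1] * x + p[k][2] for k in GATES}
        f, i, o = sigmoid(pre["f"]), sigmoid(pre["i"]), sigmoid(pre["o"])
        g = math.tanh(pre["g"])
        c_new = f * c + i * g
        h_new = o * math.tanh(c_new)
        cache.append((x, h, c, f, i, g, o, c_new))  # values needed by backward
        h, c = h_new, c_new
    return h, cache

def backward(p, cache, dh_last=1.0):
    """BPTT for the assumed loss L = h_T; returns gradients shaped like p."""
    grads = {k: [0.0, 0.0, 0.0] for k in GATES}
    dh, dc = dh_last, 0.0
    for x, h_prev, c_prev, f, i, g, o, c in reversed(cache):
        tc = math.tanh(c)
        dc += dh * o * (1.0 - tc * tc)  # path through h_t = o_t * tanh(c_t)
        d_pre = {  # gradients w.r.t. each pre-activation
            "o": dh * tc * o * (1.0 - o),
            "f": dc * c_prev * f * (1.0 - f),
            "i": dc * g * i * (1.0 - i),
            "g": dc * i * (1.0 - g * g),
        }
        dh_prev = 0.0
        for k in GATES:
            grads[k][0] += d_pre[k] * h_prev  # accumulate over time steps
            grads[k][1] += d_pre[k] * x
            grads[k][2] += d_pre[k]
            dh_prev += d_pre[k] * p[k][0]
        dh, dc = dh_prev, dc * f  # error passed to step t-1
    return grads

# Hypothetical parameter values and inputs.
p = {"f": (0.5, -0.3, 0.1), "i": (0.2, 0.4, -0.1),
     "g": (-0.6, 0.7, 0.05), "o": (0.3, 0.1, 0.2)}
xs = [0.5, -1.0, 0.8, 0.3]
h_last, cache = forward(p, xs)
grads = backward(p, cache)
```

Two error signals travel backward: dh through the gate weights and dc through the cell state, where it is scaled only by the forget gate f. That second path is what lets the error survive across many steps when f stays close to 1.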

Gradient Checking

To verify the implementation, a gradient‑checking routine resets internal states, computes numerical gradients, and compares them with analytical gradients. Sample results are shown to confirm correctness.
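A gradient check of this kind can be sketched generically. The example below is a minimal illustration in plain Python, assuming a central-difference estimate and a toy loss built from a single forget gate; `numerical_gradient`, `max_gradient_error`, and the constants X and C_PREV are hypothetical names and values, not the article's routine.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def numerical_gradient(loss_fn, params, eps=1e-5):
    """Central-difference estimate of d loss / d params[j] for every j."""
    grads = []
    for j in range(len(params)):
        up, down = list(params), list(params)
        up[j] += eps
        down[j] -= eps
        # loss_fn is a pure function, so each call starts from a fresh state.
        grads.append((loss_fn(up) - loss_fn(down)) / (2.0 * eps))
    return grads

def max_gradient_error(loss_fn, grad_fn, params, eps=1e-5):
    """Largest absolute gap between numerical and analytical gradients."""
    numeric = numerical_gradient(loss_fn, params, eps)
    analytic = grad_fn(params)
    return max(abs(n - a) for n, a in zip(numeric, analytic))

# Toy loss (assumption): L(w, b) = sigmoid(w*X + b) * C_PREV, a lone forget gate.
X, C_PREV = 0.7, 1.5

def loss(params):
    w, b = params
    return sigmoid(w * X + b) * C_PREV

def loss_grad(params):
    w, b = params
    f = sigmoid(w * X + b)
    return [C_PREV * f * (1.0 - f) * X, C_PREV * f * (1.0 - f)]

err = max_gradient_error(loss, loss_grad, [0.3, -0.2])
print(f"max gradient error: {err:.2e}")
```

With a stateful layer, the same comparison only works if the stored hidden and cell states are reset before every loss evaluation; otherwise the perturbed runs start from different states and the numerical estimate is meaningless.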

Implementation Details

The code is organized in a LstmLayer class. Initialization creates matrices for parameters (weights, biases) and placeholders for intermediate results needed during back‑propagation. The forward method implements the equations above, while the backward method computes gate gradients, error propagation, and updates. A separate calc_gate helper reduces code duplication. Gradient descent updates are performed with a simple learning‑rate step, and the implementation includes optional gradient‑checking utilities.
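The overall layout described above might look roughly like the following skeleton, written in plain Python lists for brevity. Only the names `LstmLayer` and `calc_gate` come from the article; the constructor arguments, `reset_state`, `update`, and the attribute names are assumptions made for illustration, and the backward method is omitted to keep the sketch short.

```python
import math
import random

GATE_NAMES = ("f", "i", "c", "o")  # forget, input, candidate cell, output

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

class LstmLayer:
    def __init__(self, input_width, state_width, learning_rate, seed=0):
        rng = random.Random(seed)
        self.state_width = state_width
        self.learning_rate = learning_rate
        cols = state_width + input_width  # each W acts on [h_{t-1}, x_t]
        self.W = {k: [[rng.uniform(-0.1, 0.1) for _ in range(cols)]
                      for _ in range(state_width)] for k in GATE_NAMES}
        self.b = {k: [0.0] * state_width for k in GATE_NAMES}
        self.reset_state()

    def reset_state(self):
        """Clear the stored per-step values (placeholders for back-propagation)."""
        self.h_list = [[0.0] * self.state_width]
        self.c_list = [[0.0] * self.state_width]
        self.gate_list = []

    def calc_gate(self, name, z, activation):
        """Shared helper: activation(W·z + b) for the named gate."""
        return [activation(sum(w * v for w, v in zip(row, z)) + b_k)
                for row, b_k in zip(self.W[name], self.b[name])]

    def forward(self, x):
        z = self.h_list[-1] + list(x)
        f = self.calc_gate("f", z, sigmoid)
        i = self.calc_gate("i", z, sigmoid)
        g = self.calc_gate("c", z, math.tanh)
        o = self.calc_gate("o", z, sigmoid)
        c_prev = self.c_list[-1]
        c = [f_k * cp + i_k * g_k for f_k, cp, i_k, g_k in zip(f, c_prev, i, g)]
        h = [o_k * math.tanh(c_k) for o_k, c_k in zip(o, c)]
        self.gate_list.append({"f": f, "i": i, "c": g, "o": o})
        self.c_list.append(c)
        self.h_list.append(h)
        return h

    def update(self, W_grad, b_grad):
        """Plain gradient-descent step; gradients are shaped like W and b."""
        for k in GATE_NAMES:
            for r, row in enumerate(self.W[k]):
                for col in range(len(row)):
                    row[col] -= self.learning_rate * W_grad[k][r][col]
                self.b[k][r] -= self.learning_rate * b_grad[k][r]

layer = LstmLayer(input_width=2, state_width=3, learning_rate=0.1)
for x in ([1.0, -1.0], [0.5, 0.2]):
    h = layer.forward(x)
```

Keeping every step's gate activations and states in lists is what allows a backward pass to revisit them in reverse order, at the cost of memory proportional to the sequence length.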

GRU Variant

The article briefly introduces the Gated Recurrent Unit (GRU) as a simpler alternative to LSTM, replacing three gates with an update gate z_t and a reset gate r_t, and merging the cell and hidden states into a single state h. Forward equations and a schematic diagram are provided, but the training derivation is omitted.
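For comparison with the LSTM step, a GRU forward step can be sketched as follows in plain Python. This assumes one common formulation, h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t with the reset gate applied to h_{t-1} before the candidate is computed; some references swap the roles of z_t and 1 - z_t or omit the biases. The names `gru_step`, `affine`, and the `params` keys are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, b, v):
    """Compute W·v + b for a matrix W (list of rows), bias b, and vector v."""
    return [sum(w * x for w, x in zip(row, v)) + b_k for row, b_k in zip(W, b)]

def gru_step(params, h_prev, x):
    """One GRU forward step under the convention stated in the lead-in."""
    concat = h_prev + x  # [h_{t-1}, x_t]
    z = [sigmoid(a) for a in affine(params["W_z"], params["b_z"], concat)]  # update gate
    r = [sigmoid(a) for a in affine(params["W_r"], params["b_r"], concat)]  # reset gate
    # The reset gate scales the previous state before the candidate is computed.
    gated = [r_k * h_k for r_k, h_k in zip(r, h_prev)] + x
    h_tilde = [math.tanh(a) for a in affine(params["W_h"], params["b_h"], gated)]
    # The update gate interpolates between the old state and the candidate.
    return [(1.0 - z_k) * h_k + z_k * ht_k
            for z_k, h_k, ht_k in zip(z, h_prev, h_tilde)]

# Hypothetical parameters: state width 1, input width 1.
params = {
    "W_z": [[0.2, -0.1]], "b_z": [0.0],
    "W_r": [[0.4, 0.3]], "b_r": [0.0],
    "W_h": [[-0.5, 0.6]], "b_h": [0.0],
}
h = [0.0]
for x in ([1.0], [0.5], [-1.0]):
    h = gru_step(params, h, x)
```

Compared with the LSTM step, there is no separate cell state and no output gate: the single state h plays both roles, which leaves three weight matrices instead of four.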

Conclusion

LSTM, despite its complex structure, can be fully understood and implemented by following the presented formulas and code. After mastering LSTM, readers are encouraged to explore recursive neural networks for tree‑structured data in the next article.

Related topics and further reading links are listed at the end of the original article.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
