Artificial Intelligence 21 min read

Demystifying Recurrent Neural Networks: Theory, Training, and Implementation

This article explains the fundamentals of recurrent neural networks (RNNs), their role in language modeling, various RNN architectures such as bidirectional and deep RNNs, the back‑propagation through time (BPTT) training algorithm, gradient challenges, vectorization techniques, and provides a step‑by‑step code implementation.

dbaplus Community

Nov 10, 2016

Demystifying Recurrent Neural Networks: Theory, Training, and Implementation

Introduction

Previous articles covered fully connected and convolutional neural networks, which process each input independently. Many tasks, however, require handling sequential data where earlier inputs influence later ones, such as natural language understanding and video analysis. Recurrent Neural Networks (RNNs) address this need.

RNNs and Language Modeling

RNNs were first applied to natural language processing as language models. A language model predicts the next word given a preceding word sequence. For example, given "I was late for school, the teacher criticized ___," the model should predict "me" rather than unrelated words.

Before RNNs, n‑gram models (e.g., 2‑gram, 3‑gram) approximated word probabilities based on a fixed number of preceding words, which fails for long‑range dependencies because the model ignores most of the context.

Basic Recurrent Neural Network

A simple RNN consists of an input layer, a hidden layer, and an output layer. The hidden state s at time t depends on the current input x_t and the previous hidden state s_{t-1}. The weight matrix W connects the previous hidden state to the current one, while U connects the input to the hidden layer and V connects the hidden layer to the output.

Unfolding the network over time yields the following equations (illustrated in the images below):

Equation 1 computes the output layer, and Equation 2 computes the hidden layer. The key difference from a fully connected network is the additional recurrent weight matrix W.

Bidirectional RNN

Standard RNNs cannot use future context. A bidirectional RNN processes the sequence forward and backward, producing two hidden states ( A and A') whose concatenation yields the final output. This architecture improves tasks like filling a missing word where both left and right context matter.

Deep RNN

Stacking multiple recurrent layers creates a deep RNN, enabling the network to learn hierarchical temporal features.

Training RNNs: BPTT

The Back‑Propagation Through Time (BPTT) algorithm extends standard back‑propagation to recurrent connections. It consists of three steps:

Forward pass: compute the output of each neuron at every time step.

Backward pass: compute the error term for each neuron (the derivative of the loss with respect to the weighted input).

Gradient computation: calculate gradients for all weight matrices ( U, W, V).

Weight updates are performed with stochastic gradient descent.

Gradient Explosion and Vanishing

RNNs suffer from exploding or vanishing gradients when processing long sequences. Exploding gradients can be mitigated by gradient clipping, while vanishing gradients are harder to detect. Common remedies include careful weight initialization, using ReLU instead of sigmoid/tanh, or switching to gated architectures such as LSTM or GRU.

Vectorizing Words for Language Models

Words are converted to high‑dimensional one‑hot vectors based on a vocabulary dictionary. Because one‑hot vectors are sparse and large, dimensionality reduction (e.g., embeddings) is often applied, though this article does not cover that step.

Softmax Output Layer

The softmax function converts the network’s raw output vector into a probability distribution over the vocabulary.

For an input vector x=[1,2,3,4], softmax produces y=[0.03,0.09,0.24,0.64], which can be interpreted as the probabilities of each word being the next token.

Training Procedure for the Language Model

Training uses supervised learning with cross‑entropy loss. The steps are:

Create input‑label pairs from the corpus (e.g., "我昨天上学迟到了" → next word).

Vectorize both inputs and labels into one‑hot vectors.

Feed the vectors into the RNN and compute the softmax output.

Calculate cross‑entropy loss and back‑propagate using BPTT.

Update weights with stochastic gradient descent.

Implementation Details

The article provides a Python‑style implementation of an RecurrentLayer class. The class stores two weight matrices U (input‑to‑hidden) and W (hidden‑to‑hidden). The forward method computes the hidden state at each time step, while the backward method implements BPTT. Gradient descent is performed in an update method, and a reset_state method clears the hidden state between sequences.

Gradient checking code verifies that the analytical gradients match numerical approximations.

Conclusion

The tutorial covered basic RNN architecture, BPTT training, and its application to language modeling, as well as common issues like gradient explosion/vanishing. While basic RNNs have limitations, their gated variants (LSTM, GRU) handle long‑range dependencies more effectively and will be explored in future articles.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Deep Learning Language Model RNN Recurrent Neural Network gradient vanishing Softmax BPTT

Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.