Demystifying Recurrent Neural Networks: Theory, Training, and Implementation
This article explains the fundamentals of recurrent neural networks (RNNs), their role in language modeling, various RNN architectures such as bidirectional and deep RNNs, the back‑propagation through time (BPTT) training algorithm, gradient challenges, vectorization techniques, and provides a step‑by‑step code implementation.
Introduction
Previous articles covered fully connected and convolutional neural networks, which process each input independently. Many tasks, however, require handling sequential data where earlier inputs influence later ones, such as natural language understanding and video analysis. Recurrent Neural Networks (RNNs) address this need.
RNNs and Language Modeling
RNNs were first applied to natural language processing as language models. A language model predicts the next word given a preceding word sequence. For example, given "I was late for school, the teacher criticized ___," the model should predict "me" rather than unrelated words.
Before RNNs, n‑gram models (e.g., 2‑gram, 3‑gram) approximated word probabilities based on a fixed number of preceding words, which fails for long‑range dependencies because the model ignores most of the context.
Basic Recurrent Neural Network
A simple RNN consists of an input layer, a hidden layer, and an output layer. The hidden state s at time t depends on the current input x_t and the previous hidden state s_{t-1}. The weight matrix W connects the previous hidden state to the current one, while U connects the input to the hidden layer and V connects the hidden layer to the output.
Unfolding the network over time yields the following equations (illustrated in the images below):
Equation 1 computes the output layer, and Equation 2 computes the hidden layer. The key difference from a fully connected network is the additional recurrent weight matrix W.
Bidirectional RNN
Standard RNNs cannot use future context. A bidirectional RNN processes the sequence forward and backward, producing two hidden states ( A and A') whose concatenation yields the final output. This architecture improves tasks like filling a missing word where both left and right context matter.
Deep RNN
Stacking multiple recurrent layers creates a deep RNN, enabling the network to learn hierarchical temporal features.
Training RNNs: BPTT
The Back‑Propagation Through Time (BPTT) algorithm extends standard back‑propagation to recurrent connections. It consists of three steps:
Forward pass: compute the output of each neuron at every time step.
Backward pass: compute the error term for each neuron (the derivative of the loss with respect to the weighted input).
Gradient computation: calculate gradients for all weight matrices ( U, W, V).
Weight updates are performed with stochastic gradient descent.
Gradient Explosion and Vanishing
RNNs suffer from exploding or vanishing gradients when processing long sequences. Exploding gradients can be mitigated by gradient clipping, while vanishing gradients are harder to detect. Common remedies include careful weight initialization, using ReLU instead of sigmoid/tanh, or switching to gated architectures such as LSTM or GRU.
Vectorizing Words for Language Models
Words are converted to high‑dimensional one‑hot vectors based on a vocabulary dictionary. Because one‑hot vectors are sparse and large, dimensionality reduction (e.g., embeddings) is often applied, though this article does not cover that step.
Softmax Output Layer
The softmax function converts the network’s raw output vector into a probability distribution over the vocabulary.
For an input vector x=[1,2,3,4], softmax produces y=[0.03,0.09,0.24,0.64], which can be interpreted as the probabilities of each word being the next token.
Training Procedure for the Language Model
Training uses supervised learning with cross‑entropy loss. The steps are:
Create input‑label pairs from the corpus (e.g., "我 昨天 上学 迟到 了" → next word).
Vectorize both inputs and labels into one‑hot vectors.
Feed the vectors into the RNN and compute the softmax output.
Calculate cross‑entropy loss and back‑propagate using BPTT.
Update weights with stochastic gradient descent.
Implementation Details
The article provides a Python‑style implementation of an RecurrentLayer class. The class stores two weight matrices U (input‑to‑hidden) and W (hidden‑to‑hidden). The forward method computes the hidden state at each time step, while the backward method implements BPTT. Gradient descent is performed in an update method, and a reset_state method clears the hidden state between sequences.
Gradient checking code verifies that the analytical gradients match numerical approximations.
Conclusion
The tutorial covered basic RNN architecture, BPTT training, and its application to language modeling, as well as common issues like gradient explosion/vanishing. While basic RNNs have limitations, their gated variants (LSTM, GRU) handle long‑range dependencies more effectively and will be explored in future articles.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
