From RNNs to LSTMs and GRUs: A Hands‑On Guide to Sequence Modeling in PyTorch

This tutorial explains the nature of sequential data, why traditional feed‑forward networks struggle with it, and how recurrent architectures such as RNN, LSTM, and GRU capture temporal dependencies, complete with mathematical foundations, training algorithms, and full PyTorch implementations for sentiment analysis, text generation, and encoder‑decoder models.

AI Cyberspace

What Is Sequence Data?

Sequence data is an ordered list of data points or events where each element has an inherent temporal relationship. Typical examples include natural‑language text, speech, video frames, stock prices, weather readings, and sensor streams. In NLP, tasks like translation and speech recognition rely heavily on processing such sequences.

Why Traditional Feed‑Forward Networks Fail

Feed‑forward neural networks (FFNs) and convolutional neural networks (CNNs) treat their inputs as independent, so they cannot capture the order‑dependent patterns in sequences. Given a sentence, an FFN would process each word in isolation, losing crucial context.

Recurrent Neural Networks (RNN)

RNNs introduce a feedback loop that allows the network to retain information from previous time steps, effectively “remembering the past.” The core characteristics are:

Information Persistence: hidden states store historical information.

Feedback Loop: each neuron receives its previous hidden state together with the current input.

Capturing Dependencies: enables learning of short‑range dependencies, though long‑range learning is limited by gradient issues.

The basic RNN equations are:

h_t = f(W_hh * h_{t-1} + W_xh * x_t + b_h)
y_t = W_hy * h_t + b_y

where x_t is the input at time t, h_t the hidden state, and y_t the output.
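As a sanity check, this update can be reproduced with PyTorch's built‑in `nn.RNNCell`, whose parameters (`weight_ih`, `weight_hh`, and the two bias vectors) correspond directly to W_xh, W_hh, and b_h; the dimensions below are arbitrary:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.RNNCell(input_size=3, hidden_size=4)   # tanh nonlinearity by default

x_t = torch.randn(1, 3)        # current input (batch of 1)
h_prev = torch.zeros(1, 4)     # previous hidden state

# Built-in update: h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
h_builtin = cell(x_t, h_prev)

# The same update written out by hand with the cell's own weights
h_manual = torch.tanh(x_t @ cell.weight_ih.T + cell.bias_ih
                      + h_prev @ cell.weight_hh.T + cell.bias_hh)

print(torch.allclose(h_builtin, h_manual, atol=1e-6))   # True
```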

Training RNNs with BPTT

Training uses Back‑Propagation Through Time (BPTT), which unfolds the network across time steps and applies standard back‑propagation to compute gradients for the shared weight matrices W_xh, W_hh, and W_hy.

PyTorch Example: Single‑Layer RNN for Sentiment Classification

The following code builds a minimal RNN that classifies very short Chinese sentences as positive or negative.

import torch
import torch.nn as nn
import torch.optim as optim

class SimpleRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(SimpleRNN, self).__init__()
        self.W_xh = nn.Parameter(torch.randn(hidden_dim, input_dim))
        self.W_hh = nn.Parameter(torch.randn(hidden_dim, hidden_dim))
        self.W_hy = nn.Parameter(torch.randn(output_dim, hidden_dim))
        self.b_h = nn.Parameter(torch.randn(hidden_dim, 1))
        self.b_y = nn.Parameter(torch.randn(output_dim, 1))
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim

    def forward(self, x_seq):
        h_prev = torch.zeros(self.hidden_dim, 1)
        for t in range(x_seq.shape[0]):
            x_t = x_seq[t].unsqueeze(1)
            h_t = torch.tanh(self.W_hh @ h_prev + self.W_xh @ x_t + self.b_h)
            h_prev = h_t
        y_final = self.W_hy @ h_prev + self.b_y
        y_final = torch.log_softmax(y_final, dim=0)
        return y_final, h_prev

# Vocabulary and data preparation (one‑hot encoding)
vocab = {"电影":0, "好看":1, "饭菜":2, "难吃":3, "天气":4, "糟糕":5, "音乐":6, "好听":7, "剧情":8, "一般":9, "但":10, "演技":11, "好":12, "难看":13, "好吃":14, "心情":15, "分量":16, "少":17}
vocab_size = len(vocab)

train_data = [("电影 好看",1), ("饭菜 难吃",0), ("天气 糟糕",0), ("音乐 好听",1), ("电影 难看",0), ("饭菜 好吃",1)]

def text2onehot(text):
    words = text.split()
    seq = []
    for w in words:
        vec = torch.zeros(vocab_size)
        if w in vocab:
            vec[vocab[w]] = 1.0
        seq.append(vec)
    return torch.stack(seq)

input_dim = vocab_size
hidden_dim = 4
output_dim = 2
model = SimpleRNN(input_dim, hidden_dim, output_dim)
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

epochs = 1000
for epoch in range(epochs):
    total_loss = 0.0
    for text, label in train_data:
        x_seq = text2onehot(text)
        y_true = torch.tensor([label])
        y_pred, _ = model(x_seq)
        y_pred = y_pred.squeeze(1).reshape(1,2)
        loss = criterion(y_pred, y_true)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch+1) % 100 == 0:
        print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_data):.4f}")

def predict(text):
    model.eval()
    with torch.no_grad():
        x_seq = text2onehot(text)
        y_pred, _ = model(x_seq)
        y_pred = y_pred.squeeze(1).reshape(1,2)
        pred = torch.argmax(y_pred).item()
        sentiment = "positive" if pred == 1 else "negative"
        return f"Text: {text} → Predicted sentiment: {sentiment} (label: {pred})"

print("\n===== Sample predictions (very short texts) =====")
print(predict("电影 好看"))
print(predict("饭菜 难吃"))
print(predict("音乐 好听"))

Training quickly reduces loss (e.g., Epoch 1000, Loss: 0.0007) and the model correctly classifies short sentences. Longer or contradictory sentences expose the RNN’s limitation.

Two‑Layer RNN for Text Generation

A deeper RNN can capture richer patterns. The example trains a two‑layer model on classical Chinese poems and generates new verses given a short prefix.

import torch
import torch.nn as nn
import torch.optim as optim

# Prepare character‑level data
text = """床前明月光,疑是地上霜。举头望明月,低头思故乡。..."""
text = text.replace("\n", "").replace(" ", "")
chars = sorted(list(set(text)))
char2idx = {c:i for i,c in enumerate(chars)}
idx2char = {i:c for i,c in enumerate(chars)}
vocab_size = len(chars)
seq_len = 10

def build_data(text, seq_len):
    data = []
    for i in range(len(text)-seq_len):
        input_seq = text[i:i+seq_len]
        target_seq = text[i+1:i+seq_len+1]
        x = torch.tensor([char2idx[c] for c in input_seq], dtype=torch.long)
        y = torch.tensor([char2idx[c] for c in target_seq], dtype=torch.long)
        x_onehot = torch.eye(vocab_size)[x]
        data.append((x_onehot, y))
    return data

train_data = build_data(text, seq_len)

class SimpleTwoLayerRNN(nn.Module):
    def __init__(self, input_dim, hidden_dim1, hidden_dim2, output_dim):
        super().__init__()
        self.W_xh1 = nn.Parameter(torch.randn(hidden_dim1, input_dim) * 0.1)
        self.W_hh1 = nn.Parameter(torch.randn(hidden_dim1, hidden_dim1) * 0.1)
        self.b_h1 = nn.Parameter(torch.zeros(hidden_dim1, 1))
        self.W_h1h2 = nn.Parameter(torch.randn(hidden_dim2, hidden_dim1) * 0.1)
        self.W_hh2 = nn.Parameter(torch.randn(hidden_dim2, hidden_dim2) * 0.1)
        self.b_h2 = nn.Parameter(torch.zeros(hidden_dim2, 1))
        self.W_hy = nn.Parameter(torch.randn(output_dim, hidden_dim2) * 0.1)
        self.b_y = nn.Parameter(torch.zeros(output_dim, 1))

    def forward(self, x_seq):
        h1 = torch.zeros(self.W_hh1.shape[0], 1)
        h2 = torch.zeros(self.W_hh2.shape[0], 1)
        outputs = []
        for t in range(x_seq.shape[0]):
            x_t = x_seq[t].unsqueeze(1)
            h1 = torch.tanh(self.W_hh1 @ h1 + self.W_xh1 @ x_t + self.b_h1)
            h2 = torch.tanh(self.W_hh2 @ h2 + self.W_h1h2 @ h1 + self.b_h2)
            y_t = self.W_hy @ h2 + self.b_y
            y_t = torch.log_softmax(y_t.squeeze(1), dim=0)
            outputs.append(y_t)
        return torch.stack(outputs, dim=0)

hidden_dim1 = 24
hidden_dim2 = 12
model = SimpleTwoLayerRNN(vocab_size, hidden_dim1, hidden_dim2, vocab_size)
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.015)

for epoch in range(500):
    total_loss = 0.0
    for x, y in train_data:
        optimizer.zero_grad()
        preds = model(x)
        loss = criterion(preds, y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    if (epoch+1) % 100 == 0:
        print(f"Epoch {epoch+1} | Avg Loss: {total_loss/len(train_data):.4f}")

def generate_text(prefix, max_len=20):
    model.eval()
    with torch.no_grad():
        gen = list(prefix)
        h1 = torch.zeros(hidden_dim1, 1)
        h2 = torch.zeros(hidden_dim2, 1)
        for c in prefix:
            x_t = torch.eye(vocab_size)[char2idx[c]].unsqueeze(1)
            h1 = torch.tanh(model.W_hh1 @ h1 + model.W_xh1 @ x_t + model.b_h1)
            h2 = torch.tanh(model.W_hh2 @ h2 + model.W_h1h2 @ h1 + model.b_h2)
        while len(gen) < max_len:
            last = gen[-1]
            x_t = torch.eye(vocab_size)[char2idx[last]].unsqueeze(1)
            h1 = torch.tanh(model.W_hh1 @ h1 + model.W_xh1 @ x_t + model.b_h1)
            h2 = torch.tanh(model.W_hh2 @ h2 + model.W_h1h2 @ h1 + model.b_h2)
            y_t = model.W_hy @ h2 + model.b_y
            next_idx = torch.argmax(y_t).item()
            gen.append(idx2char[next_idx])
        return "".join(gen)

print("\n===== Text generation (multiple prefixes) =====")
for p in ["床前", "白日", "春眠", "千山"]:
    print(f"Prefix: {p} → Generated: {generate_text(p, max_len=20)}")

The model learns character‑level patterns and can continue a given prefix with plausible poetic lines.

Input‑Output Structures in Sequence Modeling

Sequence tasks commonly use one of three input‑output structures:

N‑N: equal‑length input and output sequences (e.g., part‑of‑speech tagging, language modeling).

1‑N: a single input producing a sequence output (e.g., image captioning, music generation).

N‑1: a whole input sequence collapsed into a single output (e.g., sequence classification, sentiment analysis).
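With PyTorch's built‑in `nn.RNN`, these structures map onto which tensors you keep: the per‑step outputs, only the final hidden state, or a generation loop. A small shape check (sizes arbitrary):

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16)
x = torch.randn(5, 1, 8)    # a 5-step sequence, batch size 1

out, h_n = rnn(x)
# N-N: use the output at every time step (tagging, language modeling)
print(out.shape)    # torch.Size([5, 1, 16])
# N-1: keep only the final hidden state (classification)
print(h_n.shape)    # torch.Size([1, 1, 16])
# 1-N: start from a single input and feed each output back in
# (generation), as the poem example in this article does
```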

Long‑Sequence Dependency Problems in RNNs

Although RNNs can in principle learn arbitrarily long dependencies, in practice they suffer from vanishing and exploding gradients:

Gradient Vanishing

During back‑propagation, the gradient is repeatedly multiplied by the derivative of tanh (which is less than 1) and by the recurrent weight matrix W_hh, so it shrinks exponentially with distance in time, and information from early time steps stops influencing learning. The ratio first_step_gradient / last_step_gradient often falls below 0.1 for long sequences.
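The shrinkage is easy to observe directly. The sketch below (weight scale and dimensions chosen arbitrarily) runs a plain tanh recurrence and measures the gradient that reaches the first input as the sequence grows:

```python
import torch

torch.manual_seed(0)
hidden_dim = 8
# Small weights keep the recurrent Jacobian's norm well below 1
W_hh = torch.randn(hidden_dim, hidden_dim) * 0.1
W_xh = torch.randn(hidden_dim, hidden_dim) * 0.1

def first_step_grad_norm(seq_len):
    # Run a plain tanh recurrence and measure the gradient reaching x_0
    xs = [torch.zeros(hidden_dim, 1, requires_grad=True) for _ in range(seq_len)]
    h = torch.zeros(hidden_dim, 1)
    for x in xs:
        h = torch.tanh(W_hh @ h + W_xh @ x)
    h.sum().backward()
    return xs[0].grad.norm().item()

for n in [2, 10, 30]:
    print(f"seq_len={n:2d}  grad norm at t=0: {first_step_grad_norm(n):.2e}")
```

The printed norms fall by orders of magnitude as the sequence lengthens, which is exactly the vanishing‑gradient effect described above.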

Gradient Exploding

Conversely, if the largest eigenvalue of W_hh exceeds 1, gradients can grow exponentially and overflow numerically. Gradient clipping is the standard remedy.
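In PyTorch, clipping is a one‑liner with `torch.nn.utils.clip_grad_norm_`, called between `backward()` and `optimizer.step()`. A minimal illustration (the inflated loss is artificial, just to produce large gradients):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.RNN(input_size=5, hidden_size=16)
x = torch.randn(20, 1, 5)            # (seq_len, batch, input_size)

out, _ = model(x)
loss = out.pow(2).sum() * 1e3        # artificially inflated loss -> large gradients
loss.backward()

# Rescale all gradients so their global norm is at most max_norm;
# the function returns the norm measured before clipping
before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
after = torch.sqrt(sum(p.grad.norm() ** 2 for p in model.parameters()))
print(f"grad norm before clipping: {before.item():.2f}")
print(f"grad norm after clipping:  {after.item():.2f}")
```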

LSTM – Long Short‑Term Memory

Introduced in 1997 by Hochreiter and Schmidhuber, the LSTM adds a memory cell C_t and three gates (forget, input, output) that regulate information flow, thereby mitigating vanishing gradients.

Gate Equations

f_t = σ(W_f * [h_{t-1}, x_t] + b_f)   # Forget gate
i_t = σ(W_i * [h_{t-1}, x_t] + b_i)   # Input gate
\tilde{C}_t = tanh(W_c * [h_{t-1}, x_t] + b_c)   # Candidate cell
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t   # Cell update
o_t = σ(W_o * [h_{t-1}, x_t] + b_o)   # Output gate
h_t = o_t * tanh(C_t)                 # Hidden state

These mechanisms allow gradients to flow unchanged through the cell state when the forget gate is near 1.

PyTorch Implementation of a Single‑Layer LSTM

class SimpleLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.W_xi = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
        self.W_hi = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
        self.b_i = nn.Parameter(torch.zeros(hidden_dim, 1))
        self.W_xf = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
        self.W_hf = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
        self.b_f = nn.Parameter(torch.zeros(hidden_dim, 1))
        self.W_xo = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
        self.W_ho = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
        self.b_o = nn.Parameter(torch.zeros(hidden_dim, 1))
        self.W_xc = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
        self.W_hc = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
        self.b_c = nn.Parameter(torch.zeros(hidden_dim, 1))
        self.W_hy = nn.Parameter(torch.randn(output_dim, hidden_dim) * 0.1)
        self.b_y = nn.Parameter(torch.zeros(output_dim, 1))
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim

    def forward(self, x_seq):
        h_prev = torch.zeros(self.hidden_dim, 1)
        c_prev = torch.zeros(self.hidden_dim, 1)
        for t in range(x_seq.shape[0]):
            x_t = x_seq[t].unsqueeze(1)
            i_t = torch.sigmoid(self.W_xi @ x_t + self.W_hi @ h_prev + self.b_i)
            f_t = torch.sigmoid(self.W_xf @ x_t + self.W_hf @ h_prev + self.b_f)
            o_t = torch.sigmoid(self.W_xo @ x_t + self.W_ho @ h_prev + self.b_o)
            c_tilde = torch.tanh(self.W_xc @ x_t + self.W_hc @ h_prev + self.b_c)
            c_t = f_t * c_prev + i_t * c_tilde
            h_t = o_t * torch.tanh(c_t)
            h_prev, c_prev = h_t, c_t
        y_final = self.W_hy @ h_prev + self.b_y
        y_final = torch.log_softmax(y_final, dim=0)
        return y_final, h_prev

The same sentiment‑analysis pipeline as the RNN example can be run with this LSTM class, yielding higher accuracy on longer or contradictory sentences.
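Outside of teaching examples, you would normally reach for the built‑in `nn.LSTM` instead of hand‑written gates. A rough drop‑in equivalent of the classifier above might look like this (the class name and dimensions are illustrative):

```python
import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    # Illustrative replacement for SimpleLSTM built on nn.LSTM
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x_seq):
        # x_seq: (seq_len, input_dim) -> add a batch dimension of 1
        out, (h_n, c_n) = self.lstm(x_seq.unsqueeze(1))
        logits = self.fc(h_n[-1])          # final hidden state of the last layer
        return torch.log_softmax(logits, dim=1)

model = LSTMClassifier(input_dim=18, hidden_dim=4, output_dim=2)
y = model(torch.randn(3, 18))              # a 3-step sequence
print(y.shape)   # torch.Size([1, 2])
```

The built‑in module also handles batching, multiple layers, and cuDNN acceleration for free.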

GRU – Gated Recurrent Unit

Proposed in 2014, the GRU merges the forget and input gates into a single update gate and folds the cell state into the hidden state, resulting in fewer parameters.

Gate Equations

z_t = σ(W_z * [h_{t-1}, x_t] + b_z)   # Update gate
r_t = σ(W_r * [h_{t-1}, x_t] + b_r)   # Reset gate
\tilde{h}_t = tanh(W_h * [r_t * h_{t-1}, x_t] + b_h)
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t

PyTorch Implementation of a Single‑Layer GRU

class SimpleGRU(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.W_xz = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
        self.W_hz = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
        self.b_z = nn.Parameter(torch.zeros(hidden_dim, 1))
        self.W_xr = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
        self.W_hr = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
        self.b_r = nn.Parameter(torch.zeros(hidden_dim, 1))
        self.W_xh = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
        self.W_hh = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
        self.b_h = nn.Parameter(torch.zeros(hidden_dim, 1))
        self.W_hy = nn.Parameter(torch.randn(output_dim, hidden_dim) * 0.1)
        self.b_y = nn.Parameter(torch.zeros(output_dim, 1))
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.output_dim = output_dim

    def forward(self, x_seq):
        h_prev = torch.zeros(self.hidden_dim, 1)
        for t in range(x_seq.shape[0]):
            x_t = x_seq[t].unsqueeze(1)
            z_t = torch.sigmoid(self.W_xz @ x_t + self.W_hz @ h_prev + self.b_z)
            r_t = torch.sigmoid(self.W_xr @ x_t + self.W_hr @ h_prev + self.b_r)
            h_tilde = torch.tanh(self.W_xh @ x_t + self.W_hh @ (r_t * h_prev) + self.b_h)
            h_t = (1 - z_t) * h_prev + z_t * h_tilde
            h_prev = h_t
        y_final = self.W_hy @ h_prev + self.b_y
        y_final = torch.log_softmax(y_final, dim=0)
        return y_final, h_prev

When applied to the same sentiment task, GRU achieves performance comparable to LSTM with a simpler architecture.

LSTM vs. GRU – When to Use Which

LSTM excels on very long sequences or tasks requiring fine‑grained control over memory (e.g., long‑form translation, document‑level sentiment). GRU is faster and works well on short to medium‑length inputs such as tweets, real‑time chat, or small datasets.
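The parameter savings are easy to quantify: per layer, an LSTM carries four input/recurrent weight blocks against the GRU's three, so at equal sizes the GRU uses 3/4 of the parameters. A quick check with the built‑in modules (sizes arbitrary):

```python
import torch.nn as nn

def n_params(m):
    return sum(p.numel() for p in m.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=128)
gru = nn.GRU(input_size=128, hidden_size=128)

# LSTM has 4 gate blocks per layer (i, f, g, o); GRU has 3 (r, z, n)
print(f"LSTM parameters: {n_params(lstm)}")
print(f"GRU  parameters: {n_params(gru)}")
print(f"ratio: {n_params(gru) / n_params(lstm):.2f}")   # 0.75
```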

RNN Encoder‑Decoder (Seq2Seq)

The encoder‑decoder framework uses two RNNs: the encoder compresses an input sequence of arbitrary length into a fixed‑size context vector C, and the decoder expands C into an output sequence, possibly of different length. This architecture underlies machine translation, summarization, and dialogue systems.

During decoding, the previous output token is fed back as input together with the hidden state, enabling auto‑regressive generation.
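A minimal sketch of this loop, using two `nn.GRU`s and greedy decoding (the class name, sizes, and the BOS‑token convention here are illustrative assumptions, not a production recipe):

```python
import torch
import torch.nn as nn

class TinySeq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden=32):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden)
        self.decoder = nn.GRU(hidden, hidden)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, max_len=5, bos_idx=0):
        # Encoder: compress the whole source sequence into the final
        # hidden state, i.e. the fixed-size context vector C
        _, context = self.encoder(self.src_emb(src).unsqueeze(1))
        # Decoder: start from C and feed each predicted token back in
        token = torch.tensor([bos_idx])
        hidden, outputs = context, []
        for _ in range(max_len):
            step_out, hidden = self.decoder(self.tgt_emb(token).unsqueeze(1), hidden)
            logits = self.out(step_out.squeeze(1))
            token = logits.argmax(dim=1)   # greedy: output becomes next input
            outputs.append(token.item())
        return outputs

model = TinySeq2Seq(src_vocab=10, tgt_vocab=10)
print(model(torch.tensor([1, 2, 3])))   # a list of 5 generated token ids
```

An untrained model emits arbitrary tokens, of course; the point is the data flow: encode once into C, then generate auto‑regressively.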

Seq2Seq and Its Generality

Google’s Seq2Seq model generalized the encoder‑decoder idea to any sequence‑to‑sequence task, not just translation. It treats tasks such as classification (output length = 1) or tagging (output length = input length) as special cases of Seq2Seq.

Limitations of RNN‑Based Seq2Seq

Before the Transformer, RNN‑based models suffered from two major drawbacks:

Inability to parallelize across time steps, leading to high training latency.

Difficulty handling very long‑range dependencies, causing performance to drop as input length grows.

These issues motivated the development of attention‑based architectures like the Transformer, which overcome both constraints.

[Figure: RNN vs. Transformer performance]