From RNNs to LSTMs and GRUs: A Hands‑On Guide to Sequence Modeling in PyTorch
This tutorial explains the nature of sequential data, why traditional feed‑forward networks struggle with it, and how recurrent architectures such as RNN, LSTM, and GRU capture temporal dependencies, complete with mathematical foundations, training algorithms, and full PyTorch implementations for sentiment analysis, text generation, and encoder‑decoder models.
What Is Sequence Data?
Sequence data is an ordered list of data points or events where each element has an inherent temporal relationship. Typical examples include natural‑language text, speech, video frames, stock prices, weather readings, and sensor streams. In NLP, tasks like translation and speech recognition rely heavily on processing such sequences.
Why Traditional Feed‑Forward Networks Fail
Feed‑forward neural networks (FFNs) and convolutional neural networks (CNNs) treat their inputs as independent of one another, so they cannot capture the order‑dependent patterns in sequences. Given a sentence, an FFN processes each word in isolation and loses crucial context.
Recurrent Neural Networks (RNN)
RNNs introduce a feedback loop that allows the network to retain information from previous time steps, effectively “remembering the past.” The core characteristics are:
Information Persistence : hidden states store historical information.
Feedback Loop : each neuron receives its previous hidden state together with the current input.
Capturing Dependencies : enables learning of short‑range dependencies, though long‑range learning is limited by gradient issues.
The basic RNN equations are:
h_t = f(W_hh * h_{t-1} + W_xh * x_t + b_h)   # Hidden state
y_t = W_hy * h_t + b_y                       # Output

where x_t is the input at time t, h_t the hidden state, y_t the output, and f a nonlinearity (tanh in the code below).
Training RNNs with BPTT
Training uses Back‑Propagation Through Time (BPTT), which unfolds the network across time steps and applies standard back‑propagation to compute gradients for the shared weight matrices W_xh, W_hh, and W_hy.
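In PyTorch, BPTT falls out of autograd automatically: looping over time steps in the forward pass builds an unrolled computation graph, and a single backward() call accumulates the gradient contributions of every step into the shared matrices. The toy sketch below (dimensions and names are illustrative, not from the original code) makes this explicit:

import torch

W_hh = (torch.randn(4, 4) * 0.1).requires_grad_()  # shared recurrent weights
W_xh = (torch.randn(4, 4) * 0.1).requires_grad_()  # shared input weights
xs = torch.randn(6, 4)  # a 6-step input sequence

h = torch.zeros(4)
for t in range(6):  # the forward pass unrolls the recurrence across time
    h = torch.tanh(W_hh @ h + W_xh @ xs[t])
loss = h.sum()
loss.backward()          # BPTT: gradients from all 6 steps sum into W_hh.grad
print(W_hh.grad.shape)   # torch.Size([4, 4])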
PyTorch Example: Single‑Layer RNN for Sentiment Classification
The following code builds a minimal RNN that classifies very short Chinese sentences as positive or negative.
import torch
import torch.nn as nn
import torch.optim as optim
class SimpleRNN(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super(SimpleRNN, self).__init__()
self.W_xh = nn.Parameter(torch.randn(hidden_dim, input_dim))
self.W_hh = nn.Parameter(torch.randn(hidden_dim, hidden_dim))
self.W_hy = nn.Parameter(torch.randn(output_dim, hidden_dim))
self.b_h = nn.Parameter(torch.randn(hidden_dim, 1))
self.b_y = nn.Parameter(torch.randn(output_dim, 1))
self.input_dim = input_dim
self.hidden_dim = hidden_dim
self.output_dim = output_dim
def forward(self, x_seq):
h_prev = torch.zeros(self.hidden_dim, 1)
for t in range(x_seq.shape[0]):
x_t = x_seq[t].unsqueeze(1)
h_t = torch.tanh(self.W_hh @ h_prev + self.W_xh @ x_t + self.b_h)
h_prev = h_t
y_final = self.W_hy @ h_prev + self.b_y
y_final = torch.log_softmax(y_final, dim=0)
return y_final, h_prev
# Vocabulary and data preparation (one‑hot encoding)
vocab = {"电影":0, "好看":1, "饭菜":2, "难吃":3, "天气":4, "糟糕":5, "音乐":6, "好听":7, "剧情":8, "一般":9, "但":10, "演技":11, "好":12, "难看":13, "好吃":14, "心情":15, "分量":16, "少":17}
vocab_size = len(vocab)
train_data = [("电影 好看",1), ("饭菜 难吃",0), ("天气 糟糕",0), ("音乐 好听",1), ("电影 难看",0), ("饭菜 好吃",1)]
def text2onehot(text):
words = text.split()
seq = []
for w in words:
vec = torch.zeros(vocab_size)
if w in vocab:
vec[vocab[w]] = 1.0
seq.append(vec)
return torch.stack(seq)
input_dim = vocab_size
hidden_dim = 4
output_dim = 2
model = SimpleRNN(input_dim, hidden_dim, output_dim)
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)
epochs = 1000
for epoch in range(epochs):
total_loss = 0.0
for text, label in train_data:
x_seq = text2onehot(text)
y_true = torch.tensor([label])
y_pred, _ = model(x_seq)
y_pred = y_pred.squeeze(1).reshape(1,2)
loss = criterion(y_pred, y_true)
optimizer.zero_grad()
loss.backward()
optimizer.step()
total_loss += loss.item()
if (epoch+1) % 100 == 0:
print(f"Epoch {epoch+1}, Loss: {total_loss/len(train_data):.4f}")
def predict(text):
model.eval()
with torch.no_grad():
x_seq = text2onehot(text)
y_pred, _ = model(x_seq)
y_pred = y_pred.squeeze(1).reshape(1,2)
pred = torch.argmax(y_pred).item()
sentiment = "正面" if pred == 1 else "负面"
return f"文本:{text} → 预测情感:{sentiment}(标签:{pred})"
print("
===== 可用示例(极短简单文本) =====")
print(predict("电影 好看"))
print(predict("饭菜 难吃"))
print(predict("音乐 好听"))Training quickly reduces loss (e.g., Epoch 1000, Loss: 0.0007) and the model correctly classifies short sentences. Longer or contradictory sentences expose the RNN’s limitation.
Two‑Layer RNN for Text Generation
A deeper RNN can capture richer patterns. The example trains a two‑layer model on classical Chinese poems and generates new verses given a short prefix.
import torch
import torch.nn as nn
import torch.optim as optim
# Prepare character‑level data
text = """床前明月光,疑是地上霜。举头望明月,低头思故乡。..."""
text = text.replace("
", "").replace(" ", "")
chars = sorted(list(set(text)))
char2idx = {c:i for i,c in enumerate(chars)}
idx2char = {i:c for i,c in enumerate(chars)}
vocab_size = len(chars)
seq_len = 10
def build_data(text, seq_len):
data = []
for i in range(len(text)-seq_len):
input_seq = text[i:i+seq_len]
target_seq = text[i+1:i+seq_len+1]
x = torch.tensor([char2idx[c] for c in input_seq], dtype=torch.long)
y = torch.tensor([char2idx[c] for c in target_seq], dtype=torch.long)
x_onehot = torch.eye(vocab_size)[x]
data.append((x_onehot, y))
return data
train_data = build_data(text, seq_len)
class SimpleTwoLayerRNN(nn.Module):
def __init__(self, input_dim, hidden_dim1, hidden_dim2, output_dim):
super().__init__()
self.W_xh1 = nn.Parameter(torch.randn(hidden_dim1, input_dim) * 0.1)
self.W_hh1 = nn.Parameter(torch.randn(hidden_dim1, hidden_dim1) * 0.1)
self.b_h1 = nn.Parameter(torch.zeros(hidden_dim1, 1))
self.W_h1h2 = nn.Parameter(torch.randn(hidden_dim2, hidden_dim1) * 0.1)
self.W_hh2 = nn.Parameter(torch.randn(hidden_dim2, hidden_dim2) * 0.1)
self.b_h2 = nn.Parameter(torch.zeros(hidden_dim2, 1))
self.W_hy = nn.Parameter(torch.randn(output_dim, hidden_dim2) * 0.1)
self.b_y = nn.Parameter(torch.zeros(output_dim, 1))
def forward(self, x_seq):
h1 = torch.zeros(self.W_hh1.shape[0], 1)
h2 = torch.zeros(self.W_hh2.shape[0], 1)
outputs = []
for t in range(x_seq.shape[0]):
x_t = x_seq[t].unsqueeze(1)
h1 = torch.tanh(self.W_hh1 @ h1 + self.W_xh1 @ x_t + self.b_h1)
h2 = torch.tanh(self.W_hh2 @ h2 + self.W_h1h2 @ h1 + self.b_h2)
y_t = self.W_hy @ h2 + self.b_y
y_t = torch.log_softmax(y_t.squeeze(1), dim=0)
outputs.append(y_t)
return torch.stack(outputs, dim=0)
hidden_dim1 = 24
hidden_dim2 = 12
model = SimpleTwoLayerRNN(vocab_size, hidden_dim1, hidden_dim2, vocab_size)
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.015)
for epoch in range(500):
total_loss = 0.0
for x, y in train_data:
optimizer.zero_grad()
preds = model(x)
loss = criterion(preds, y)
loss.backward()
optimizer.step()
total_loss += loss.item()
if (epoch+1) % 100 == 0:
print(f"Epoch {epoch+1} | Avg Loss: {total_loss/len(train_data):.4f}")
def generate_text(prefix, max_len=20):
model.eval()
with torch.no_grad():
gen = list(prefix)
h1 = torch.zeros(hidden_dim1, 1)
h2 = torch.zeros(hidden_dim2, 1)
for c in prefix[:-1]:  # warm up on all but the last prefix char; the loop below feeds it
x_t = torch.eye(vocab_size)[char2idx[c]].unsqueeze(1)
h1 = torch.tanh(model.W_hh1 @ h1 + model.W_xh1 @ x_t + model.b_h1)
h2 = torch.tanh(model.W_hh2 @ h2 + model.W_h1h2 @ h1 + model.b_h2)
while len(gen) < max_len:
last = gen[-1]
x_t = torch.eye(vocab_size)[char2idx[last]].unsqueeze(1)
h1 = torch.tanh(model.W_hh1 @ h1 + model.W_xh1 @ x_t + model.b_h1)
h2 = torch.tanh(model.W_hh2 @ h2 + model.W_h1h2 @ h1 + model.b_h2)
y_t = model.W_hy @ h2 + model.b_y
next_idx = torch.argmax(y_t).item()
gen.append(idx2char[next_idx])
return "".join(gen)
print("
===== 文本生成结果(多前缀测试) =====")
for p in ["床前", "白日", "春眠", "千山"]:
print(f"前缀:{p} → 生成:{generate_text(p, max_len=20)}")The model learns character‑level patterns and can continue a given prefix with plausible poetic lines.
Input‑Output Structures in Sequence Modeling
Three common structures appear in practice; a shape‑level sketch follows the list:
N‑N : equal‑length input and output sequences (e.g., part‑of‑speech tagging, language modeling).
1‑N : single input producing a longer output (e.g., image captioning, music generation).
N‑1 : multiple inputs collapsed into a single output (e.g., sequence classification, sentiment analysis).
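The difference is mainly which time steps produce outputs. A shape‑level sketch with PyTorch's built‑in nn.RNN (batch_first layout; all sizes here are arbitrary) contrasts N‑N with N‑1, while 1‑N corresponds to the decoder loop in the encoder‑decoder section below:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(2, 5, 8)           # batch of 2 sequences, 5 steps, 8 features
out, h_n = rnn(x)                  # out: (2, 5, 16), one hidden state per step
# N-N: project every time step (tagging, language modeling)
per_step = nn.Linear(16, 10)(out)  # (2, 5, 10)
# N-1: keep only the final hidden state (sequence classification)
final = nn.Linear(16, 3)(h_n[-1])  # h_n[-1]: (2, 16) → (2, 3)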
Long‑Sequence Dependency Problems in RNNs
Even though RNNs are theoretically capable of learning arbitrarily long dependencies, in practice they suffer from gradient vanishing and exploding:
Gradient Vanishing
During back‑propagation, repeated multiplication by the derivative of tanh (which is at most 1 and usually well below it) and by the recurrent weight matrix W_hh causes gradients to shrink exponentially, so information from early time steps stops influencing the weight updates. The ratio first_step_gradient / last_step_gradient often falls below 0.1 even for moderately long sequences; the sketch below measures it directly.
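This is easy to observe with autograd. The following sketch (dimensions and initialization are illustrative) unrolls a toy recurrence for 50 steps, retains the gradient at every hidden state, and prints the first‑to‑last gradient‑norm ratio:

import torch

torch.manual_seed(0)
seq_len, hidden = 50, 16
W_hh = (torch.randn(hidden, hidden) * 0.5).requires_grad_()
W_xh = (torch.randn(hidden, hidden) * 0.5).requires_grad_()
xs = torch.randn(seq_len, hidden)

h, states = torch.zeros(hidden), []
for t in range(seq_len):
    h = torch.tanh(W_hh @ h + W_xh @ xs[t])
    h.retain_grad()  # keep the gradient of this intermediate state
    states.append(h)

states[-1].sum().backward()
ratio = states[0].grad.norm() / states[-1].grad.norm()
print(f"first/last gradient-norm ratio: {ratio:.2e}")  # typically far below 0.1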
Gradient Exploding
If W_hh has large eigenvalues, gradients can instead grow exponentially, producing numerical overflow (NaN losses). Gradient clipping is the standard remedy, as sketched below.
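PyTorch provides a utility for exactly this. A typical training‑loop fragment (the max_norm of 1.0 is a common but arbitrary choice) clips the global gradient norm between backward() and the optimizer step:

loss.backward()
# Rescale all gradients so that their combined norm is at most 1.0.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()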
LSTM – Long Short‑Term Memory
Introduced in 1997, LSTM adds a memory cell C_t and three gates (forget, input, output) to regulate information flow, thereby mitigating vanishing gradients.
Gate Equations
f_t = σ(W_f * [h_{t-1}, x_t] + b_f) # Forget gate
i_t = σ(W_i * [h_{t-1}, x_t] + b_i) # Input gate
\tilde{C}_t = tanh(W_c * [h_{t-1}, x_t] + b_c) # Candidate cell
C_t = f_t * C_{t-1} + i_t * \tilde{C}_t # Cell update
o_t = σ(W_o * [h_{t-1}, x_t] + b_o) # Output gate
h_t = o_t * tanh(C_t)                            # Hidden state

These mechanisms allow gradients to flow unchanged through the cell state when the forget gate is near 1.
PyTorch Implementation of a Single‑Layer LSTM
class SimpleLSTM(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super().__init__()
self.W_xi = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
self.W_hi = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
self.b_i = nn.Parameter(torch.zeros(hidden_dim, 1))
self.W_xf = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
self.W_hf = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
self.b_f = nn.Parameter(torch.zeros(hidden_dim, 1))
self.W_xo = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
self.W_ho = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
self.b_o = nn.Parameter(torch.zeros(hidden_dim, 1))
self.W_xc = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
self.W_hc = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
self.b_c = nn.Parameter(torch.zeros(hidden_dim, 1))
self.W_hy = nn.Parameter(torch.randn(output_dim, hidden_dim) * 0.1)
self.b_y = nn.Parameter(torch.zeros(output_dim, 1))
self.input_dim = input_dim
self.hidden_dim = hidden_dim
self.output_dim = output_dim
def forward(self, x_seq):
h_prev = torch.zeros(self.hidden_dim, 1)
c_prev = torch.zeros(self.hidden_dim, 1)
for t in range(x_seq.shape[0]):
x_t = x_seq[t].unsqueeze(1)
i_t = torch.sigmoid(self.W_xi @ x_t + self.W_hi @ h_prev + self.b_i)
f_t = torch.sigmoid(self.W_xf @ x_t + self.W_hf @ h_prev + self.b_f)
o_t = torch.sigmoid(self.W_xo @ x_t + self.W_ho @ h_prev + self.b_o)
c_tilde = torch.tanh(self.W_xc @ x_t + self.W_hc @ h_prev + self.b_c)
c_t = f_t * c_prev + i_t * c_tilde
h_t = o_t * torch.tanh(c_t)
h_prev, c_prev = h_t, c_t
y_final = self.W_hy @ h_prev + self.b_y
y_final = torch.log_softmax(y_final, dim=0)
return y_final, h_prev

The same sentiment‑analysis pipeline as in the RNN example can be run with this LSTM class, yielding higher accuracy on longer or contradictory sentences.
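Swapping the model into the earlier pipeline is a one‑line change, assuming the vocabulary, text2onehot, training loop, and predict() helper from the RNN example are still in scope:

# Drop-in replacement for SimpleRNN; everything else stays the same.
model = SimpleLSTM(input_dim=vocab_size, hidden_dim=4, output_dim=2)
criterion = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)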
GRU – Gated Recurrent Unit
Proposed in 2014, GRU merges the forget and input gates into an update gate and combines the cell and hidden state, resulting in fewer parameters.
Gate Equations
z_t = σ(W_z * [h_{t-1}, x_t] + b_z) # Update gate
r_t = σ(W_r * [h_{t-1}, x_t] + b_r) # Reset gate
\tilde{h}_t = tanh(W_h * [r_t * h_{t-1}, x_t] + b_h)
h_t = (1 - z_t) * h_{t-1} + z_t * \tilde{h}_t  # Hidden state update

PyTorch Implementation of a Single‑Layer GRU
class SimpleGRU(nn.Module):
def __init__(self, input_dim, hidden_dim, output_dim):
super().__init__()
self.W_xz = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
self.W_hz = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
self.b_z = nn.Parameter(torch.zeros(hidden_dim, 1))
self.W_xr = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
self.W_hr = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
self.b_r = nn.Parameter(torch.zeros(hidden_dim, 1))
self.W_xh = nn.Parameter(torch.randn(hidden_dim, input_dim) * 0.1)
self.W_hh = nn.Parameter(torch.randn(hidden_dim, hidden_dim) * 0.1)
self.b_h = nn.Parameter(torch.zeros(hidden_dim, 1))
self.W_hy = nn.Parameter(torch.randn(output_dim, hidden_dim) * 0.1)
self.b_y = nn.Parameter(torch.zeros(output_dim, 1))
self.input_dim = input_dim
self.hidden_dim = hidden_dim
self.output_dim = output_dim
def forward(self, x_seq):
h_prev = torch.zeros(self.hidden_dim, 1)
for t in range(x_seq.shape[0]):
x_t = x_seq[t].unsqueeze(1)
z_t = torch.sigmoid(self.W_xz @ x_t + self.W_hz @ h_prev + self.b_z)
r_t = torch.sigmoid(self.W_xr @ x_t + self.W_hr @ h_prev + self.b_r)
h_tilde = torch.tanh(self.W_xh @ x_t + self.W_hh @ (r_t * h_prev) + self.b_h)
h_t = (1 - z_t) * h_prev + z_t * h_tilde
h_prev = h_t
y_final = self.W_hy @ h_prev + self.b_y
y_final = torch.log_softmax(y_final, dim=0)
return y_final, h_prev

When applied to the same sentiment task, the GRU achieves performance comparable to the LSTM with a simpler architecture.
LSTM vs. GRU – When to Use Which
LSTM excels on very long sequences or tasks requiring fine‑grained control over memory (e.g., long‑form translation, document‑level sentiment). GRU is faster and works well on short to medium‑length inputs such as tweets, real‑time chat, or small datasets.
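The parameter savings are easy to quantify with PyTorch's built‑in modules (the sizes here are arbitrary): a GRU has three gate blocks to the LSTM's four, so roughly three quarters of the parameters at the same hidden size.

import torch.nn as nn

lstm = nn.LSTM(input_size=128, hidden_size=256)
gru = nn.GRU(input_size=128, hidden_size=256)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(lstm), count(gru))  # 395264 vs. 296448, a 4:3 ratio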
RNN Encoder‑Decoder (Seq2Seq)
The encoder‑decoder framework uses two RNNs: the encoder compresses an input sequence of arbitrary length into a fixed‑size context vector C, and the decoder expands C into an output sequence, possibly of different length. This architecture underlies machine translation, summarization, and dialogue systems.
During decoding, the previous output token is fed back as input together with the hidden state, enabling auto‑regressive generation.
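A minimal encoder‑decoder sketch with PyTorch's built‑in nn.GRU makes this data flow concrete. All specifics here (vocabulary sizes, the SOS token id, greedy decoding) are illustrative assumptions, not the original article's code:

import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, hidden)
        self.tgt_emb = nn.Embedding(tgt_vocab, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, sos_id=0, max_len=12):
        # Encoder: compress the whole source sequence into a context vector C.
        _, context = self.encoder(self.src_emb(src))       # (1, batch, hidden)
        # Decoder: start from C, feed each prediction back in (greedy decoding).
        token = torch.full((src.size(0), 1), sos_id, dtype=torch.long)
        state, outputs = context, []
        for _ in range(max_len):
            step, state = self.decoder(self.tgt_emb(token), state)
            logits = self.out(step[:, -1])                 # (batch, tgt_vocab)
            token = logits.argmax(dim=1, keepdim=True)     # auto-regressive feedback
            outputs.append(token)
        return torch.cat(outputs, dim=1)                   # (batch, max_len)

model = Seq2Seq(src_vocab=100, tgt_vocab=120)
print(model(torch.randint(0, 100, (2, 7))).shape)          # torch.Size([2, 12])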
Seq2Seq and Its Generality
Google’s Seq2Seq model generalized the encoder‑decoder idea to any sequence‑to‑sequence task, not just translation. It treats tasks such as classification (output length = 1) or tagging (output length = input length) as special cases of Seq2Seq.
Limitations of RNN‑Based Seq2Seq
Before the Transformer, RNN‑based models suffered from two major drawbacks:
Inability to parallelize across time steps, leading to high training latency.
Difficulty handling very long‑range dependencies, causing performance to drop as input length grows.
These issues motivated the development of attention‑based architectures like the Transformer, which overcome both constraints.