NLP Basics: Word Embeddings, Word2Vec, and Hand‑crafted RNN Implementation in PyTorch
This article introduces word‑level representations—from one‑hot encoding to dense word embeddings via Word2Vec—explains cosine similarity, then walks through the structure, limitations, and PyTorch implementation of a vanilla RNN, including a custom forward function and verification against the library API.
Preface
Hello everyone, I’m Xiao Su. In this series I will explore Natural Language Processing (NLP) from a beginner’s perspective, covering essential concepts such as word vectors, RNNs, LSTM/ELMo, and finally GPT/BERT.
Word Vectors
In NLP we must convert words into numeric forms that computers can process. One‑hot encoding represents each word as a high‑dimensional sparse vector, which wastes space and cannot capture relationships between words: the cosine similarity between any two distinct one‑hot vectors is always zero.
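To make that concrete, here is a toy one‑hot setup (the four‑word vocabulary is invented for illustration):

```python
import torch
import torch.nn.functional as F

# Toy vocabulary of 4 words; each word becomes a 4-dimensional one-hot vector.
vocab = ["king", "queen", "man", "woman"]
one_hot = torch.eye(len(vocab))

king, queen = one_hot[0], one_hot[1]
# Distinct one-hot vectors never overlap, so their cosine similarity is 0
# no matter how related the words actually are.
print(F.cosine_similarity(king, queen, dim=0))  # tensor(0.)
```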
Effective word representations should be dense, low‑dimensional, and allow similarity computation. Word2Vec learns such embeddings by training a shallow neural network (CBOW or Skip‑gram) whose by‑product is an embedding matrix Q: each row of Q is the learned vector for one vocabulary word.
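As a minimal sketch of the Skip‑gram scoring step (the sizes and word ids below are placeholders, and a real model would add a loss such as negative sampling):

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 50   # hypothetical vocabulary and embedding sizes

# The embedding matrix Q: looking up a word id returns its dense row vector.
Q = nn.Embedding(vocab_size, embed_dim)    # "input" (center-word) embeddings
ctx = nn.Embedding(vocab_size, embed_dim)  # "output" (context-word) embeddings

center_id = torch.tensor([42])   # a center word
context_id = torch.tensor([7])   # one word from its context window

# Skip-gram scores a (center, context) pair by the dot product of their vectors;
# training raises this score for observed pairs and lowers it for sampled negatives.
score = (Q(center_id) * ctx(context_id)).sum(dim=-1)
```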
As an example, the word "king" can be represented by a 50‑dimensional dense vector, and the vectors for "man" and "woman" resemble it in ways that reflect semantic proximity. Cosine similarity is the standard measure of how close two such vectors are.
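Computing it in PyTorch is a one‑liner (the vectors below are random stand‑ins for real embeddings):

```python
import torch

a = torch.randn(50)  # stand-ins for two 50-dimensional word vectors
b = torch.randn(50)

# cos(a, b) = (a . b) / (||a|| * ||b||); values near 1 mean similar directions.
cos = torch.dot(a, b) / (a.norm() * b.norm())
# Equivalently: torch.nn.functional.cosine_similarity(a, b, dim=0)
```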
Word embeddings can be visualized by mapping each vector component to a color, which shows that related words occupy nearby regions of the embedding space.
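One way to produce such a heatmap (the vectors here are random; real ones would come from a trained model):

```python
import torch
import matplotlib.pyplot as plt

words = ["king", "man", "woman"]
vectors = torch.randn(len(words), 50)  # placeholder 50-dim embeddings

# Each row is one word; each cell's color encodes one vector component.
plt.imshow(vectors.numpy(), aspect="auto", cmap="RdBu")
plt.yticks(range(len(words)), words)
plt.colorbar()
plt.show()
```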
RNN Model
Recurrent Neural Networks (RNNs) handle sequential data such as text. The basic RNN cell consists of a tanh layer that processes the current input and the previous hidden state.
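Concretely, at each time step t the cell computes

h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)

where x_t is the current input and h_{t-1} the previous hidden state. This is also the update rule nn.RNN implements, and the one the hand‑crafted version below reproduces.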
RNNs suffer from the long‑distance dependency problem: because gradients must flow backward through one tanh‑and‑matrix‑multiply step per token, they tend to shrink (or blow up) over long spans, so the model struggles to capture relationships between tokens that are far apart in the sequence.
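A quick empirical illustration of the effect (the sizes and seed are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=4, hidden_size=4, batch_first=True)
x = torch.randn(1, 100, 4, requires_grad=True)  # one sequence of 100 steps

out, _ = rnn(x)
out[0, -1].sum().backward()  # gradient of the final hidden state w.r.t. every input

# Influence of the first vs. the last time step on the final state;
# the early one is typically orders of magnitude smaller.
print(x.grad[0, 0].norm(), x.grad[0, -1].norm())
```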
Hand‑crafted RNN in PyTorch
First, the built‑in nn.RNN API is demonstrated:
```python
import torch
import torch.nn as nn

bs, T = 2, 3                    # batch size, sequence length
input_size, hidden_size = 2, 3  # per-step feature size, hidden-state size

input = torch.randn(bs, T, input_size)  # (bs, T, input_size)
h_prev = torch.zeros(bs, hidden_size)   # initial hidden state, (bs, hidden_size)

rnn = nn.RNN(input_size, hidden_size, batch_first=True)
# nn.RNN expects the initial state as (num_layers, bs, hidden_size), hence unsqueeze(0).
rnn_output, state_final = rnn(input, h_prev.unsqueeze(0))
# rnn_output: (bs, T, hidden_size); state_final: (1, bs, hidden_size)
```

Next, a custom forward function rnn_forward is provided, which manually performs the matrix multiplications and tanh activation for each time step:
```python
def rnn_forward(input, weight_ih, weight_hh, bias_ih, bias_hh, h_prev):
    bs, T, input_size = input.shape
    h_dim = weight_ih.shape[0]
    h_out = torch.zeros(bs, T, h_dim)  # collects the hidden state at every step

    for t in range(T):
        x = input[:, t, :].unsqueeze(2)  # current input, (bs, input_size, 1)
        # Broadcast the shared weights across the batch so torch.bmm can be used.
        w_ih_batch = weight_ih.unsqueeze(0).tile(bs, 1, 1)  # (bs, h_dim, input_size)
        w_hh_batch = weight_hh.unsqueeze(0).tile(bs, 1, 1)  # (bs, h_dim, h_dim)

        # W_ih @ x_t for every sample in the batch -> (bs, h_dim)
        w_times_x = torch.bmm(x.transpose(1, 2), w_ih_batch.transpose(1, 2)).transpose(1, 2).squeeze(-1)
        # W_hh @ h_{t-1} for every sample in the batch -> (bs, h_dim)
        w_times_h = torch.bmm(h_prev.unsqueeze(2).transpose(1, 2), w_hh_batch.transpose(1, 2)).transpose(1, 2).squeeze(-1)

        # h_t = tanh(W_ih x_t + b_ih + W_hh h_{t-1} + b_hh)
        h_prev = torch.tanh(w_times_x + bias_ih + w_times_h + bias_hh)
        h_out[:, t, :] = h_prev

    # Mirror nn.RNN's return values: all hidden states, plus the final state
    # with a leading num_layers dimension.
    return h_out, h_prev.unsqueeze(0)
```

The custom implementation is verified by feeding it the same parameters extracted from the built‑in RNN (weights and biases) and confirming that custom_rnn_output and custom_state_final match the library results.
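A sketch of that check, reusing the tensors from the demo above (the _l0 suffix is nn.RNN's naming convention for its single layer):

```python
# Run the hand-crafted forward pass with the built-in layer's own parameters.
custom_rnn_output, custom_state_final = rnn_forward(
    input,
    rnn.weight_ih_l0, rnn.weight_hh_l0,
    rnn.bias_ih_l0, rnn.bias_hh_l0,
    h_prev,
)

print(torch.allclose(rnn_output, custom_rnn_output))    # True
print(torch.allclose(state_final, custom_state_final))  # True
```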