NLP Study Notes: How Word Vectors Capture Meaning
This article explains the evolution of natural language processing, introduces transformer‑based large models such as BERT, GPT and T5, and details how words are represented through one‑hot vectors and dense word embeddings, illustrating their training and analogy capabilities.
Natural Language Processing (NLP) is a core branch of artificial intelligence that aims to enable computers to understand, process, and generate human language, covering downstream tasks such as part‑of‑speech tagging, sentiment analysis, translation, speech recognition, named‑entity recognition, and summarization.
Early NLP relied on probabilistic and statistical methods, but recent breakthroughs come from deep‑learning models built on the Transformer architecture. The Transformer consists of an encoder that processes the input and a decoder that generates the output. Major model families derived from this architecture include encoder‑only BERT, decoder‑only GPT, and encoder‑decoder models like Google’s T5, all of which are considered “large models” because of their massive parameter counts (e.g., GPT‑3 has 175 billion parameters).
To feed text into these models, it must first be tokenized. Tools such as NLTK for English or Jieba for Chinese split sentences into word tokens. Sub‑word tokenizers, such as the WordPiece scheme used by BERT, go further and split words into smaller pieces (e.g., splitting “doing” into “do” and “##ing”). Words that cannot be matched against the vocabulary are mapped to a special unknown token (the “Other” token, often written [UNK]).
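The tokenization steps above can be sketched in a few lines. This is a minimal illustration, not how NLTK, Jieba, or any real tokenizer is implemented: it assumes a tiny hand-written vocabulary (VOCAB is hypothetical), uses greedy longest-match splitting in the style of WordPiece, and falls back to an [UNK] token for out-of-vocabulary words.

```python
import re

# Hypothetical toy vocabulary; real sub-word vocabularies are learned from a corpus.
VOCAB = {"i", "am", "do", "##ing", "this", "[UNK]"}

def split_words(sentence):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

def split_subwords(word, vocab=VOCAB, unk="[UNK]"):
    """Greedy longest-match-first split of one word into sub-word pieces."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark a piece that continues a word
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is out of vocabulary
        pieces.append(match)
        start = end
    return pieces

def tokenize(sentence):
    return [piece for word in split_words(sentence) for piece in split_subwords(word)]

print(tokenize("I am doing this"))  # ['i', 'am', 'do', '##ing', 'this']
print(tokenize("zebra"))            # ['[UNK]']
```

The "##" prefix only marks that a piece continues the previous one, so the original word can be reassembled from its pieces.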
Once tokenized, each token can be represented in two ways. The first is a one‑hot (sparse) vector: for a vocabulary of five words {apple, dog, do, this, cat}, “apple” is encoded as [1,0,0,0,0], “dog” as [0,1,0,0,0], etc. This representation wastes space and cannot express relationships between words.
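A short sketch of the one‑hot encoding described above, using the same five‑word vocabulary. The helper names (one_hot, dot) are chosen here for illustration. The dot product at the end shows why this representation cannot express relationships: any two different words are orthogonal, so every pair looks equally unrelated.

```python
VOCAB = ["apple", "dog", "do", "this", "cat"]
INDEX = {word: i for i, word in enumerate(VOCAB)}

def one_hot(word):
    """Return a vector of zeros with a single 1 at the word's vocabulary index."""
    vec = [0] * len(VOCAB)
    vec[INDEX[word]] = 1
    return vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

print(one_hot("apple"))                      # [1, 0, 0, 0, 0]
print(one_hot("dog"))                        # [0, 1, 0, 0, 0]
print(dot(one_hot("dog"), one_hot("cat")))   # 0
print(dot(one_hot("dog"), one_hot("dog")))   # 1
```

The vector length equals the vocabulary size, which is where the wasted space comes from once the vocabulary grows to tens of thousands of words.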
The second method is a dense word embedding: a real‑valued vector with far fewer dimensions than the vocabulary size, learned from data. Dense embeddings capture semantic similarity: words like “tree” and “flower” obtain vectors that lie close together in the embedding space, for example when measured by Euclidean distance.
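To make "close in Euclidean space" concrete, here is a small sketch. The three‑dimensional vectors are invented for illustration and do not come from a trained model; real embeddings typically have hundreds of dimensions.

```python
import math

# Hypothetical hand-picked embeddings, NOT trained values.
EMB = {
    "tree":   [0.9, 0.8, 0.1],
    "flower": [0.8, 0.9, 0.2],
    "car":    [0.1, 0.2, 0.9],
}

def euclidean(a, b):
    """Straight-line distance between two vectors (smaller means closer)."""
    return math.dist(a, b)

def cosine(a, b):
    """Cosine similarity (closer to 1 means more similar in direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

for other in ("flower", "car"):
    print(other, euclidean(EMB["tree"], EMB[other]), cosine(EMB["tree"], EMB[other]))
```

With these toy values, "flower" ends up both nearer to "tree" and more similar in direction than "car" does. Cosine similarity is shown alongside Euclidean distance because it is a common alternative for comparing word vectors.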
In a language‑modeling task, the model predicts the next word by outputting a probability distribution over the vocabulary. The hidden‑layer weights (denoted z₁, z₂,…, zₙ) of the first layer serve as the word vectors. Observing these vectors shows that semantically related words cluster together.
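The forward pass described above can be sketched as follows. All weights here are made‑up numbers and the names W_IN and W_OUT are assumptions for this example; a real model would learn them by training. The point is that multiplying a one‑hot input by the first‑layer weight matrix simply selects one row, and that row is the word's vector.

```python
import math

VOCAB = ["apple", "dog", "do", "this", "cat"]

# Hypothetical first-layer weights: one 2-dimensional row per vocabulary word.
W_IN = [[0.2, 0.1], [0.7, 0.3], [0.1, 0.9], [0.4, 0.4], [0.6, 0.2]]
# Hypothetical output weights: 2 hidden units x 5 vocabulary words.
W_OUT = [[0.5, -0.2, 0.1, 0.3, 0.8], [0.1, 0.4, -0.3, 0.2, 0.0]]

def one_hot(index, size):
    return [1 if i == index else 0 for i in range(size)]

def vec_times_matrix(vec, matrix):
    """Multiply a row vector by a matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, matrix))
            for j in range(len(matrix[0]))]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(word):
    x = one_hot(VOCAB.index(word), len(VOCAB))
    hidden = vec_times_matrix(x, W_IN)  # same values as the word's row of W_IN
    return softmax(vec_times_matrix(hidden, W_OUT))

probs = next_word_distribution("dog")
print(dict(zip(VOCAB, probs)))
```

In practice the one‑hot multiplication is replaced by a direct row lookup, which gives the same result without building the sparse vector.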
Word embeddings also enable analogical reasoning through vector arithmetic, for example:
V(hotter) - V(hot) ≈ V(bigger) - V(big)
V(Rome) - V(Italy) ≈ V(Berlin) - V(Germany)
V(king) - V(queen) ≈ V(uncle) - V(aunt)
When the model is asked “Rome is to Italy as Berlin is to ?”, it computes V(Berlin) - V(Rome) + V(Italy) and returns the word whose vector lies closest to the result (typically measured by cosine similarity), here “Germany”, demonstrating the power of word vectors.
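The analogy computation can be sketched with toy vectors. The two‑dimensional values below are invented so that each capital sits at a fixed offset from its country; they are not output from a trained model. The nearest‑neighbour search uses cosine similarity and skips the three query words, which is a common convention rather than something stated in the source.

```python
import math

# Hypothetical 2-D embeddings, chosen by hand for illustration.
V = {
    "Rome":   [1.0, 3.0], "Italy":   [1.0, 1.0],
    "Berlin": [4.0, 3.0], "Germany": [4.0, 1.0],
    "Paris":  [7.0, 3.0], "France":  [7.0, 1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c):
    """Answer 'a is to b as c is to ?' via V(c) - V(a) + V(b)."""
    target = [vc - va + vb for va, vb, vc in zip(V[a], V[b], V[c])]
    candidates = [w for w in V if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(V[w], target))

print(analogy("Rome", "Italy", "Berlin"))  # Germany
```

Here V(Berlin) - V(Rome) + V(Italy) equals [4.0, 1.0], which coincides with the toy vector for "Germany". With real embeddings the match is only approximate, which is why a nearest‑neighbour search is needed.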
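The two primary training algorithms for such embeddings are CBOW (Continuous Bag‑of‑Words) and Skip‑gram, the two architectures of word2vec. CBOW predicts a target word from its surrounding context words, whereas Skip‑gram does the reverse and predicts the surrounding context words from the target word. In both cases the dense representations are learned as a by‑product of the prediction task.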
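The difference between the two algorithms shows up most clearly in the training examples they build from the same text. The sketch below only generates those (input, label) examples and does not train a model; the function names and the window parameter are assumptions for this illustration.

```python
def context_of(tokens, i, window):
    """Words within `window` positions of index i, excluding the word itself."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return [tokens[j] for j in range(lo, hi) if j != i]

def cbow_examples(tokens, window=1):
    """CBOW: the input is the context words, the label is the target word."""
    return [(context_of(tokens, i, window), target)
            for i, target in enumerate(tokens)]

def skipgram_examples(tokens, window=1):
    """Skip-gram: the input is the target word, the label is one context word."""
    return [(target, ctx)
            for i, target in enumerate(tokens)
            for ctx in context_of(tokens, i, window)]

tokens = ["the", "cat", "sat"]
print(cbow_examples(tokens))
# [(['cat'], 'the'), (['the', 'sat'], 'cat'), (['cat'], 'sat')]
print(skipgram_examples(tokens))
# [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```

CBOW produces one example per position, while Skip‑gram produces one example per (target, context) pair, so the same sentence yields more Skip‑gram examples.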
Lisa Notes
Lisa's notes: musings on daily life, work, study, personal growth, and casual reflections.
