From Word2Vec to Quick-Thought: A Complete Guide to Modern Embeddings
This article reviews the evolution of word and sentence embeddings. It covers foundations such as vector semantics and the distributional hypothesis, practical models including Word2Vec, GloVe, fastText, Skip‑Thought, and Quick‑Thought, and evaluation techniques, along with implementation tips and real‑world use cases.
Embedding Evolution Overview
In 2013, Word2Vec sparked a wave of pre‑trained word embedding techniques in NLP. Six years later, embedding methods have become standard components of neural NLP pipelines and have evolved at both the word level and the sentence level.
From a practical standpoint, fastText is often the default choice for word embeddings, while contextual embeddings such as ELMo, and later pre‑trained language models like ULMFiT and BERT, provide richer context.
Sentence embeddings have progressed from unsupervised models like Skip‑Thought (2015) to Quick‑Thought (2018), gaining increasing adoption in industry.
Vector Semantics
A perfect vector representation would capture all layers of word meaning, but this is unrealistic. The most successful models for representing word meaning are based on vector semantics, which consists of two parts:
Distributional hypothesis: words occurring in similar contexts tend to have similar meanings.
Defining a word as a point in an N‑dimensional semantic space learned directly from its distribution in text.
Using vector representations simplifies word similarity calculations.
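As a small illustration of why vectors make similarity easy to compute, the sketch below measures cosine similarity between word vectors. The three-dimensional vectors are invented toy values, not output from any trained model, and the function name is only illustrative.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two word vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy 3-dimensional vectors, invented for illustration only.
cat = [0.9, 0.8, 0.1]
dog = [0.8, 0.9, 0.2]
car = [0.1, 0.2, 0.9]

sim_cat_dog = cosine_similarity(cat, dog)  # vectors point the same way
sim_cat_car = cosine_similarity(cat, car)  # vectors point apart
```

With these toy values, "cat" scores as more similar to "dog" than to "car", which is the behavior a real embedding space is expected to show for related words.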
Co‑occurrence Matrix and Basic Vector Semantics Models
Vector semantics models are usually built on co‑occurrence matrices, which implement the distributional hypothesis. Two main types of co‑occurrence matrices are used:
Term‑document matrix (common in information retrieval).
Word‑word matrix (common for word embeddings).
The term‑document matrix represents each word as a row and each document as a column, with cell values indicating word frequencies. TF‑IDF weighting mitigates the dominance of high‑frequency stopwords.
TF‑IDF (term frequency–inverse document frequency) is a fundamental weighting technique in NLP and information retrieval, emphasizing words that are frequent in a document but rare across the corpus.
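The weighting can be sketched in a few lines. This minimal version assumes raw term counts for TF and idf = log(N / df); production libraries usually add smoothing or normalization, so treat the exact formula and the `tf_idf` helper as assumptions of this example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Uses raw term frequency and idf = log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy corpus, invented for illustration.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the stock market fell".split(),
]
weights = tf_idf(docs)
```

Because "the" occurs in every document, its idf is log(1) = 0 and its weight vanishes, while a word such as "mat" that appears in a single document receives the largest weight.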
The word‑word matrix captures how often a target word co‑occurs with context words within a sliding window. Pointwise Mutual Information (PMI) quantifies the strength of association between word pairs, and Positive PMI (PPMI) is commonly used in practice.
PPMI matrices are derived by converting raw co‑occurrence counts to joint probabilities and applying the PPMI formula, PPMI(w, c) = max(PMI(w, c), 0). Context probabilities are often smoothed by raising context counts to the power α = 0.75, which avoids inflated scores for rare words.
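The derivation above can be sketched with NumPy as follows. The function name `ppmi`, the base-2 logarithm, and the toy count matrix are assumptions of this example; the smoothing raises context counts to the power alpha before normalizing.

```python
import numpy as np

def ppmi(counts, alpha=0.75):
    """PPMI matrix from a (words x contexts) co-occurrence count matrix."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    p_wc = counts / total                             # joint P(w, c)
    p_w = counts.sum(axis=1, keepdims=True) / total   # marginal P(w)
    ctx = counts.sum(axis=0) ** alpha                 # smoothed context counts
    p_c = (ctx / ctx.sum())[np.newaxis, :]            # smoothed P_alpha(c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0                      # zero counts give -inf or nan
    return np.maximum(pmi, 0.0)                       # clip negative PMI to zero

# Toy 3x3 count matrix, invented for illustration.
X_counts = np.array([[0, 2, 1],
                     [2, 0, 3],
                     [1, 3, 0]])
M = ppmi(X_counts)
```

Cells with a zero count end up with a PPMI of zero, so the result stays sparse whenever the count matrix is sparse.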
Because word‑word co‑occurrence matrices are high‑dimensional and sparse, Singular Value Decomposition (SVD) can be applied to obtain low‑dimensional dense vectors.
U, s, Vh = np.linalg.svd(X, full_matrices=False)
These dense vectors form the basis of modern word embeddings.
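To make the dimensionality reduction concrete, here is a minimal sketch of truncated SVD. The random matrix stands in for a real PPMI matrix, and the choice of k = 2 and of scaling the left singular vectors by the singular values are assumptions of this example (some implementations use the unscaled vectors instead).

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5))   # stand-in for a (words x contexts) PPMI matrix

U, s, Vh = np.linalg.svd(X, full_matrices=False)

k = 2                              # target embedding dimension (arbitrary here)
word_vectors = U[:, :k] * s[:k]    # one k-dimensional row per word
X_approx = word_vectors @ Vh[:k]   # rank-k approximation of X
```

Each row of `word_vectors` is a dense k-dimensional embedding for the corresponding row word, and `X_approx` is the closest rank-k matrix to `X` in the least-squares sense.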
Word Embeddings
Short (50‑1000 dimensions) and dense word embeddings are preferred over long, sparse TF‑IDF or PPMI vectors. They are grounded in the distributional hypothesis, aiming to encode similarity directly in the vectors.
Two key terms:
Distributional representations: embody the distributional hypothesis.
Distributed representations: vectors where meaning is spread across dimensions, as opposed to one‑hot vectors.
Word2Vec
The classic Word2Vec framework (Mikolov et al., 2013) learns a fixed‑length vector for each vocabulary word. In its Skip‑gram form, it maximizes the conditional probability of context words given a center word, which is equivalent to minimizing the negative log‑likelihood. Word pairs are scored by the dot product of their vectors and normalized with a softmax, which is often approximated with negative sampling.
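The negative-sampling approximation can be sketched for a single training pair. This is a minimal illustration of the loss only, with no sampling or gradient step; pairs are scored with dot products, and the function names and toy vectors are assumptions of this example.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center, context, negatives):
    """Negative-sampling loss for one (center, context) pair.

    center:    (d,) vector of the center word
    context:   (d,) output vector of the observed context word
    negatives: (k, d) output vectors of k sampled noise words
    """
    pos = np.log(sigmoid(context @ center))              # pull true pair together
    neg = np.log(sigmoid(-(negatives @ center))).sum()   # push noise words away
    return float(-(pos + neg))

center = np.array([2.0, 0.0])
# Context aligned with the center word, noise word pointing away: low loss.
loss_good = sgns_loss(center, np.array([2.0, 0.0]), np.array([[-2.0, 0.0]]))
# The reverse arrangement: high loss.
loss_bad = sgns_loss(center, np.array([-2.0, 0.0]), np.array([[2.0, 0.0]]))
```

The loss is small when the observed context vector has a large dot product with the center vector and the sampled noise vectors do not, which is what training pushes toward.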
Practical tips include averaging the two vectors (center and context) after training, and choosing between Skip‑gram (better for rare words) and CBOW (faster training).
Other Improved Embeddings
GloVe combines count‑based and prediction‑based approaches, factorizing a weighted co‑occurrence matrix to capture global word statistics.
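The weighted factorization can be sketched as a weighted least-squares loss over the non-zero co-occurrence counts. The defaults x_max = 100 and alpha = 0.75 follow the values reported in the GloVe paper; the function names and the dense-matrix formulation are assumptions of this example, and real implementations iterate over sparse non-zero entries instead.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting function that damps rare pairs and caps frequent ones at 1."""
    x = np.asarray(x, dtype=float)
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Sum of f(X_ij) * (w_i . w~_j + b_i + b~_j - log X_ij)^2 over X_ij > 0."""
    X = np.asarray(X, dtype=float)
    mask = X > 0
    log_x = np.log(np.where(mask, X, 1.0))            # avoid log(0)
    pred = W @ W_ctx.T + b[:, None] + b_ctx[None, :]
    err = np.where(mask, pred - log_x, 0.0)           # skip zero counts
    return float((glove_weight(X) * err ** 2).sum())

# Toy check: all counts are 1, so log X = 0 and zero parameters fit exactly.
X_toy = np.ones((2, 2))
W0 = np.zeros((2, 3))
zero_loss = glove_loss(X_toy, W0, W0, np.zeros(2), np.zeros(2))
biased_loss = glove_loss(X_toy, W0, W0, np.ones(2), np.ones(2))
```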
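fastText extends Word2Vec by representing each word as a bag of character n‑grams. This keeps training fast, improves performance on rare and morphologically rich words, and lets the model build vectors for out‑of‑vocabulary (OOV) words from their n‑grams.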
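The character n-gram idea can be illustrated without any library. This sketch wraps the word in the boundary markers `<` and `>` and collects n-grams of length 3 to 6 plus the whole marked word, in the spirit of the fastText paper; the helper name is hypothetical, and the hashing of n-grams into buckets used by the real implementation is omitted.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Set of character n-grams for a word, including the full marked word."""
    token = f"<{word}>"
    grams = {token}
    for n in range(n_min, n_max + 1):
        for i in range(len(token) - n + 1):
            grams.add(token[i:i + n])
    return grams

trigrams = char_ngrams("where", n_min=3, n_max=3)
# An unseen word still shares n-grams with known words.
shared = char_ngrams("where") & char_ngrams("wherever")
```

A word's vector is then formed from the vectors of its n-grams, so an OOV word such as "wherever" can borrow information from the n-grams it shares with "where".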
Evaluating Embeddings
Evaluation can be intrinsic (e.g., analogy tests, visualization with PCA/t‑SNE, clustering) or extrinsic (downstream tasks such as tagging, parsing, text classification). Pre‑trained embeddings are especially useful when task‑specific data is scarce.
Sentence Embeddings / Encoder
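Sentence embeddings can be built from bag‑of‑words models (e.g., TF‑IDF, SIF) or neural models. SIF (Smooth Inverse Frequency) averages word vectors weighted by a / (a + p(w)), where p(w) is the word's corpus frequency and a is a small constant, then removes each sentence vector's projection onto the first principal component.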
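A minimal sketch of SIF follows, assuming every token has an entry in both lookup tables and that the principal direction is taken from an uncentered SVD of the sentence matrix. The word vectors and word probabilities are invented toy values, and `sif_embeddings` is a hypothetical helper name.

```python
import numpy as np

def sif_embeddings(sentences, vectors, word_prob, a=1e-3):
    """SIF sentence vectors for a list of tokenized sentences."""
    # Weighted average: frequent words get weights close to zero.
    emb = np.array([
        np.mean([a / (a + word_prob[w]) * vectors[w] for w in sent], axis=0)
        for sent in sentences
    ])
    # First right singular vector = dominant direction shared by all sentences.
    _, _, vh = np.linalg.svd(emb, full_matrices=False)
    pc = vh[0]
    # Subtract each sentence's projection onto that direction.
    return emb - np.outer(emb @ pc, pc)

rng = np.random.default_rng(0)
vocab = ["the", "cat", "dog", "sat", "ran"]
vectors = {w: rng.normal(size=4) for w in vocab}          # toy word vectors
word_prob = {"the": 0.05, "cat": 0.001, "dog": 0.001,
             "sat": 0.002, "ran": 0.002}                  # toy unigram probabilities
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
sent_vecs = sif_embeddings(sentences, vectors, word_prob)
```

Removing the dominant direction strips out the component that all sentences share, which tends to reflect frequent function words rather than content.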
Skip‑Thought
Skip‑Thought trains an encoder‑decoder to predict the previous and next sentences from the current one, framing the task as generation: the decoders reconstruct the neighboring sentences word by word.
Quick‑Thought
Quick‑Thought reformulates the Skip‑Thought objective as a classification problem: given a sentence, a classifier replaces the decoder and selects the correct previous or next sentence from a set of candidates, scoring each candidate by the inner product of the encoded vectors and normalizing the scores with a softmax.
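The classification step can be sketched as a softmax over candidate scores. Scores here are plain inner products of already-encoded sentence vectors; to use cosine similarity instead, normalize the vectors to unit length first. The encoders themselves are out of scope, so the vectors below are invented toy values and the function name is an assumption of this example.

```python
import numpy as np

def candidate_probabilities(sentence_vec, candidate_vecs):
    """Softmax over inner-product scores between a sentence and candidates."""
    scores = candidate_vecs @ sentence_vec
    scores = scores - scores.max()      # shift for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

# Toy encoded vectors: candidate 0 plays the role of the true neighbor.
sentence_vec = np.array([1.0, 0.5, 0.0])
candidate_vecs = np.array([
    [0.9, 0.6, 0.1],
    [-0.5, 0.2, 0.8],
    [0.0, -1.0, 0.3],
])
probs = candidate_probabilities(sentence_vec, candidate_vecs)
loss = -np.log(probs[0])   # cross-entropy for the true context sentence
```

Training minimizes this cross-entropy, which only requires encoding sentences and comparing vectors, with no word-by-word generation.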
Other Sentence Embeddings
Supervised approaches like InferSent (trained on NLI data) and multi‑task models like GenSen (trained on tasks such as Skip‑Thought, NMT, NLI, and parsing) produce universal sentence representations.
SemAxis: Synonym Expansion
SemAxis leverages pre‑trained word vectors to define a semantic axis from seed positive and negative words, then projects other words onto this axis to find synonyms or related terms.
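A minimal sketch of the axis construction, assuming the axis is the difference between the mean positive and mean negative seed vectors and that words are scored by cosine similarity with that axis. The two-dimensional vectors are invented toy values, and both helper names are hypothetical.

```python
import numpy as np

def semantic_axis(vectors, positive, negative):
    """Axis vector pointing from the negative seed words to the positive ones."""
    pos = np.mean([vectors[w] for w in positive], axis=0)
    neg = np.mean([vectors[w] for w in negative], axis=0)
    return pos - neg

def axis_score(vectors, word, axis):
    """Cosine similarity between a word vector and the axis."""
    v = vectors[word]
    return float(v @ axis / (np.linalg.norm(v) * np.linalg.norm(axis)))

# Toy 2-dimensional vectors, invented for illustration.
vectors = {
    "good": np.array([1.0, 0.1]), "great": np.array([0.9, 0.2]),
    "bad": np.array([-1.0, 0.1]), "awful": np.array([-0.9, 0.2]),
    "excellent": np.array([0.8, 0.3]), "terrible": np.array([-0.8, 0.3]),
}
axis = semantic_axis(vectors, ["good", "great"], ["bad", "awful"])
score_excellent = axis_score(vectors, "excellent", axis)
score_terrible = axis_score(vectors, "terrible", axis)
```

Words with a high positive score lie near the positive seeds and are candidates for expansion on that side; words with a strongly negative score align with the negative seeds.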
Alibaba Cloud Developer