From Word2Vec to Quick-Thought: A Complete Guide to Modern Embeddings

This article reviews the evolution of word and sentence embeddings, covering foundations such as vector semantics and the distributional hypothesis; practical models such as Word2Vec, GloVe, fastText, Skip‑Thought, and Quick‑Thought; and evaluation techniques, along with implementation tips and real‑world use cases.

Alibaba Cloud Developer

Jun 18, 2019

Embedding Evolution Overview

In 2013 Word2Vec sparked a wave of pre‑trained word embedding techniques in NLP, and six years later embedding methods have become standard components of neural NLP pipelines, evolving at multiple levels.

From a practical standpoint, fastText is often the default choice for word embeddings, while contextual embeddings such as ELMo, and later pre‑trained language models like ULMFiT and BERT, provide richer context.

Sentence embeddings have progressed from unsupervised models like Skip‑Thought (2015) to Quick‑Thought (2018), gaining increasing adoption in industry.

Vector Semantics

A perfect vector representation would capture all layers of word meaning, but this is unrealistic. The most successful models for representing word meaning are based on vector semantics, which consists of two parts:

Distributional hypothesis: words occurring in similar contexts tend to have similar meanings.

Vector representation: each word is defined as a point in an N‑dimensional semantic space, learned directly from its distribution in text.

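Vector representations simplify word similarity calculations: the similarity between two words can be computed geometrically, for example as the cosine of the angle between their vectors.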
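As a minimal sketch of how similarity becomes a vector computation, the snippet below measures cosine similarity between word vectors. The three‑dimensional vectors are hypothetical values chosen only for illustration, not output from any trained model.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical toy vectors, hand-picked for illustration only.
cat = np.array([0.9, 0.8, 0.1])
dog = np.array([0.8, 0.9, 0.2])
car = np.array([0.1, 0.2, 0.9])

sim_cat_dog = cosine_similarity(cat, dog)  # vectors point in similar directions
sim_cat_car = cosine_similarity(cat, car)  # vectors point in different directions
```

With these toy values, "cat" scores higher against "dog" than against "car", which is the behavior a useful embedding space is expected to show.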

Co‑occurrence Matrix and Basic Vector Semantics Models

Vector semantics models are usually built on co‑occurrence matrices, which implement the distributional hypothesis. Two main types of co‑occurrence matrices are used:

Term‑document matrix (common in information retrieval).

Word‑word matrix (common for word embeddings).

The term‑document matrix represents each word as a row and each document as a column, with cell values indicating word frequencies. TF‑IDF weighting mitigates the dominance of high‑frequency stopwords.

TF‑IDF (term frequency–inverse document frequency) is a fundamental weighting technique in NLP and information retrieval, emphasizing words that are frequent in a document but rare across the corpus.
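The weighting can be sketched on a toy term‑document matrix. This example assumes one common variant (log‑scaled term frequency and idf = log10(N / df)); other variants exist, and the counts are invented for illustration.

```python
import numpy as np

# Toy term-document count matrix: rows are terms, columns are documents.
terms = ["the", "cat", "stock"]
counts = np.array([
    [5, 4, 6],   # "the" appears in every document
    [3, 0, 0],   # "cat" appears only in document 0
    [0, 2, 4],   # "stock" appears in documents 1 and 2
], dtype=float)

n_docs = counts.shape[1]

# Log-scaled term frequency: 1 + log10(count) where count > 0, else 0.
nonzero = counts > 0
tf = np.zeros_like(counts)
tf[nonzero] = 1.0 + np.log10(counts[nonzero])

# Inverse document frequency: log10(N / df), df = documents containing the term.
df = nonzero.sum(axis=1)
idf = np.log10(n_docs / df)

tfidf = tf * idf[:, None]
```

Because "the" occurs in every document, its idf is zero and its whole row is weighted down to zero, while "cat" keeps a high weight in the one document where it appears.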

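The word‑word matrix captures how often a target word co‑occurs with context words within a sliding window. Pointwise Mutual Information (PMI) quantifies the strength of association between a word pair by comparing how often the pair actually co‑occurs with how often it would if the two words were independent. Positive PMI (PPMI), which replaces negative PMI values with zero, is commonly used in practice.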
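Sliding‑window counting can be sketched as follows. The function name `cooccurrence_counts` and the symmetric window are assumptions made for this illustration; real pipelines often add distance weighting or subsampling.

```python
from collections import Counter

def cooccurrence_counts(tokens, window=2):
    """Count (target, context) pairs within a symmetric window around each token."""
    counts = Counter()
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a token is not its own context
                counts[(target, tokens[j])] += 1
    return counts

tokens = "the cat sat on the mat".split()
counts = cooccurrence_counts(tokens, window=1)
```

With a symmetric window, the resulting counts are symmetric too: the count for (cat, sat) equals the count for (sat, cat).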

PPMI matrices are derived by converting raw co‑occurrence counts to joint probabilities and applying the PPMI formula. The context probabilities are often smoothed by raising context counts to the power α = 0.75, which avoids inflated scores for rare words.
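A minimal sketch of this computation, assuming a dense word‑by‑context count matrix and base‑2 logarithms. The function name `ppmi` and the toy counts are illustrative only.

```python
import numpy as np

def ppmi(counts: np.ndarray, alpha: float = 0.75) -> np.ndarray:
    """PPMI from a word-by-context count matrix, with smoothed context probabilities."""
    total = counts.sum()
    p_wc = counts / total                     # joint probabilities P(w, c)
    p_w = counts.sum(axis=1) / total          # target-word marginals P(w)
    ctx = counts.sum(axis=0) ** alpha         # context counts raised to alpha
    p_c = ctx / ctx.sum()                     # smoothed context probabilities
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log2(p_wc / (p_w[:, None] * p_c[None, :]))
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts give -inf or nan; use 0
    return np.maximum(pmi, 0.0)               # clip negative PMI to zero

# Toy counts, invented for illustration.
counts = np.array([
    [10, 0, 2],
    [0, 8, 1],
    [3, 1, 0],
], dtype=float)
M = ppmi(counts)
```

Raising context counts to a power below 1 increases the probability assigned to rare contexts, which lowers their PMI and counteracts the bias toward rare events.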

Because word‑word co‑occurrence matrices are high‑dimensional and sparse, Singular Value Decomposition (SVD) can be applied to obtain low‑dimensional dense vectors.

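import numpy as np
U, s, Vh = np.linalg.svd(X, full_matrices=False)  # X: the PPMI matrix; the leading columns of U (optionally scaled by s) serve as dense word vectors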

These dense vectors form the basis of modern word embeddings.
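The truncation step can be sketched as below. The random matrix stands in for a real PPMI matrix, and `k = 2` is an arbitrary choice for the example; scaling the kept columns by the singular values is one common option, not the only one.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 6))   # stand-in for a PPMI matrix (6 words x 6 contexts)
k = 2                    # number of dimensions to keep (arbitrary for this sketch)

U, s, Vh = np.linalg.svd(X, full_matrices=False)

# Keep the k directions with the largest singular values.
word_vectors = U[:, :k] * s[:k]      # one dense k-dimensional row per word

# Rank-k approximation of X, to check how much structure k dimensions retain.
X_approx = word_vectors @ Vh[:k, :]
approx_error = np.linalg.norm(X - X_approx)
```

NumPy returns the singular values in descending order, so slicing the first k columns keeps the most informative directions.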

Word Embeddings

Short (50‑1000 dimensions) and dense word embeddings are preferred over long, sparse TF‑IDF or PPMI vectors. They are grounded in the distributional hypothesis, aiming to encode similarity directly in the vectors.

Two key terms:

Distributional representations: representations built from the contexts in which a word occurs, embodying the distributional hypothesis.

Distributed representations: vectors where meaning is spread across dimensions, as opposed to one‑hot vectors.

Word2Vec

The classic Word2Vec framework (Mikolov et al., 2013) learns a fixed‑length vector for each vocabulary word. In its Skip‑gram form, it maximizes the conditional probability of context words given a center word, which is equivalent to minimizing the negative log‑likelihood. The probability is a softmax over dot products between center and context vectors, and the full softmax is often approximated with negative sampling.
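The Skip‑gram objective can be sketched in NumPy as follows. Scores here are plain dot products between a center vector and context vectors, as in the original Word2Vec formulation. The function names and toy vectors are assumptions for illustration, and no training loop or gradient update is shown.

```python
import numpy as np

def softmax_context_probs(center_vec, context_matrix):
    """P(context word | center word) for every vocabulary word: softmax over dot products."""
    scores = context_matrix @ center_vec
    scores = scores - scores.max()            # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(center_vec, context_vec, negative_vecs):
    """Loss for one (center, context) pair plus k sampled negative words."""
    positive_term = np.log(sigmoid(context_vec @ center_vec))
    negative_term = np.log(sigmoid(-(negative_vecs @ center_vec))).sum()
    return float(-(positive_term + negative_term))

# Toy 2-dimensional vectors, invented for illustration.
center = np.array([1.0, 0.0])
context_matrix = np.array([[2.0, 0.0], [0.0, 1.0], [-2.0, 0.0]])

probs = softmax_context_probs(center, context_matrix)
loss_close = negative_sampling_loss(center, context_matrix[0], context_matrix[1:2])
loss_far = negative_sampling_loss(center, context_matrix[2], context_matrix[1:2])
```

Negative sampling replaces the sum over the whole vocabulary in the softmax with a handful of binary decisions, which is why it is much cheaper per training pair.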

Practical tips include averaging the two vectors (center and context) after training, and choosing between Skip‑gram (better for rare words) and CBOW (faster training).

Other Improved Embeddings

GloVe combines count‑based and prediction‑based approaches: it fits word vectors to the logarithm of global co‑occurrence counts using a weighted least‑squares objective, capturing global word statistics.
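A sketch of the GloVe objective, assuming the weighting function from the GloVe paper with its reported defaults (x_max = 100, α = 0.75). The function names and the random toy data are illustrative; optimization is omitted.

```python
import numpy as np

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Down-weight rare co-occurrences and cap the weight of frequent ones at 1."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss(W, W_ctx, b, b_ctx, X):
    """Weighted squared error between (w_i . w~_j + b_i + b~_j) and log X_ij."""
    i, j = np.nonzero(X)                      # only observed co-occurrences contribute
    pred = (W[i] * W_ctx[j]).sum(axis=1) + b[i] + b_ctx[j]
    err = pred - np.log(X[i, j])
    return float((glove_weight(X[i, j]) * err ** 2).sum())

# Random toy parameters and counts, for illustration only.
rng = np.random.default_rng(0)
X = np.array([[0.0, 12.0, 3.0], [12.0, 0.0, 150.0], [3.0, 150.0, 0.0]])
W, W_ctx = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
b, b_ctx = np.zeros(3), np.zeros(3)
loss = glove_loss(W, W_ctx, b, b_ctx, X)
```

Skipping zero entries keeps the cost proportional to the number of observed pairs rather than the full vocabulary squared.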

fastText extends Word2Vec by representing each word as a bag of character n‑grams. Morphologically related words share n‑gram parameters, training remains fast, and out‑of‑vocabulary (OOV) words can still be assigned vectors from their n‑grams.
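The subword idea can be sketched as below: a word is wrapped in boundary markers, split into character n‑grams (3 to 6 characters here), and its vector is the sum of the n‑gram vectors. The bucket table is random rather than trained, and `zlib.crc32` is only a stand‑in for the library's own hash function. The real model also keeps a vector for the whole word, which this sketch omits.

```python
import zlib
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """Character n-grams of '<word>', where < and > mark the word boundaries."""
    token = f"<{word}>"
    return [token[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(token) - n + 1)]

NUM_BUCKETS = 1000   # hypothetical size of the hashed n-gram table
DIM = 8              # hypothetical embedding dimension
rng = np.random.default_rng(0)
bucket_vectors = rng.normal(size=(NUM_BUCKETS, DIM))  # stand-in for trained vectors

def word_vector(word):
    """Sum the vectors of the word's n-grams; works for unseen words too."""
    ids = [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in char_ngrams(word)]
    return bucket_vectors[ids].sum(axis=0)

trigrams = char_ngrams("where", 3, 3)
oov_vector = word_vector("embeddingology")  # a made-up word not in any vocabulary
```

Because the lookup depends only on character n‑grams, any string yields a vector, which is how OOV words are handled.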

Evaluating Embeddings

Evaluation can be intrinsic (e.g., analogy tests, visualization with PCA/t‑SNE, clustering) or extrinsic (downstream tasks such as tagging, parsing, text classification). Pre‑trained embeddings are especially useful when task‑specific data is scarce.
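An analogy test can be sketched as follows: for "a is to b as c is to ?", compute b − a + c and return the nearest remaining word by cosine similarity. The vectors are hand‑built so the arithmetic works exactly; real embeddings only approximate this.

```python
import numpy as np

# Hand-built toy vectors (hypothetical), arranged so that king - man + woman = queen.
vectors = {
    "man":   np.array([1.0, 0.0, 0.0]),
    "woman": np.array([1.0, 1.0, 0.0]),
    "king":  np.array([1.0, 0.0, 1.0]),
    "queen": np.array([1.0, 1.0, 1.0]),
    "apple": np.array([-1.0, 0.5, 0.0]),
}

def analogy(a, b, c, vectors):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = vectors[b] - vectors[a] + vectors[c]
    best_word, best_sim = None, -np.inf
    for word, vec in vectors.items():
        if word in (a, b, c):        # exclude the query words themselves
            continue
        sim = np.dot(vec, target) / (np.linalg.norm(vec) * np.linalg.norm(target))
        if sim > best_sim:
            best_word, best_sim = word, sim
    return best_word

answer = analogy("man", "king", "woman", vectors)
```

Excluding the three query words matters in practice, since one of them is often the nearest neighbor of the target vector.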

Sentence Embeddings / Encoder

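Sentence embeddings can be built from bag‑of‑words models (e.g., TF‑IDF, SIF) or neural models. SIF (Smooth Inverse Frequency) averages a sentence's word vectors with weights a / (a + p(w)), where p(w) is the word's frequency in the corpus and a is a small constant, so frequent words contribute less. It then removes each sentence vector's projection onto the first principal component of the sentence vectors.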
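A minimal SIF sketch, assuming weights of the form a / (a + p(w)) and taking the first principal direction as the first right singular vector of the uncentered sentence matrix. The word vectors are random and the word probabilities are invented, so only the mechanics are meaningful.

```python
import numpy as np

def sif_embeddings(sentences, word_vecs, word_prob, a=1e-3):
    """Return (weighted averages, averages with the first component removed)."""
    averaged = np.vstack([
        np.mean([a / (a + word_prob[w]) * word_vecs[w] for w in sent], axis=0)
        for sent in sentences
    ])
    # First principal direction of the sentence matrix (unit-length vector).
    _, _, vt = np.linalg.svd(averaged, full_matrices=False)
    first_pc = vt[0]
    # Subtract each sentence's projection onto that direction.
    cleaned = averaged - np.outer(averaged @ first_pc, first_pc)
    return averaged, cleaned

rng = np.random.default_rng(0)
vocab = ["the", "a", "cat", "dog", "stock", "sat", "ran", "fell"]
word_vecs = {w: rng.normal(size=4) for w in vocab}          # random stand-ins
word_prob = {"the": 0.05, "a": 0.03, "cat": 0.001, "dog": 0.001,
             "stock": 0.0005, "sat": 0.0008, "ran": 0.0008, "fell": 0.0006}
sentences = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "stock", "fell"]]

averaged, sif = sif_embeddings(sentences, word_vecs, word_prob)
```

With a = 1e-3, a word with probability 0.05 gets a weight of about 0.02, while a word with probability 0.001 gets 0.5, so content words dominate the average.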

Skip‑Thought

Skip‑Thought trains an encoder‑decoder to generate the previous and next sentences from the current one, treating sentence representation learning as a generation task.

Quick‑Thought

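Quick‑Thought reformulates the Skip‑Thought objective as a classification problem: given a sentence, a classifier (which replaces the decoder) selects the correct previous or next sentence from a set of candidates. Candidates are scored by the inner product of their encodings with the input sentence's encoding, followed by a softmax.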
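The classification objective can be sketched as a softmax cross‑entropy over candidate sentences. This sketch assumes sentence encodings are already available as vectors and scores candidates by inner product, as in the Quick‑Thought paper. The encoders themselves are not shown, and the toy vectors are invented.

```python
import numpy as np

def quick_thought_loss(sentence_vec, candidate_vecs, true_index):
    """Negative log-probability of picking the true context sentence among candidates."""
    scores = candidate_vecs @ sentence_vec            # inner-product score per candidate
    scores = scores - scores.max()                    # subtract max for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum()) # log-softmax over candidates
    return float(-log_probs[true_index])

# Toy encodings: candidate 0 is aligned with the input sentence, candidate 2 is opposed.
sentence_vec = np.array([1.0, 0.0])
candidate_vecs = np.array([[3.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

loss_if_true_is_0 = quick_thought_loss(sentence_vec, candidate_vecs, 0)
loss_if_true_is_2 = quick_thought_loss(sentence_vec, candidate_vecs, 2)
```

Since nothing is generated word by word, training reduces to encoding a batch of sentences and computing one score matrix, which makes this objective much cheaper than decoding.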

Other Sentence Embeddings

Supervised approaches like InferSent (trained on NLI data) and multi‑task models like GenSen (trained on tasks such as Skip‑Thought, NMT, NLI, and parsing) produce universal sentence representations.

SemAxis: Synonym Expansion

SemAxis leverages pre‑trained word vectors to define a semantic axis from seed positive and negative words, taken as the difference between the mean vectors of the two seed sets. It then scores other words by their cosine similarity with this axis to find synonyms or related terms.
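A sketch of the axis construction, assuming the axis is the difference between the mean vectors of the two seed sets and that words are scored by cosine similarity with it. The two‑dimensional vectors are hypothetical, not taken from a trained model.

```python
import numpy as np

# Hypothetical toy vectors, for illustration only.
vectors = {
    "good":      np.array([1.0, 0.1]),
    "great":     np.array([0.9, 0.2]),
    "bad":       np.array([-1.0, 0.1]),
    "awful":     np.array([-0.9, 0.2]),
    "excellent": np.array([0.8, 0.3]),
    "terrible":  np.array([-0.8, 0.3]),
}

def semantic_axis(positive_seeds, negative_seeds, vectors):
    """Axis pointing from the negative seed centroid to the positive seed centroid."""
    pos = np.mean([vectors[w] for w in positive_seeds], axis=0)
    neg = np.mean([vectors[w] for w in negative_seeds], axis=0)
    return pos - neg

def axis_score(word, axis, vectors):
    """Cosine similarity between a word vector and the axis (range -1 to 1)."""
    v = vectors[word]
    return float(v @ axis / (np.linalg.norm(v) * np.linalg.norm(axis)))

axis = semantic_axis(["good", "great"], ["bad", "awful"], vectors)
score_excellent = axis_score("excellent", axis, vectors)
score_terrible = axis_score("terrible", axis, vectors)
```

Ranking the vocabulary by this score and keeping the top entries gives candidate expansions for the positive seed set; the lowest entries do the same for the negative set.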
