
Understanding Cosine Similarity: From Mathematical Foundations to Practical Applications

The article explains cosine similarity from basic geometry and vector math, derives its formula, and shows how it powers precision marketing, image classification, and text retrieval, while also detailing its industrial implementation in Lucene’s vector space model.

vivo Internet Technology

Most programmers with STEM backgrounds possess foundational knowledge in mathematics, including calculus, linear algebra, and probability theory. When the machine learning boom arrived, many were eager to explore machine learning algorithms and their underlying mathematical principles. However, in practice, some find their mathematical understanding insufficient to fully grasp the meaning behind formulas. This article addresses these knowledge gaps by focusing on cosine similarity.

Cosine similarity calculation has extensive applications and serves as a core component in search engines, recommendation systems, and classification/clustering scenarios. To understand cosine similarity thoroughly, the author starts from basic middle school mathematics and gradually derives the cosine formula, then demonstrates practical examples based on the formula.

Business Background: Three seemingly unrelated business scenarios—precision marketing, image processing, and search engines—share a common challenge: similarity calculation. Whether it's user similarity in crowd expansion for precision marketing, image similarity in image classification, or query-document similarity in search engines, cosine similarity is the most familiar approach.

Mathematical Foundation:

1. Pythagorean Theorem: To ensure a constructed quadrilateral is a square, both equal side lengths and right angles are required. The ancient solution used the converse of the Pythagorean theorem: construct a triangle with sides 3, 4, and 5, and the angle between the sides of length 3 and 4 (the angle opposite the side of length 5) is a right angle, because 3² + 4² = 5².

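2. Cosine Theorem (Law of Cosines): The Pythagorean theorem only applies to right triangles. For general triangles, the cosine theorem relates the three side lengths to the cosine of one of the angles, and it reduces to the Pythagorean theorem when that angle is 90 degrees.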

3. Cosine Similarity: By introducing Cartesian coordinates, triangles can be represented more flexibly through vectors. A vector's length is essentially the Pythagorean theorem extended to N-dimensional space. Combining the Pythagorean theorem, cosine theorem, Cartesian coordinates, and vectors leads naturally to the cosine formula.
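
The chain of reasoning in the three steps above can be written out as a short derivation. This is a sketch in notation chosen here for illustration (the vectors A and B, the angle θ between them, and the dimension N are not named in the original):

```latex
% Law of cosines for a triangle with sides a, b, c and angle \theta opposite c
c^2 = a^2 + b^2 - 2ab\cos\theta
% With \theta = 90^\circ, \cos\theta = 0 and this reduces to c^2 = a^2 + b^2.

% Place two N-dimensional vectors A and B at the origin; the third side is A - B.
\|A - B\|^2 = \|A\|^2 + \|B\|^2 - 2\,\|A\|\,\|B\|\cos\theta

% Vector length is the Pythagorean theorem in N dimensions.
\|A\| = \sqrt{\sum_{i=1}^{N} A_i^2}

% Expanding the left-hand side coordinate by coordinate:
\|A - B\|^2 = \sum_{i=1}^{N} (A_i - B_i)^2
            = \|A\|^2 + \|B\|^2 - 2\sum_{i=1}^{N} A_i B_i

% Equating the two expressions for \|A - B\|^2 gives the cosine formula:
\cos\theta = \frac{\sum_{i=1}^{N} A_i B_i}
                  {\sqrt{\sum_{i=1}^{N} A_i^2}\;\sqrt{\sum_{i=1}^{N} B_i^2}}
```

The result depends only on the direction of the two vectors, not on their lengths, which is why it works as a similarity measure for vectors of very different magnitudes.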

Business Practice:

Case 1: Precision Marketing: The core problem is how to vectorize users. Each user is represented as a vector, with each tag value as a dimension. For a user group, the average of all users' dimension values becomes the group vector. By calculating cosine similarity between each user in the broader population and the target group, top-N similar users can be selected for crowd expansion.
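
A minimal sketch of this crowd-expansion flow, assuming each user is already encoded as a list of numeric tag values. The function names (`cosine_similarity`, `group_vector`, `expand_audience`) and the sample data are hypothetical and chosen for illustration; they are not taken from the original article.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def group_vector(users):
    """Average each dimension over all users in the seed group."""
    n = len(users)
    return [sum(column) / n for column in zip(*users)]

def expand_audience(seed_users, candidates, top_n):
    """Return the top_n (user_id, similarity) pairs closest to the seed group."""
    centroid = group_vector(seed_users)
    scored = [(uid, cosine_similarity(vec, centroid))
              for uid, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical data: three tag dimensions per user.
seed = [[1.0, 0.0, 1.0], [1.0, 0.2, 0.8]]
candidates = {
    "u1": [0.9, 0.1, 1.0],
    "u2": [0.0, 1.0, 0.0],
    "u3": [1.0, 0.0, 0.5],
}
top = expand_audience(seed, candidates, top_n=2)
print(top)
```

With this sample data the two selected users are `u1` and `u3`, whose tag profiles point in nearly the same direction as the group average, while `u2` is almost orthogonal to it.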

Case 2: Image Classification: The core problem is how to vectorize images. Images consist of pixels with RGB channels. Images are divided into grids, with each grid having 3 dimensions representing the mean RGB values of pixels within that grid.
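
The grid-based vectorization can be sketched as follows. This assumes an image is given as a list of rows of `(r, g, b)` tuples and that its height and width are divisible by the grid size; the function name `image_to_vector` and the tiny sample images are hypothetical.

```python
import math

def image_to_vector(pixels, grid):
    """Split the image into grid x grid cells and emit the mean R, G, B of each.

    pixels: list of rows, each row a list of (r, g, b) tuples.
    grid:   number of cells per side (assumed to divide height and width).
    """
    height, width = len(pixels), len(pixels[0])
    cell_h, cell_w = height // grid, width // grid
    vector = []
    for gy in range(grid):
        for gx in range(grid):
            sums = [0, 0, 0]
            for y in range(gy * cell_h, (gy + 1) * cell_h):
                for x in range(gx * cell_w, (gx + 1) * cell_w):
                    for channel in range(3):
                        sums[channel] += pixels[y][x][channel]
            count = cell_h * cell_w
            vector.extend(total / count for total in sums)
    return vector

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical 2x2 images: two mostly red, one blue.
red_a = [[(255, 0, 0), (250, 5, 5)], [(255, 0, 0), (245, 10, 0)]]
red_b = [[(240, 10, 10), (255, 0, 0)], [(250, 0, 5), (255, 5, 0)]]
blue = [[(0, 0, 255), (0, 0, 255)], [(0, 0, 255), (0, 0, 255)]]

sim_red = cosine_similarity(image_to_vector(red_a, 2), image_to_vector(red_b, 2))
sim_blue = cosine_similarity(image_to_vector(red_a, 2), image_to_vector(blue, 2))
print(sim_red, sim_blue)
```

A grid of `g` cells per side yields a vector with `3 * g * g` dimensions, so a finer grid keeps more spatial detail at the cost of a longer vector.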

Case 3: Text Retrieval: The core problem is how to vectorize text and search queries. Each word becomes a dimension, and the word's frequency in the document becomes the dimension value. After vectorization, cosine similarity calculates relevance between query and documents.
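
A minimal sketch of term-frequency vectorization and ranking, assuming whitespace tokenization and lowercase normalization (a simplification; real systems use proper analyzers). The names `tf_vector` and `rank` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Sparse term-frequency vector: one dimension per distinct word."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as Counters."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty texts
    return dot / (norm_a * norm_b)

def rank(query, docs):
    """Return (doc_id, score) pairs sorted from most to least relevant."""
    q = tf_vector(query)
    scored = [(doc_id, cosine_similarity(q, tf_vector(text)))
              for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "d1": "cosine similarity measures the angle between vectors",
    "d2": "lucene is a search library",
    "d3": "similarity search with cosine similarity",
}
ranked = rank("cosine similarity", docs)
print(ranked)
```

Storing only the non-zero dimensions keeps the vectors small even though the vocabulary, and therefore the nominal dimensionality, can be very large.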

Beyond Cosine Similarity: In industrial systems such as Lucene (the library at the core of Elasticsearch), cosine similarity is implemented through the Vector Space Model using TF-IDF features. The practical formula involves four steps: computing the dot product of the query and document vectors, computing the query vector length (queryNorm), computing the document vector length with length normalization, and incorporating user-supplied weights and other scoring factors.
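
The four steps can be illustrated with a simplified TF-IDF scorer. This is a sketch under stated assumptions, not Lucene's actual code or API: the smoothed IDF variant, the raw-count term frequency, the length normalization by the square root of the document length, and the names `idf`, `score`, and `boosts` are all choices made here for illustration.

```python
import math
from collections import Counter

def idf(term, docs):
    """Smoothed inverse document frequency (one common variant, assumed here)."""
    doc_freq = sum(1 for terms in docs.values() if term in terms)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))

def score(query_terms, doc_terms, docs, boosts=None):
    """Score one tokenized document against a tokenized query."""
    if not doc_terms:
        return 0.0
    boosts = boosts or {}
    tf = Counter(doc_terms)

    # Step 4 (folded in early): per-term user weight, defaulting to 1.0.
    query_weights = {t: idf(t, docs) * boosts.get(t, 1.0)
                     for t in set(query_terms)}

    # Step 1: dot product of query weights and document TF-IDF weights.
    dot = sum(w * tf[t] * idf(t, docs) for t, w in query_weights.items())

    # Step 2: query vector length, applied as a normalization factor.
    sum_sq = sum(w * w for w in query_weights.values())
    query_norm = 1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 0.0

    # Step 3: document length normalization (shorter documents score higher).
    length_norm = 1.0 / math.sqrt(len(doc_terms))

    return dot * query_norm * length_norm

# Hypothetical tokenized corpus.
docs = {
    "d1": ["cosine", "similarity", "in", "lucene"],
    "d2": ["lucene", "search", "engine", "internals",
           "and", "lucene", "scoring", "details"],
    "d3": ["image", "classification"],
}
for doc_id, terms in docs.items():
    print(doc_id, score(["lucene"], terms, docs))
```

Replacing the exact document vector length with a cheaper length-based factor is the kind of approximation that distinguishes a practical scoring formula from the textbook cosine, which is why the resulting score is no longer guaranteed to lie between 0 and 1.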

Summary: This article introduces the mathematical background of cosine similarity, starting from Egyptian pyramid construction through the Pythagorean theorem to the cosine theorem, ultimately deriving the cosine formula through vector mathematics. Three business scenarios demonstrate how mathematical models translate into practical applications, followed by an industrial-grade example showing how cosine similarity is implemented in Lucene.
