How Collaborative Filtering Powers Recommendations: From Manhattan to Cosine Similarity

This article walks through the fundamentals of recommendation systems, explaining collaborative filtering and various similarity measures—including Manhattan, Euclidean, Minkowski, Pearson correlation, and cosine similarity—while discussing their suitability for dense, sparse, or biased rating data and introducing K‑Nearest Neighbors for practical implementation.

StarRing Big Data Open Lab

We begin a data‑mining journey with recommendation systems, which appear throughout everyday life, such as in Amazon’s “Customers who bought this also bought” section.

In the Amazon example two elements drive recommendations: the item you viewed and other users who also viewed that item.

How to Find Similar Users?

The first step is to locate users similar to the target user. Using a simple two‑dimensional model where users rate two books with five‑star scores, we illustrate similarity calculations.

Three users’ ratings for the two books are shown in the table below.

To recommend a book to a mysterious user X who gave four stars to one book and two stars to the other, we compute distances to other users.
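The ratings table itself was lost in extraction, so as a stand‑in the data can be modeled as a dictionary of per‑user rating vectors. Only user X’s scores (four stars and two stars) come from the text; the other names and values below are hypothetical:

```python
# Each user's ratings for the two books on a five-star scale.
# X's ratings (4, 2) are stated in the article; Amy, Bill, and
# Jim are hypothetical stand-ins for the lost table.
ratings = {
    "Amy": (5, 5),
    "Bill": (2, 5),
    "Jim": (1, 4),
    "X": (4, 2),
}
```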

Manhattan Distance

The Manhattan distance between two points (x1, y1) and (x2, y2) is the sum of absolute differences: |x1 − x2| + |y1 − y2|.

Using this metric, user Amy is the closest to X, so we recommend the book Amy rated highly.
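The sum‑of‑absolute‑differences rule is a one‑liner; the sample vectors below are hypothetical ratings, with X’s (4, 2) taken from the text:

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences between two rating vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))

# |4 - 5| + |2 - 5| = 1 + 3 = 4
print(manhattan((4, 2), (5, 5)))  # -> 4
```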

Euclidean Distance

The Euclidean distance is the straight‑line distance between two points, computed via the Pythagorean theorem.

Applying this to the same users yields a distance of 4 between Amy and X.
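The Euclidean version replaces the absolute differences with squared differences under a square root. A minimal sketch, using a classic 3‑4‑5 triangle as the example rather than the article’s (unreproduced) table values:

```python
import math

def euclidean(a, b):
    """Straight-line distance between two points, via the Pythagorean theorem."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# sqrt(3^2 + 4^2) = 5
print(euclidean((3, 4), (0, 0)))  # -> 5.0
```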

Minkowski Distance

Both Manhattan and Euclidean are special cases of the Minkowski distance, where the parameter r determines the metric (r=1 for Manhattan, r=2 for Euclidean, r=∞ for Chebyshev).
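In general form, the Minkowski distance is d(x, y) = (Σ |xᵢ − yᵢ|ʳ)^(1/r), which reduces to the two previous metrics at r = 1 and r = 2. A small sketch making the parameterization explicit:

```python
def minkowski(a, b, r):
    """Generalized distance: r=1 gives Manhattan, r=2 gives Euclidean."""
    return sum(abs(x - y) ** r for x, y in zip(a, b)) ** (1 / r)

print(minkowski((4, 2), (5, 5), 1))  # Manhattan: 4.0
print(minkowski((4, 2), (5, 5), 2))  # Euclidean: sqrt(10), about 3.162
```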

Pearson Correlation Coefficient

Pearson correlation measures the linear relationship between two users’ rating vectors, ranging from –1 (perfect disagreement) to 1 (perfect agreement). It helps compare users whose rating scales differ.

An approximate formula allows single‑pass computation, useful for large datasets.
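The single‑pass form in question accumulates Σx, Σy, Σxy, Σx², and Σy² in one sweep; it is algebraically exact, though less numerically stable than the two‑pass mean‑centered form. A sketch:

```python
import math

def pearson(x, y):
    """Pearson correlation via the single-pass (computational) formula."""
    n = len(x)
    sum_x = sum_y = sum_xy = sum_x2 = sum_y2 = 0.0
    for a, b in zip(x, y):
        sum_x += a
        sum_y += b
        sum_xy += a * b
        sum_x2 += a * a
        sum_y2 += b * b
    num = sum_xy - sum_x * sum_y / n
    den = math.sqrt(sum_x2 - sum_x ** 2 / n) * math.sqrt(sum_y2 - sum_y ** 2 / n)
    return num / den if den else 0.0

# A user who rates every book exactly one star higher than another
# still correlates perfectly: the linear relationship is what counts.
print(pearson([1, 2, 3, 4], [2, 3, 4, 5]))  # -> 1.0
```

This is why Pearson suits users whose rating scales differ: a habitual harsh grader and a habitual generous grader can still agree perfectly.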

Cosine Similarity

Cosine similarity evaluates the angle between two rating vectors, handling sparse data well. It is computed as the dot product divided by the product of the vectors’ magnitudes.

In the worked example, two users with closely aligned (though not identical) rating vectors yield a cosine similarity of 0.935, indicating a strong match; identical directions would give exactly 1.
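The dot‑product‑over‑magnitudes computation is short; the vectors below are hypothetical sparse ratings, where zeros stand for unrated items and contribute nothing to the score:

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vectors' magnitudes."""
    dot = sum(x * y for x, y in zip(a, b))
    mag_a = math.sqrt(sum(x * x for x in a))
    mag_b = math.sqrt(sum(y * y for y in b))
    return dot / (mag_a * mag_b) if mag_a and mag_b else 0.0

# Identical rating vectors point in the same direction.
print(cosine_similarity((4, 0, 5), (4, 0, 5)))  # -> 1.0
```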

K‑Nearest Neighbors

In collaborative filtering, the K‑Nearest Neighbors algorithm selects the K most similar users (based on Pearson, Manhattan, Euclidean, or cosine) to generate recommendations, weighting each neighbor by its similarity.
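A minimal sketch of that weighting scheme, assuming cosine similarity and entirely hypothetical neighbor data (each neighbor carries a rating vector plus a score for the candidate item); each of the K nearest neighbors contributes in proportion to its share of the total similarity:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    mag = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / mag if mag else 0.0

def knn_recommend(target, neighbors, k=2, similarity=cosine):
    """Predict the target user's rating for a candidate item by blending
    the k most similar neighbors, each weighted by its share of the
    total similarity among those k."""
    scored = sorted(
        ((similarity(target, vec), rating) for vec, rating in neighbors.values()),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in scored)
    return sum(sim / total * rating for sim, rating in scored) if total else 0.0

# Hypothetical data: neighbors' rating vectors and their scores
# for the item being considered for user X, who rated (4, 2).
neighbors = {"Amy": ((5, 5), 4.0), "Bill": ((2, 5), 3.0), "Jim": ((1, 4), 2.0)}
prediction = knn_recommend((4, 2), neighbors, k=2)
print(round(prediction, 3))  # a blend between Amy's 4.0 and Bill's 3.0
```

Using K > 1 smooths out the risk of trusting a single unrepresentative neighbor, at the cost of diluting the influence of the very closest one.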

Finally, a small project suggestion encourages implementing distance calculations and applying a recommendation algorithm to a movie‑rating dataset.

Tags: machine learning, data mining, collaborative filtering, recommendation systems, similarity metrics
Written by StarRing Big Data Open Lab

Focused on big data technology research, exploring the Big Data era | [email protected]