How Collaborative Filtering Powers Recommendations: From Manhattan to Cosine Similarity
This article walks through the fundamentals of recommendation systems. It explains collaborative filtering and several similarity measures (Manhattan, Euclidean, Minkowski, Pearson correlation, and cosine similarity), discusses their suitability for dense, sparse, or biased rating data, and introduces K‑Nearest Neighbors for practical implementation.
We begin our data‑mining journey with recommendation systems, which appear everywhere; Amazon’s “Customers who bought this also bought” section is a familiar example.
In the Amazon example, two elements drive recommendations: the item you viewed and the other users who also viewed that item.
How Do We Find Similar Users?
The first step is to locate users similar to the target user. Using a simple two‑dimensional model in which users rate two books on a five‑star scale, we illustrate how similarity can be calculated.
Three users’ ratings for the two books are shown in the table below.
To recommend a book to a mysterious user X, who gave four stars to one book and two stars to the other, we compute X’s distance to each of the other users.
Manhattan Distance
The Manhattan distance between two points (x1, y1) and (x2, y2) is the sum of absolute differences: |x1 − x2| + |y1 − y2|.
Using this metric, user Amy is the closest to X, so we recommend the book Amy rated highly.
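To make this concrete, here is a minimal Python sketch of the Manhattan distance between two rating vectors. X’s scores (4, 2) come from the article; the second user’s scores are hypothetical stand‑ins, since the article’s rating table is not reproduced here.

```python
def manhattan(rating1, rating2):
    """Manhattan distance: sum of absolute differences between ratings."""
    return sum(abs(a - b) for a, b in zip(rating1, rating2))

# X's ratings from the article: four stars and two stars.
x = (4, 2)
# A hypothetical second user, for illustration only.
other = (5, 5)
print(manhattan(x, other))  # |4 - 5| + |2 - 5| = 4
```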
Euclidean Distance
The Euclidean distance is the straight‑line distance between two points, computed via the Pythagorean theorem.
Applying this to the same users yields a distance of 4 between Amy and X.
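A corresponding sketch for the Euclidean distance, using the same illustrative pair of users as above:

```python
import math

def euclidean(rating1, rating2):
    """Euclidean distance: straight-line distance via the Pythagorean theorem."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(rating1, rating2)))

x, other = (4, 2), (5, 5)   # X's ratings from the article; a hypothetical user
print(euclidean(x, other))  # sqrt(1 + 9) ≈ 3.162
```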
Minkowski Distance
Both Manhattan and Euclidean distances are special cases of the Minkowski distance, d(x, y) = (Σ |x_k − y_k|^r)^(1/r), where the parameter r determines the metric (r = 1 gives Manhattan, r = 2 gives Euclidean, and the limit r → ∞ gives Chebyshev).
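Both metrics therefore fall out of a single function. The sketch below takes r as a parameter and reproduces Manhattan at r = 1 and Euclidean at r = 2:

```python
def minkowski(rating1, rating2, r):
    """Minkowski distance: r = 1 gives Manhattan, r = 2 gives Euclidean."""
    return sum(abs(a - b) ** r for a, b in zip(rating1, rating2)) ** (1.0 / r)

x, other = (4, 2), (5, 5)      # same illustrative pair as above
print(minkowski(x, other, 1))  # 4.0, the Manhattan distance
print(minkowski(x, other, 2))  # ~3.162, the Euclidean distance
```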
Pearson Correlation Coefficient
Pearson correlation measures the linear relationship between two users’ rating vectors, ranging from −1 (perfect disagreement) to 1 (perfect agreement). Because it is insensitive to shifts and scaling, it helps compare users whose rating scales differ, such as one who rates everything high and one who rates everything low.
An approximate formula allows the coefficient to be computed in a single pass over the data, which is useful for large datasets: r ≈ (Σxy − (Σx·Σy)/n) / (√(Σx² − (Σx)²/n) · √(Σy² − (Σy)²/n)).
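A sketch of that single‑pass computation, assuming each user’s ratings are stored in a dict keyed by item (only items rated by both users are compared):

```python
from math import sqrt

def pearson(rating1, rating2):
    """Single-pass approximation of the Pearson correlation coefficient."""
    sum_xy = sum_x = sum_y = sum_x2 = sum_y2 = n = 0
    for item in rating1:
        if item in rating2:
            x, y = rating1[item], rating2[item]
            n += 1
            sum_xy += x * y
            sum_x += x
            sum_y += y
            sum_x2 += x * x
            sum_y2 += y * y
    if n == 0:
        return 0.0  # no items rated by both users
    denominator = sqrt(sum_x2 - sum_x ** 2 / n) * sqrt(sum_y2 - sum_y ** 2 / n)
    return (sum_xy - sum_x * sum_y / n) / denominator if denominator else 0.0
```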
Cosine Similarity
Cosine similarity evaluates the angle between two rating vectors and handles sparse data well, since items that neither user has rated do not affect the result. It is computed as the dot product of the vectors divided by the product of their magnitudes: cos(x, y) = (x · y) / (‖x‖ · ‖y‖).
A worked example with two users whose rating vectors point in nearly the same direction yields a cosine similarity of 0.935, close to the maximum of 1 and indicating a strong match.
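A minimal sketch of the cosine calculation over two rating vectors (the pair shown is illustrative, not the article’s 0.935 example):

```python
from math import sqrt

def cosine_similarity(rating1, rating2):
    """Cosine similarity: dot product over the product of magnitudes."""
    dot = sum(a * b for a, b in zip(rating1, rating2))
    mag1 = sqrt(sum(a * a for a in rating1))
    mag2 = sqrt(sum(b * b for b in rating2))
    return dot / (mag1 * mag2) if mag1 and mag2 else 0.0

print(cosine_similarity((4, 2), (5, 5)))  # ≈ 0.949 for this illustrative pair
```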
K‑Nearest Neighbors
In collaborative filtering, the K‑Nearest Neighbors algorithm selects the K users most similar to the target (by Pearson, Manhattan, Euclidean, or cosine similarity) and generates recommendations from their ratings, weighting each neighbor’s contribution by its similarity.
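Putting the pieces together, here is a sketch of such a KNN recommender. The nested‑dict layout and the name knn_recommend are illustrative choices, not from the article; Pearson serves as the similarity measure, and any of the other measures could be swapped in.

```python
from math import sqrt

def pearson(r1, r2):
    """Single-pass Pearson correlation (same formula as the sketch above)."""
    common = [item for item in r1 if item in r2]
    n = len(common)
    if n == 0:
        return 0.0
    sx = sum(r1[i] for i in common)
    sy = sum(r2[i] for i in common)
    sxy = sum(r1[i] * r2[i] for i in common)
    sx2 = sum(r1[i] ** 2 for i in common)
    sy2 = sum(r2[i] ** 2 for i in common)
    den = sqrt(sx2 - sx ** 2 / n) * sqrt(sy2 - sy ** 2 / n)
    return (sxy - sx * sy / n) / den if den else 0.0

def knn_recommend(target, ratings, k=3):
    """Recommend unseen items to `target` from a dict of user -> {item: score},
    weighting each of the k nearest neighbors by its share of total similarity."""
    # Rank every other user by similarity to the target, most similar first.
    neighbors = sorted(
        ((pearson(ratings[target], ratings[u]), u) for u in ratings if u != target),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbors) or 1.0
    scores = {}
    for sim, user in neighbors:
        for item, score in ratings[user].items():
            if item not in ratings[target]:  # only recommend unseen items
                scores[item] = scores.get(item, 0.0) + score * (sim / total)
    # Highest projected score first.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```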
Finally, as a small project, try implementing the distance calculations above and applying a recommendation algorithm to a movie‑rating dataset.
StarRing Big Data Open Lab
Focused on big data technology research, exploring the Big Data era | [email protected]
