User-Based Collaborative Filtering Recommendation Algorithm Explained
This article introduces the concept and history of recommendation algorithms, outlines the basic conditions for recommendations, and provides a detailed explanation of user-based collaborative filtering, including similarity calculations, neighbor selection, recommendation scoring, practical code snippets, and discussion of potential issues.
Recommendation algorithms were first proposed in 1992 but gained popularity with the explosion of internet data, enabling personalized suggestions when users cannot explicitly state their preferences.
The basic approaches to recommendation include collaborative filtering based on similar users, item-similarity methods, keyword (content) matching, and hybrid combinations of these.
User-based collaborative filtering treats users as the primary entity, recommending items liked by users with similar tastes, as opposed to item-based methods that focus on item similarity.
Similarity Calculation – The article mentions classic similarity measures such as the Jaccard index and cosine similarity, ultimately using cosine similarity to compute user‑user similarity.
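A minimal sketch of that similarity, assuming binary like/dislike profiles (each user reduced to a set of liked item ids, rather than the rating tuples the article's code carries):

```python
import math

def cosine_similarity(items_a, items_b):
    """Cosine similarity between two users' liked-item lists.

    With binary (like-only) profiles this reduces to
    |A intersect B| / sqrt(|A| * |B|).
    """
    if not items_a or not items_b:
        return 0.0
    overlap = len(set(items_a) & set(items_b))
    return overlap / math.sqrt(len(items_a) * len(items_b))

# Two users sharing one liked movie out of two each:
# cosine_similarity([1, 2], [2, 3]) == 1 / sqrt(2 * 2) == 0.5
```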
Finding the K Nearest Neighbors – By comparing the target user with all others, the algorithm selects the K most similar users (the "good friends") to form the recommendation pool, often using an inverted index (item‑to‑user table) to prune irrelevant users.
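The inverted-index pruning can be sketched as follows; the function name and the set-based dictionaries are illustrative assumptions, not the article's exact code:

```python
import math

def nearest_neighbors(user_id, user_dict, item_to_user, k):
    """Return the K most similar users as (similarity, user_id) pairs.

    user_dict:    dict[user_id] -> set of liked item ids
    item_to_user: dict[item_id] -> list of user ids who liked the item
    The inverted index prunes the search: only users who share at
    least one item with the target are ever scored.
    """
    target_items = user_dict[user_id]
    candidates = set()
    for item in target_items:
        candidates.update(item_to_user[item])  # co-raters only
    candidates.discard(user_id)                # never recommend to yourself

    scored = []
    for other in candidates:
        overlap = len(target_items & user_dict[other])
        sim = overlap / math.sqrt(len(target_items) * len(user_dict[other]))
        scored.append((sim, other))
    scored.sort(reverse=True)                  # most similar first
    return scored[:k]
```

Users with no item in common with the target are never visited, which is the main saving over comparing against every user.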
Scoring Recommendations – Each neighbor’s similarity weight is multiplied by the presence of an item in their profile, aggregating scores across neighbors to rank items (e.g., soap receives the highest score in the example).
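The aggregation step can be sketched like this, again assuming set-based liked-item profiles (a simplification of the article's rating tuples); items the target user already knows are skipped:

```python
def score_items(user_id, neighbors, user_dict):
    """Aggregate recommendation scores across the K nearest neighbors.

    neighbors: list of (similarity, neighbor_id) pairs
    Every item a neighbor liked earns that neighbor's similarity as
    weight; scores add up across neighbors, so an item liked by several
    similar users ranks highest.
    """
    seen = user_dict[user_id]
    scores = {}
    for sim, neighbor_id in neighbors:
        for item in user_dict[neighbor_id]:
            if item in seen:
                continue  # don't recommend what the user already has
            scores[item] = scores.get(item, 0.0) + sim
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```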
Algorithm Summary
Compute similarity between the target user and others, optionally using an inverted index to ignore unrelated users.
Select the top‑K most similar neighbors.
For each item liked by these neighbors, calculate a recommendation score weighted by similarity.
Rank items by score and present the top recommendations.
The article also discusses practical issues such as popular items dominating recommendations and the need to filter or re‑weight such "dirty data" during preprocessing.
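One common re-weighting heuristic for this problem (an inverse-user-frequency penalty; the article only calls for filtering or re-weighting, so this specific formula is an assumption) divides each item's contribution by the log of its popularity:

```python
import math

def popularity_weight(item_id, item_to_user):
    """Down-weight popular items so blockbusters don't dominate.

    Dividing by log(1 + number of users who liked the item) means an
    item that nearly everyone likes contributes little evidence of
    shared taste, while a niche item shared by two users counts a lot.
    """
    return 1.0 / math.log(1 + len(item_to_user[item_id]))
```

Multiplying each co-occurrence by this weight during similarity or scoring keeps universally liked titles from crowding out genuinely personalized picks.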
Practical Example – Using the MovieLens dataset, the author demonstrates how to preprocess ratings (treating scores >3 as likes), build user and item dictionaries, find neighbors, and generate a recommendation list. The following code snippet illustrates the core steps:
# Read the raw ratings data from file
test_contents = readFile(file_name)
# Parse into a 2-D list: [[user_id, movie_id, rating], ...]
test_rates = getRatingInformation(test_contents)
# Build two dictionaries:
# 1. User dictionary:              dic[user_id] = [(movie_id, rating), ...]
# 2. Movie-to-user inverted index: dic[movie_id] = [user_id1, user_id2, ...]
test_dic, test_item_to_user = createUserRankDic(test_rates)
# Find the K nearest neighbors; each entry is a (similarity, user_id) pair
neighbors = calcNearestNeighbor(userid, test_dic, test_item_to_user)[:k]
# Accumulate recommendation scores, weighted by neighbor similarity
recommend_dic = {}
for neighbor in neighbors:
    neighbor_user_id = neighbor[1]
    movies = test_dic[neighbor_user_id]
    for movie in movies:
        if movie[0] not in recommend_dic:
            recommend_dic[movie[0]] = neighbor[0]
        else:
            recommend_dic[movie[0]] += neighbor[0]
# Build the ranked recommendation list, highest score first
recommend_list = []
for key in recommend_dic:
    recommend_list.append([recommend_dic[key], key])
recommend_list.sort(reverse=True)

Running this pipeline on a sample user yields a list of recommended movies (e.g., "Titanic", "Star Wars", etc.), illustrating how popular titles can appear as recommendations unless filtered out.