User-Based Collaborative Filtering with Python: A Step-by-Step Guide
This article explains how to implement a user‑based collaborative filtering recommendation system in Python, covering data loading, preprocessing, cosine‑similarity computation, neighbor selection, rating prediction, and generating top‑5 movie recommendations with detailed code examples.
Collaborative filtering is a widely used technique in recommendation systems, typically divided into memory‑based (user‑based or item‑based) and model‑based approaches.
The tutorial focuses on a memory‑based user‑based method, where similarity between users is measured using cosine similarity on a rating matrix that is first mean‑centered and cleaned.
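In symbols (notation added here for clarity; the article itself gives only code): with r_{u,i} the rating of user u for item i and r̄_u user u's mean rating, the two quantities computed below are

```latex
% Cosine similarity between the (filled) rating vectors of users u and v
\mathrm{sim}(u,v) = \frac{\sum_i r_{u,i}\, r_{v,i}}
                         {\sqrt{\sum_i r_{u,i}^2}\;\sqrt{\sum_i r_{v,i}^2}}

% Predicted score: the user's mean plus a similarity-weighted average
% of the neighbours' mean-centered (adjusted) ratings
\hat r_{u,i} = \bar r_u + \frac{\sum_{v \in N(u)} \mathrm{sim}(u,v)\,\bigl(r_{v,i} - \bar r_v\bigr)}
                               {\sum_{v \in N(u)} \mathrm{sim}(u,v)}
```

The second formula is exactly what the prediction functions later in the article implement: avg_user plus the sum of adg_score × correlation divided by the sum of correlations.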
Data is loaded with pandas from movies.csv, ratings.csv, and tags.csv. Per-user mean ratings are computed and merged back into the ratings table to obtain an adjusted (mean-centered) rating column (adg_rating).
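The later snippets rely on several frames the article never shows being built (Mean, final, check, Movie_user). A plausible reconstruction on a toy ratings frame, with column names assumed to follow the MovieLens ratings.csv schema (userId, movieId, rating), might look like this:

```python
import pandas as pd

# Toy ratings frame standing in for ratings.csv (userId, movieId, rating)
Ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 3.0, 5.0, 2.0, 1.0],
})

# Per-user mean rating (the `Mean` frame used later for `avg_user`)
Mean = Ratings.groupby("userId", as_index=False)["rating"].mean()

# Merge the means back and mean-center each rating -> adg_rating
Rating_avg = Ratings.merge(Mean, on="userId", suffixes=("", "_mean"))
Rating_avg["adg_rating"] = Rating_avg["rating"] - Rating_avg["rating_mean"]

# `check`: raw user-item matrix (used later to find movies a user has seen)
check = Ratings.pivot(index="userId", columns="movieId", values="rating")

# `final`: mean-centered user-item matrix fed into the similarity step
final = Rating_avg.pivot(index="userId", columns="movieId", values="adg_rating")

# `Movie_user`: per user, a comma-joined string of watched movieIds
Movie_user = Rating_avg.groupby("userId")["movieId"].apply(
    lambda ids: ",".join(map(str, ids)))

print(final)
```

This is a sketch under assumed schemas, not the article's own preprocessing code; the real tutorial builds the same frames from the full MovieLens files.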
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
movies = pd.read_csv("data/movies.csv")
Ratings = pd.read_csv("data/ratings.csv")
Tags = pd.read_csv("data/tags.csv")

Missing values in the user-item matrix are filled either by the average rating of each movie or by each user's average rating:
# Replacing NaN by Movie Average
final_movie = final.fillna(final.mean(axis=0))
# Replacing NaN by User Average
final_user = final.apply(lambda row: row.fillna(row.mean()), axis=1)

Cosine similarity is then computed for both filled matrices, with the diagonal set to zero to ignore self-similarity:
# User similarity when NaN is replaced by the item (movie) average
cosine = cosine_similarity(final_movie)
np.fill_diagonal(cosine, 0)
similarity_with_movie = pd.DataFrame(cosine, index=final_movie.index)
similarity_with_movie.columns = final_movie.index
# User similarity when NaN is replaced by the user average
b = cosine_similarity(final_user)
np.fill_diagonal(b, 0)
similarity_with_user = pd.DataFrame(b, index=final_user.index)
similarity_with_user.columns = final_user.index

For each user, the 30 most similar users are then selected as neighbours (from either similarity matrix):
def find_n_neighbours(df, n):
    # For each row, keep the ids of the n users with the highest similarity
    df = df.apply(
        lambda x: pd.Series(
            x.sort_values(ascending=False).iloc[:n].index,
            index=[f'top{i}' for i in range(1, n + 1)]),
        axis=1)
    return df

sim_user_30_u = find_n_neighbours(similarity_with_user, 30)
sim_user_30_m = find_n_neighbours(similarity_with_movie, 30)

Rating prediction for a specific user-item pair adds a similarity-weighted average of the neighbours' mean-centered ratings to the user's own average rating:
def User_item_score(user, item):
    # ids of the user's 30 nearest neighbours
    a = sim_user_30_m[sim_user_30_m.index == user].values
    b = a.squeeze().tolist()
    # neighbours' mean-centered ratings for this item, NaNs dropped
    c = final_movie.loc[:, item]
    d = c[c.index.isin(b)]
    f = d[d.notnull()]
    # the target user's average rating
    avg_user = Mean.loc[Mean['userId'] == user, 'rating'].values[0]
    # similarity weights of the neighbours that actually rated the item
    index = f.index.values.squeeze().tolist()
    corr = similarity_with_movie.loc[user, index]
    fin = pd.concat([f, corr], axis=1)
    fin.columns = ['adg_score', 'correlation']
    # weighted sum of adjusted scores, normalised by the total similarity
    fin['score'] = fin['adg_score'] * fin['correlation']
    final_score = avg_user + (fin['score'].sum() / fin['correlation'].sum())
    return final_score

score = User_item_score(320, 7371)
print("score (u,i) is", score)

Finally, the top-5 recommended movies for a given user are obtained by scoring all unseen items and sorting by the predicted score:
def User_item_score1(user):
    # movies the user has already rated
    Movie_seen_by_user = check.columns[check.loc[user].notna()].tolist()
    # ids of the user's 30 nearest neighbours
    b = sim_user_30_m.loc[user].values.squeeze().tolist()
    # movies watched by those neighbours, minus the ones already seen
    d = Movie_user[Movie_user.index.isin(b)]
    l = ','.join(d.values)
    Movies_under_consideration = list(
        set(l.split(',')) - set(map(str, Movie_seen_by_user)))
    Movies_under_consideration = list(map(int, Movies_under_consideration))
    score = []
    for item in Movies_under_consideration:
        # same weighted-average prediction as in User_item_score
        c = final_movie.loc[:, item]
        d = c[c.index.isin(b)]
        f = d[d.notnull()]
        avg_user = Mean.loc[Mean['userId'] == user, 'rating'].values[0]
        index = f.index.values.squeeze().tolist()
        corr = similarity_with_movie.loc[user, index]
        fin = pd.concat([f, corr], axis=1)
        fin.columns = ['adg_score', 'correlation']
        fin['score'] = fin['adg_score'] * fin['correlation']
        final_score = avg_user + (fin['score'].sum() / fin['correlation'].sum())
        score.append(final_score)
    data = pd.DataFrame({'movieId': Movies_under_consideration, 'score': score})
    top_5 = data.sort_values(by='score', ascending=False).head(5)
    Movie_Name = top_5.merge(movies, how='inner', on='movieId')
    return Movie_Name.title.values.tolist()

user = int(input("Enter the user id to whom you want to recommend : "))
predicted_movies = User_item_score1(user)
print("The Recommendations for User Id :", user)
for i in predicted_movies:
    print(i)

The article concludes with a concise workflow: data collection, loading, preprocessing, similarity computation, score prediction, and item recommendation, emphasizing that the same steps apply to large-scale data scenarios.
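That whole pipeline (mean-centering, cosine similarity, similarity-weighted prediction) can also be condensed into a minimal self-contained sketch; the helper names and the toy matrix here are illustrative, not from the article:

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy user-item matrix; NaN means "not rated"
ratings = pd.DataFrame(
    [[4.0, 3.0, np.nan],
     [5.0, np.nan, 2.0],
     [np.nan, 1.0, 4.0]],
    index=[1, 2, 3], columns=[10, 20, 30])

user_mean = ratings.mean(axis=1)
centered = ratings.sub(user_mean, axis=0)   # mean-center each user's ratings
filled = centered.fillna(0.0)               # unrated items -> 0 after centering

sim = cosine_similarity(filled)
np.fill_diagonal(sim, 0.0)                  # ignore self-similarity
sim = pd.DataFrame(sim, index=ratings.index, columns=ratings.index)

def predict(user, item):
    """User's mean plus the similarity-weighted average of the
    neighbours' mean-centered ratings for `item`."""
    neighbours = centered[item].dropna().index.difference([user])
    w = sim.loc[user, neighbours]
    if np.isclose(w.sum(), 0.0):            # no usable neighbours
        return user_mean[user]
    return user_mean[user] + (centered.loc[neighbours, item] * w).sum() / w.sum()

print(predict(1, 30))
```

Unlike the article's version, this sketch uses all raters of the item as neighbours rather than a fixed top-30, which is the same formula in the small-data limit.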
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.