
User-Based Collaborative Filtering with Python: A Step-by-Step Guide

This article explains how to implement a user‑based collaborative filtering recommendation system in Python, covering data loading, preprocessing, cosine‑similarity computation, neighbor selection, rating prediction, and generating top‑5 movie recommendations with detailed code examples.

DataFunTalk
Collaborative filtering is a widely used technique in recommendation systems, typically divided into memory‑based (user‑based or item‑based) and model‑based approaches.

The tutorial focuses on a memory‑based user‑based method, where similarity between users is measured using cosine similarity on a rating matrix that is first mean‑centered and cleaned.
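As a quick illustration of the similarity measure (toy vectors, not the article's data), cosine similarity compares two users' rating vectors by angle rather than magnitude:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Two users' ratings over the same three movies (toy data).
u1 = np.array([[4.0, 5.0, 1.0]])
u2 = np.array([[5.0, 4.0, 2.0]])

# Cosine similarity: dot(u1, u2) / (||u1|| * ||u2||), close to 1 for
# users who rate in similar proportions.
sim = cosine_similarity(u1, u2)[0, 0]
print(round(sim, 3))  # ~0.966
```

Mean-centering the matrix first (subtracting each user's average) corrects for users who rate systematically high or low before this comparison is made.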

Data is loaded with pandas from movies.csv, ratings.csv, and tags.csv. Each user's mean rating is computed and merged back into the ratings table to obtain a mean-centered rating column ( adg_rating ).

import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import pairwise_distances

movies = pd.read_csv("data/movies.csv")
ratings = pd.read_csv("data/ratings.csv")
tags = pd.read_csv("data/tags.csv")
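The later code relies on a per-user mean table ( Mean ) and a mean-centered user-item matrix ( final ) that the excerpt above does not construct. A minimal sketch of that preprocessing, using toy data in place of the CSV files (the column names follow the article; the exact construction is an assumption inferred from how they are used):

```python
import pandas as pd

# Toy stand-in for the ratings.csv data.
Ratings = pd.DataFrame({
    'userId':  [1, 1, 2, 2, 3],
    'movieId': [10, 20, 10, 30, 20],
    'rating':  [4.0, 5.0, 3.0, 2.0, 4.0],
})

# Per-user mean rating.
Mean = Ratings.groupby('userId', as_index=False)['rating'].mean()

# Merge the mean back and mean-center each rating (the 'adg_rating' column).
Rating_avg = Ratings.merge(Mean, on='userId', suffixes=('', '_mean'))
Rating_avg['adg_rating'] = Rating_avg['rating'] - Rating_avg['rating_mean']

# User-item matrix of mean-centered ratings; NaN where a user has not rated a movie.
final = Rating_avg.pivot_table(index='userId', columns='movieId', values='adg_rating')
print(final)
```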

Missing values in the user‑item matrix are filled either by the average rating of each movie or by each user’s average rating:

# Replacing NaN by Movie Average
final_movie = final.fillna(final.mean(axis=0))
# Replacing NaN by User Average
final_user = final.apply(lambda row: row.fillna(row.mean()), axis=1)

Cosine similarity is then computed for both filled matrices, with the diagonal set to zero to ignore self‑similarity:

# User similarity when NaNs are replaced by the item (movie) average
cosine = cosine_similarity(final_movie)
np.fill_diagonal(cosine, 0)
similarity_with_movie = pd.DataFrame(cosine, index=final_movie.index)
similarity_with_movie.columns = final_movie.index

# User similarity when NaNs are replaced by the user average
b = cosine_similarity(final_user)
np.fill_diagonal(b, 0)
similarity_with_user = pd.DataFrame(b, index=final_user.index)
similarity_with_user.columns = final_user.index

For each user, the top 30 most similar users are selected as neighbours:

def find_n_neighbours(df, n):
    # For each user (row), keep the ids of the n most similar users.
    return df.apply(lambda x: pd.Series(
        x.sort_values(ascending=False).iloc[:n].index,
        index=[f'top{i}' for i in range(1, n + 1)]), axis=1)

sim_user_30_u = find_n_neighbours(similarity_with_user, 30)
sim_user_30_m = find_n_neighbours(similarity_with_movie, 30)
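On a small similarity matrix the neighbour selection can be checked by hand (toy values, assuming the diagonal has already been zeroed):

```python
import pandas as pd

# Toy 3-user similarity matrix with self-similarity set to zero.
sim = pd.DataFrame(
    [[0.0, 0.9, 0.2],
     [0.9, 0.0, 0.5],
     [0.2, 0.5, 0.0]],
    index=[1, 2, 3], columns=[1, 2, 3],
)

def find_n_neighbours(df, n):
    # For each user (row), keep the ids of the n most similar users.
    return df.apply(lambda x: pd.Series(
        x.sort_values(ascending=False).iloc[:n].index,
        index=[f'top{i}' for i in range(1, n + 1)]), axis=1)

neighbours = find_n_neighbours(sim, 2)
print(neighbours)
# User 1's nearest neighbour is user 2 (similarity 0.9), then user 3.
```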

Rating prediction for a specific user‑item pair uses a weighted sum of neighbour ratings multiplied by similarity weights, added to the user’s average rating:
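Written out (matching the code's logic, with notation of my choosing), the prediction for user u and item i over the neighbour set N(u) is:

```latex
\mathrm{score}(u, i) = \bar{r}_u + \frac{\sum_{v \in N(u)} \mathrm{sim}(u, v)\,\tilde{r}_{v,i}}{\sum_{v \in N(u)} \mathrm{sim}(u, v)}
```

where \(\bar{r}_u\) is user u's average rating and \(\tilde{r}_{v,i}\) is neighbour v's mean-centered ( adg ) rating for item i.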

def User_item_score(user, item):
    # Ids of the user's 30 nearest neighbours.
    a = sim_user_30_m[sim_user_30_m.index == user].values
    b = a.squeeze().tolist()
    # Neighbours' (filled) mean-centered ratings for this item.
    c = final_movie.loc[:, item]
    d = c[c.index.isin(b)]
    f = d[d.notnull()]
    # The target user's average rating.
    avg_user = Mean.loc[Mean['userId'] == user, 'rating'].values[0]
    index = f.index.values.squeeze().tolist()
    corr = similarity_with_movie.loc[user, index]
    fin = pd.concat([f, corr], axis=1)
    fin.columns = ['adg_score', 'correlation']
    # Similarity-weighted average of the neighbours' centered ratings.
    fin['score'] = fin['adg_score'] * fin['correlation']
    final_score = avg_user + (fin['score'].sum() / fin['correlation'].sum())
    return final_score

score = User_item_score(320, 7371)
print("score (u,i) is", score)
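The recommendation step below relies on two structures not built earlier in the article: check , the raw user-item rating matrix, and Movie_user , a mapping from each userId to a comma-joined string of the movieIds that user has rated. A plausible construction (an assumption inferred from how they are used, shown here on toy data):

```python
import pandas as pd

# Toy stand-in for the ratings data.
Ratings = pd.DataFrame({
    'userId':  [1, 1, 2],
    'movieId': [10, 20, 10],
    'rating':  [4.0, 5.0, 3.0],
})

# Raw user-item rating matrix (NaN where unrated).
check = Ratings.pivot_table(index='userId', columns='movieId', values='rating')

# Per-user comma-joined string of rated movieIds.
Movie_user = Ratings.groupby('userId')['movieId'].apply(
    lambda ids: ','.join(ids.astype(str)))
print(Movie_user.loc[1])  # '10,20'
```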

Finally, the top‑5 recommended movies for a given user are obtained by scoring all unseen items and sorting by the predicted score:

def User_item_score1(user):
    # Movies the user has already rated (from the raw rating matrix).
    Movie_seen_by_user = check.columns[check.loc[user].notna()].tolist()
    # Ids of the user's 30 nearest neighbours.
    b = sim_user_30_m.loc[user].values.squeeze().tolist()
    # Candidate movies: everything the neighbours rated, minus what the user has seen.
    d = Movie_user[Movie_user.index.isin(b)]
    l = ','.join(d.values)
    Movies_under_consideration = list(set(l.split(',')) - set(map(str, Movie_seen_by_user)))
    Movies_under_consideration = list(map(int, Movies_under_consideration))
    avg_user = Mean.loc[Mean['userId'] == user, 'rating'].values[0]
    score = []
    for item in Movies_under_consideration:
        # Neighbours' (filled) mean-centered ratings for this candidate movie.
        c = final_movie.loc[:, item]
        f = c[c.index.isin(b)].dropna()
        index = f.index.values.squeeze().tolist()
        corr = similarity_with_movie.loc[user, index]
        fin = pd.concat([f, corr], axis=1)
        fin.columns = ['adg_score', 'correlation']
        # Similarity-weighted average of the neighbours' centered ratings.
        fin['score'] = fin['adg_score'] * fin['correlation']
        final_score = avg_user + (fin['score'].sum() / fin['correlation'].sum())
        score.append(final_score)
    data = pd.DataFrame({'movieId': Movies_under_consideration, 'score': score})
    top_5 = data.sort_values(by='score', ascending=False).head(5)
    Movie_Name = top_5.merge(movies, how='inner', on='movieId')
    return Movie_Name.title.values.tolist()

user = int(input("Enter the user id to whom you want to recommend : "))
predicted_movies = User_item_score1(user)
print("The Recommendations for User Id :", user)
for i in predicted_movies:
    print(i)

The article concludes with a concise workflow: data collection, loading, preprocessing, similarity computation, score prediction, and item recommendation, emphasizing its applicability to large‑scale data scenarios.

Tags: Machine Learning, python, recommendation system, collaborative filtering, cosine similarity, user-based
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
