Build a Movie Recommendation System with Pearson Correlation in Python
This article demonstrates a Python-based movie recommendation approach that crawls Douban user data, categorizes ratings, computes Pearson correlation to identify like‑minded users, and generates weighted movie suggestions, complete with code snippets for data handling, similarity calculation, and recommendation generation.
Process
Use a web crawler to fetch Douban movie user information, define multi‑level rating categories, compute the Pearson correlation between users, and analyze similarity from either a user‑centric or an item‑centric perspective.
Movie rating classification
Very Bad
Bad
Average
Recommended
Strongly Recommended
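The sample data used in the code stores each rating as an integer from 1 to 5. The article does not spell out how the five categories map onto that scale, so the sketch below simply assumes they are numbered in the order listed; `RATING_LABELS` and `label_for` are illustrative names, not part of the original code.

```python
# Assumed mapping: categories numbered 1-5 in the order the article lists them.
RATING_LABELS = {
    1: 'Very Bad',
    2: 'Bad',
    3: 'Average',
    4: 'Recommended',
    5: 'Strongly Recommended',
}


def label_for(score):
    """Return the category name for a score, or raise ValueError if out of range."""
    if score not in RATING_LABELS:
        raise ValueError('score must be an integer from 1 to 5, got %r' % (score,))
    return RATING_LABELS[score]


print(label_for(4))  # Recommended
```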
Code
Data entry and storage:
# -*- coding: utf-8 -*-
# Written for Python 3, where strings are Unicode by default, so the
# Python 2 reload(sys) / sys.setdefaultencoding("utf-8") workaround is dropped.
import json

user_info = {}

# Sample user data: five ratings (1-5) per user, one per movie
user_dict = {
    'ns2250225': [4, 3, 4, 5, 4],
    'justin': [3, 4, 3, 4, 2],
    'totox': [2, 3, 5, 1, 4],
    'fabrice': [4, 1, 3, 4, 5],
    'doreen': [3, 4, 2, 5, 3]
}


def user_data(user_dict):
    # Expand each rating list into a {movie title: rating} dictionary
    for name in user_dict:
        user_info[name] = {}
        user_info[name][u'消失的爱人'] = user_dict[name][0]
        user_info[name][u'霍比特人3'] = user_dict[name][1]
        user_info[name][u'神去村'] = user_dict[name][2]
        user_info[name][u'泰坦尼克号'] = user_dict[name][3]
        user_info[name][u'这个杀手不太冷'] = user_dict[name][4]


user_data(user_dict)

# Save to file: one tab-separated line per user (name, then title/rating pairs)
try:
    with open('user_data.txt', 'w', encoding='utf-8') as data:
        for key in user_info:
            data.write(key)
            for key2 in user_info[key]:
                data.write('\t')
                data.write(key2)
                data.write('\t')
                data.write(str(user_info[key][key2]))
            data.write('\n')
except IOError as err:
    print('File error: ' + str(err))
Compute Pearson correlation and find similar users
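For reference, the single-pass form of the Pearson correlation that the `sim_pearson` function in this section computes can be written as follows. The symbols are notation chosen here, not the article's: x_i and y_i are the two users' ratings of the i-th movie they have both rated, and n is the number of such shared movies.

```latex
r_{xy} =
\frac{\sum_{i} x_i y_i - \frac{1}{n}\sum_{i} x_i \sum_{i} y_i}
     {\sqrt{\left(\sum_{i} x_i^2 - \frac{1}{n}\Bigl(\sum_{i} x_i\Bigr)^2\right)
            \left(\sum_{i} y_i^2 - \frac{1}{n}\Bigl(\sum_{i} y_i\Bigr)^2\right)}}
```

One consequence worth keeping in mind: when two users share only two movies, r can only be +1 or -1, or is undefined when one user gave both movies the same rating (the denominator is zero, which the code treats as a similarity of 0). Scores based on so few shared ratings should be read with caution.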
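from math import sqrt


def sim_pearson(prefs, p1, p2):
    # Collect the items rated by both users
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    if len(si) == 0:
        return 0
    # Use a float so the divisions below are never truncated to integers
    n = float(len(si))
    # Sums of ratings, sums of squares, and sum of products over shared items
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    # A zero denominator means one user gave identical ratings to all shared items
    if den == 0:
        return 0
    return num / den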
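# Insert my own data
user_info['me'] = {
    u'消失的爱人': 5,
    u'神去村': 3,
    u'炸裂鼓手': 5
}

# Report every other user whose correlation with 'me' is positive
for user in user_info:
    if user == 'me':
        continue  # comparing 'me' with itself would always give 1.0
    res = sim_pearson(user_info, 'me', user)
    if res > 0:
        print('user similar to %s: %s' % ('me', user))
        print('result: %f' % res)
Recommend movies (weighted average)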
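The weighted average scores each unseen movie as the sum of (similarity × rating) over the similar users who rated it, divided by the sum of those similarities, so that closer neighbours count for more. A minimal sketch of that calculation for a single movie, using a hypothetical helper `weighted_score` that is not part of the article's code:

```python
def weighted_score(neighbours):
    """Similarity-weighted average rating for one movie.

    neighbours: list of (similarity, rating) pairs, one per similar user.
    """
    total = sum(sim * rating for sim, rating in neighbours)
    sim_sum = sum(sim for sim, _ in neighbours)
    return total / sim_sum if sim_sum else 0.0


# Two equally similar users who rated a movie 4 and 5
print(weighted_score([(1.0, 4), (1.0, 5)]))  # 4.5
# A closer neighbour (weight 1.0) pulls the score towards their rating of 1
print(weighted_score([(0.5, 4), (1.0, 1)]))  # 2.0
```

When all neighbours are equally similar, the weights cancel and the result is the plain mean of their ratings.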
def getRecommendations(prefs, person, similarity=sim_pearson):
    totals = {}   # item -> sum of (similarity * rating)
    simSums = {}  # item -> sum of similarities of the users who rated it
    for other in prefs:
        # Do not compare the person with themselves
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        # Ignore users with zero or negative correlation
        if sim <= 0:
            continue
        for item in prefs[other]:
            # Only score movies the person has not rated yet
            if item not in prefs[person] or prefs[person][item] == 0:
                totals.setdefault(item, 0)
                totals[item] += prefs[other][item] * sim
                simSums.setdefault(item, 0)
                simSums[item] += sim
    # Normalize each total by the sum of similarities, highest score first
    rankings = [(total / simSums[item], item) for item, total in totals.items()]
    rankings.sort(reverse=True)
    return rankings


res = getRecommendations(user_info, "me")
print('Recommend watching the movie:')
print(json.dumps(res, ensure_ascii=False))
Results and analysis
Users with similar taste: doreen, fabrice
Recommended movies for me: Titanic (泰坦尼克号), Léon: The Professional (这个杀手不太冷)
From a user‑centric view, finding like‑minded people helps discover potential interests.
From an item‑centric view, finding similar items helps identify potential customers.
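The item‑centric view can reuse the same similarity function once the ratings dictionary is flipped from user → movie → rating to movie → user → rating. The sketch below shows one way to do that flip; `transform_prefs` and the tiny `sample` dataset are illustrative and do not come from the article's code or crawl.

```python
def transform_prefs(prefs):
    """Flip {user: {item: rating}} into {item: {user: rating}}."""
    flipped = {}
    for person, ratings in prefs.items():
        for item, score in ratings.items():
            flipped.setdefault(item, {})[person] = score
    return flipped


# Tiny illustrative dataset (hypothetical users and titles)
sample = {
    'alice': {'Movie A': 5, 'Movie B': 3},
    'bob': {'Movie A': 4},
}
by_item = transform_prefs(sample)
print(by_item['Movie A'])  # {'alice': 5, 'bob': 4}
```

Passing the flipped dictionary to `sim_pearson` with two movie titles in place of two user names would then compare movies by the users who rated them.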
Source: Data Mining: Introduction and Practice
21CTO
