Build a Movie Recommendation System with Pearson Correlation in Python

This article walks through a simple user-based movie recommender in Python: collect Douban user ratings, map them onto a five-level scale, use Pearson correlation to find like-minded users, and rank unseen movies by a similarity-weighted average of those users' ratings. Code is included for data handling, similarity calculation, and recommendation generation.

21CTO

Process

Use a web crawler to fetch user rating information from Douban Movie, define multi-level rating categories, and compute the Pearson correlation between users. Similarity can then be analyzed from either a user-centric or an item-centric perspective. The code below works on a small hard-coded sample rather than crawled data.

Movie rating classification

Very Bad

Bad

Average

Recommended

Strongly Recommended
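
The sample data in the code section stores ratings as integers from 1 to 5. The article does not state how the five labels map onto those numbers, so the sketch below assumes they are listed from lowest to highest. `RATING_SCALE` and `label_to_score` are hypothetical names used only for illustration.

```python
# Assumed mapping: labels in the order listed above, scored 1 (worst) to 5 (best).
RATING_SCALE = {
    'Very Bad': 1,
    'Bad': 2,
    'Average': 3,
    'Recommended': 4,
    'Strongly Recommended': 5,
}


def label_to_score(label):
    """Return the numeric score for a rating label, or None if it is unknown."""
    return RATING_SCALE.get(label)


print(label_to_score('Recommended'))  # 4
```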

Code

Data entry and storage:

# -*- coding: utf-8 -*-
# Written for Python 3, where str is Unicode by default, so the Python 2
# reload(sys) / sys.setdefaultencoding("utf-8") workaround is not needed.
import json  # used later to print the recommendations

# Nested dictionary of ratings: {user name: {movie title: rating}}
user_info = {}

# Sample user data: ratings (1-5) for five movies, in a fixed order
user_dict = {
    'ns2250225':[4,3,4,5,4],
    'justin':[3,4,3,4,2],
    'totox':[2,3,5,1,4],
    'fabrice':[4,1,3,4,5],
    'doreen':[3,4,2,5,3]
}

def user_data(user_dict):
    """Copy each user's rating list into user_info, keyed by movie title."""
    for name in user_dict:
        user_info[name] = {}
        user_info[name][u'消失的爱人'] = user_dict[name][0]
        user_info[name][u'霍比特人3'] = user_dict[name][1]
        user_info[name][u'神去村'] = user_dict[name][2]
        user_info[name][u'泰坦尼克号'] = user_dict[name][3]
        user_info[name][u'这个杀手不太冷'] = user_dict[name][4]

user_data(user_dict)

# Save to file: the user name, then one tab-separated movie/rating pair per line
try:
    with open('user_data.txt', 'w', encoding='utf-8') as data:
        for key in user_info:
            data.write(key)
            for key2 in user_info[key]:
                data.write('\t')
                data.write(key2)
                data.write('\t')
                data.write(str(user_info[key][key2]))
                data.write('\n')
except IOError as err:
    print('File error: ' + str(err))
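
As an alternative to the hand-written tab-separated format, the same nested dictionary could be round-tripped with the standard `json` module, which also makes it easy to load the data back. This is a sketch and not part of the original article: the file name `user_data.json` and the helpers `save_prefs` and `load_prefs` are assumptions, and the example writes to a temporary directory.

```python
import json
import os
import tempfile


def save_prefs(prefs, path):
    """Write {user: {movie: rating}} to `path` as UTF-8 JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps the Chinese titles readable in the file
        json.dump(prefs, f, ensure_ascii=False, indent=2)


def load_prefs(path):
    """Read the nested ratings dictionary back from `path`."""
    with open(path, 'r', encoding='utf-8') as f:
        return json.load(f)


sample = {'justin': {'消失的爱人': 3, '霍比特人3': 4}}
path = os.path.join(tempfile.mkdtemp(), 'user_data.json')
save_prefs(sample, path)
restored = load_prefs(path)
print(restored == sample)  # True
```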

Compute Pearson correlation and find similar users

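from math import sqrt

def sim_pearson(prefs, p1, p2):
    """Pearson correlation between p1 and p2 over the movies both have rated."""
    # Collect the movies rated by both users
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]:
            si[item] = 1
    if len(si) == 0:
        return 0
    # Use a float so the divisions below are never truncated to integers
    n = float(len(si))
    # Sums of ratings, sums of squared ratings, and sum of products
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    # A zero denominator means one user gave identical ratings; treat as no correlation
    if den == 0:
        return 0
    return num / den

# Insert my own data
user_info['me'] = {
    u'消失的爱人':5,
    u'神去村':3,
    u'炸裂鼓手':5
}
for user in user_info:
    if user == 'me':
        continue  # do not compare 'me' with itself
    res = sim_pearson(user_info, 'me', user)
    if res > 0:
        print('user similar to %s: %s' % ('me', user))
        print('result: %f' % res)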
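
sim_pearson uses the sum-based form of the Pearson coefficient. The sketch below computes the same quantity from the mean-centered definition, r = Σ(x - x̄)(y - ȳ) / sqrt(Σ(x - x̄)² · Σ(y - ȳ)²), which can serve as a cross-check. It also adds a minimum-overlap guard that is not in the original code: with only two shared movies the coefficient can only be +1, -1, or undefined, so a stricter threshold may be worth considering. The function name `pearson` and the `min_overlap` parameter are hypothetical.

```python
from math import sqrt


def pearson(xs, ys, min_overlap=2):
    """Mean-centered Pearson correlation of two equal-length rating lists.

    Returns 0.0 when there are fewer than `min_overlap` pairs or when
    either list has no variance (the coefficient is undefined then).
    """
    if len(xs) != len(ys):
        raise ValueError('rating lists must have the same length')
    n = len(xs)
    if n < min_overlap:
        return 0.0
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    den = sqrt(var_x * var_y)
    if den == 0:
        return 0.0
    return cov / den


# Ratings for the two movies 'me' shares with the sample users
# (消失的爱人, 神去村), taken from the sample data above.
me = [5, 3]
fabrice = [4, 3]
totox = [2, 5]
ns2250225 = [4, 4]

print(pearson(me, fabrice))                 # 1.0
print(pearson(me, totox))                   # -1.0
print(pearson(me, ns2250225))               # 0.0 (no variance in the second list)
print(pearson(me, fabrice, min_overlap=3))  # 0.0 (too few shared movies)
```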

Recommend movies (weighted average)

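def getRecommendations(prefs, person, similarity=sim_pearson):
    """Rank movies `person` has not rated by a similarity-weighted average."""
    totals = {}   # movie -> sum of (rating * similarity)
    simSums = {}  # movie -> sum of similarities of the users who rated it
    for other in prefs:
        # Do not compare the person with themselves
        if other == person:
            continue
        sim = similarity(prefs, person, other)
        # Ignore users with zero or negative correlation
        if sim <= 0:
            continue
        for item in prefs[other]:
            # Only score movies the person has not rated yet
            if item not in prefs[person] or prefs[person][item] == 0:
                totals.setdefault(item, 0)
                totals[item] += prefs[other][item] * sim
                simSums.setdefault(item, 0)
                simSums[item] += sim
    # Normalize each total by the similarity sum to get a predicted rating
    rankings = [(total / simSums[item], item) for item, total in totals.items()]
    # Highest predicted rating first
    rankings.sort(reverse=True)
    return rankings

res = getRecommendations(user_info, "me")
print('Recommend watching the movie:')
print(json.dumps(res, ensure_ascii=False))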
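
To make the weighted average concrete, the sketch below reproduces the predicted scores by hand. It assumes the only positively correlated users are fabrice and doreen, each with a similarity of 1.0 to 'me', which is what the sample data yields when the divisions are not truncated. `weighted_score` is a hypothetical helper, not part of the original code.

```python
def weighted_score(ratings_and_sims):
    """Similarity-weighted average of (rating, similarity) pairs."""
    total = sum(rating * sim for rating, sim in ratings_and_sims)
    sim_sum = sum(sim for _, sim in ratings_and_sims)
    if sim_sum == 0:
        return 0.0
    return total / float(sim_sum)


# (rating, similarity) pairs from fabrice and doreen in the sample data
titanic = [(4, 1.0), (5, 1.0)]   # 泰坦尼克号
leon = [(5, 1.0), (3, 1.0)]      # 这个杀手不太冷
hobbit = [(1, 1.0), (4, 1.0)]    # 霍比特人3

print(weighted_score(titanic))  # 4.5
print(weighted_score(leon))     # 4.0
print(weighted_score(hobbit))   # 2.5
```

When the similarities differ, the more similar user's rating pulls the prediction toward it, which is the point of weighting rather than taking a plain mean.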

Results and analysis

Users with similar taste: doreen, fabrice

Top recommended movies for me: Titanic (泰坦尼克号) and Léon: The Professional (这个杀手不太冷)

From a user‑centric view, finding like‑minded people helps discover potential interests.

From an item‑centric view, finding similar items helps identify potential customers.
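
The item-centric view can reuse the same similarity function by flipping the preference dictionary so that movies become the outer keys. A minimal sketch, assuming the {user: {movie: rating}} layout used above; `transform_prefs` is a hypothetical helper that does not appear in the original article, and the English titles are used here only for brevity.

```python
def transform_prefs(prefs):
    """Turn {user: {item: rating}} into {item: {user: rating}}."""
    result = {}
    for person, ratings in prefs.items():
        for item, rating in ratings.items():
            result.setdefault(item, {})[person] = rating
    return result


prefs = {
    'fabrice': {'Titanic': 4, 'Leon': 5},
    'doreen': {'Titanic': 5, 'Leon': 3},
}
by_movie = transform_prefs(prefs)
print(by_movie['Titanic'])  # {'fabrice': 4, 'doreen': 5}
```

Passing the flipped dictionary to sim_pearson would then compare two movies over the users who rated both, instead of two users over the movies they share.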

Source: Data Mining: Introduction and Practice
Written by

21CTO
