Building a Vector‑Based Movie Recommendation System with Transformers
This tutorial walks through constructing a movie recommendation engine by downloading a dataset, cleaning and de‑duplicating entries, encoding plot summaries into vectors with transformer models, and performing nearest‑neighbor searches using scikit‑learn, while handling misspellings with Levenshtein distance.
The article demonstrates how to build a simplified movie recommendation system that relies on vector search and transformer‑based embeddings.
Step 1 – Install required libraries
!pip install -U relevanceai
!pip install "tqdm>=4.62.2"
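Later steps also import sentence_transformers, Levenshtein, and scikit-learn; if those packages are not already in the environment, something along these lines should cover them (package names inferred from the imports used below, they are not listed in the original install step):
!pip install -U sentence-transformers python-Levenshtein scikit-learn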
Step 2 – Download the MPST movie‑plot dataset from Kaggle and load it with pandas
import pandas as pd
df = pd.read_csv('mpst_full_data.csv')
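As a quick sanity check (not in the original article), you can confirm that the fields used in the rest of the tutorial are present:
print(df.shape)
print(df.columns.tolist())  # should include at least 'title' and 'plot_synopsis'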
Step 3 – Preprocess the dataset
Sort titles alphabetically and compute Levenshtein distance between consecutive titles to identify near‑duplicates (distance ≤ 3).
Filter out the duplicates and reset the index.
df = df.sort_values('title').reset_index(drop=True)
df['lev'] = None
from Levenshtein import distance
# Mark a row when its title is within edit distance 3 of the next (alphabetically adjacent) title
for a in range(len(df) - 1):
    if distance(df.iloc[a].title, df.iloc[a + 1].title) <= 3:
        df.at[a, 'lev'] = distance(df.iloc[a].title, df.iloc[a + 1].title)
df = df[df['lev'].isnull()].reset_index(drop=True)
# Manual check for famous titles, e.g., remove duplicate "Avengers"
df = df.drop([9572]).reset_index(drop=True)
df.to_csv('mpst_no_duplicates.csv')
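Optionally (this check is mine, not from the article), you can re-run the comparison to see how many adjacent titles still fall within the distance‑3 threshold after filtering:
remaining = sum(distance(df.iloc[a].title, df.iloc[a + 1].title) <= 3 for a in range(len(df) - 1))
print(remaining, 'adjacent near-duplicate pairs left')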
Step 4 – Encode plot synopses into vectors
First, create a list of dictionaries containing the title and plot:
json_files = df[['title', 'plot_synopsis']]
json_files = json_files.reset_index()
json_files.columns = ['_id', 'title', 'plot_synopsis']
json_files = json_files.to_dict(orient='records')
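Each entry in json_files is now a plain dictionary with the three fields set above, which is the shape the encoder below consumes:
print(json_files[0].keys())  # dict_keys(['_id', 'title', 'plot_synopsis'])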
Encode with the relevanceai API using a BERT model:
from vectorhub.encoders.text.sentence_transformers import SentenceTransformer2Vec
model = SentenceTransformer2Vec('bert-base-uncased')
df_json = model.encode_documents(documents=json_files, fields=['plot_synopsis'])
# The new vector field is named plot_synopsis_sentence_transformers_vector_
Alternatively, encode locally with sentence_transformers and tqdm for progress tracking:
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
import numpy as np
tqdm.pandas()  # registers progress_apply on pandas objects
model = SentenceTransformer('all-MiniLM-L6-v2')
df_ = df[['title', 'plot_synopsis']].copy()  # keep only the fields needed for the search index
df_['plot_synopsis'] = df_['plot_synopsis'].progress_apply(lambda x: model.encode(x))
df_index = df_.pop('title')
# Expand each embedding array into one numeric column per dimension
df_ = pd.DataFrame(np.column_stack(list(zip(*df_.values))))
df_.index = df_index
df_.to_csv('mpst_encoded_no_duplicates.csv')
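Encoding row by row with progress_apply is easy to follow but slow. As a variation (not from the original article), sentence_transformers can also encode the whole column in batches; the parameters below are standard SentenceTransformer.encode arguments:
vectors = model.encode(df['plot_synopsis'].tolist(), batch_size=64, show_progress_bar=True)
df_batched = pd.DataFrame(vectors, index=df['title'])
df_batched.to_csv('mpst_encoded_no_duplicates.csv')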
Step 5 – Perform vector search with scikit‑learn
import pandas as pd
from sklearn.neighbors import NearestNeighbors
df_movies_encoded = pd.read_csv('mpst_encoded_no_duplicates.csv')
df_movies_encoded.index = df_movies_encoded.pop('title')
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(df_movies_encoded)
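One design note (my addition, not discussed in the source): ball_tree works with Euclidean distance, while sentence embeddings are usually compared by cosine similarity. L2‑normalizing the rows first makes Euclidean nearest neighbors coincide with cosine nearest neighbors, if that behavior is preferred:
from sklearn.preprocessing import normalize
X = normalize(df_movies_encoded.values)  # unit-length rows: Euclidean ranking matches cosine ranking
nbrs_cosine = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)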
Because the dataset may lack an exact title or contain typos, a helper function uses Levenshtein distance to find the closest existing title before querying the nearest‑neighbor model:
from Levenshtein import distance
def closest_title(title):
    # Return the title in the index with the smallest edit distance to the query
    m = pd.DataFrame(df_movies_encoded.index, columns=['title'])
    m['lev'] = m['title'].apply(lambda x: distance(x, title))
    return m.sort_values('lev').iloc[0]['title']
def find_similar_movies(df, nbrs, title):
    title = closest_title(title)
    distances, indices = nbrs.kneighbors([df.loc[title]])
    # Skip the first neighbor, which is the queried movie itself
    for idx in indices[0][1:]:
        print('title', title, '->', df.iloc[idx].name)
Example usage:
find_similar_movies(df_movies_encoded, nbrs, 'Prince of Egypt')
# Output: The Prince of Egypt -> The Ten Commandments: The Musical
With n_neighbors=2, only the single closest neighbor (besides the queried title itself) is printed; the full system returns the five nearest movies, and the example is trimmed for brevity. The article notes that searching for a title not present in the dataset, such as "The Avengers" when only "Avengers" exists, would fail unless the Levenshtein‑based fallback is applied.
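To actually get the five recommendations mentioned above, refit with a larger n_neighbors; the extra neighbor accounts for the queried movie itself:
nbrs = NearestNeighbors(n_neighbors=6, algorithm='ball_tree').fit(df_movies_encoded)
find_similar_movies(df_movies_encoded, nbrs, 'Prince of Egypt')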
Conclusion
The guide shows a complete pipeline, from data acquisition and cleaning through transformer‑based vector encoding to nearest‑neighbor retrieval, illustrating why vector search has become a core building block of modern recommendation systems.