Building a Vector‑Based Movie Recommendation System with Transformers
This tutorial walks through constructing a movie recommendation engine by downloading a dataset, cleaning and de‑duplicating entries, encoding plot summaries into vectors with transformer models, and performing nearest‑neighbor searches using scikit‑learn, while handling misspellings with Levenshtein distance.
The article demonstrates how to build a simplified movie recommendation system that relies on vector search and transformer‑based embeddings.
Step 1 – Install required libraries
!pip install -U relevanceai
!pip install "tqdm>=4.62.2"
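Later steps also import sentence_transformers, Levenshtein, and scikit-learn; if those packages are not already in the environment, something along these lines should cover them (package names inferred from the imports used below, they are not listed in the original install step):
!pip install -U sentence-transformers python-Levenshtein scikit-learn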
Step 2 – Download the MPST movie‑plot dataset from Kaggle and load it with pandas
import pandas as pd
df = pd.read_csv('mpst_full_data.csv')
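As a quick sanity check (not in the original article), you can confirm that the fields used in the rest of the tutorial are present:
print(df.shape)
print(df.columns.tolist())  # should include at least 'title' and 'plot_synopsis'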
Step 3 – Preprocess the dataset
Sort titles alphabetically and compute Levenshtein distance between consecutive titles to identify near‑duplicates (distance ≤ 3).
Filter out the duplicates and reset the index.
df = df.sort_values('title').reset_index(drop=True)
df['lev'] = None
from Levenshtein import distance
# Mark a row when its title is within edit distance 3 of the next (alphabetically adjacent) title
for a in range(len(df) - 1):
    if distance(df.iloc[a].title, df.iloc[a + 1].title) <= 3:
        df.at[a, 'lev'] = distance(df.iloc[a].title, df.iloc[a + 1].title)
df = df[df['lev'].isnull()].reset_index(drop=True)
# Manual check for famous titles, e.g., remove duplicate "Avengers"
df = df.drop([9572]).reset_index(drop=True)
df.to_csv('mpst_no_duplicates.csv')
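Optionally (this check is mine, not from the article), you can re-run the comparison to see how many adjacent titles still fall within the distance‑3 threshold after filtering:
remaining = sum(distance(df.iloc[a].title, df.iloc[a + 1].title) <= 3 for a in range(len(df) - 1))
print(remaining, 'adjacent near-duplicate pairs left')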
Step 4 – Encode plot synopses into vectors
First, create a list of dictionaries containing the title and plot:
json_files = df[['title', 'plot_synopsis']]
json_files = json_files.reset_index()
json_files.columns = ['_id', 'title', 'plot_synopsis']
json_files = json_files.to_dict(orient='records')
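Each entry in json_files is now a plain dictionary with the three fields set above, which is the shape the encoder below consumes:
print(json_files[0].keys())  # dict_keys(['_id', 'title', 'plot_synopsis'])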
Encode with the relevanceai API using a BERT model:
from vectorhub.encoders.text.sentence_transformers import SentenceTransformer2Vec
model = SentenceTransformer2Vec('bert-base-uncased')
df_json = model.encode_documents(documents=json_files, fields=['plot_synopsis'])
# The new vector field is named plot_synopsis_sentence_transformers_vector_
Alternatively, encode locally with sentence_transformers and tqdm for progress tracking:
from tqdm import tqdm
from sentence_transformers import SentenceTransformer
import numpy as np
tqdm.pandas()  # registers progress_apply on pandas objects
model = SentenceTransformer('all-MiniLM-L6-v2')
df_ = df[['title', 'plot_synopsis']].copy()  # keep only the fields needed for the search index
df_['plot_synopsis'] = df_['plot_synopsis'].progress_apply(lambda x: model.encode(x))
df_index = df_.pop('title')
# Expand each embedding array into one numeric column per dimension
df_ = pd.DataFrame(np.column_stack(list(zip(*df_.values))))
df_.index = df_index
df_.to_csv('mpst_encoded_no_duplicates.csv')
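Encoding row by row with progress_apply is easy to follow but slow. As a variation (not from the original article), sentence_transformers can also encode the whole column in batches; the parameters below are standard SentenceTransformer.encode arguments:
vectors = model.encode(df['plot_synopsis'].tolist(), batch_size=64, show_progress_bar=True)
df_batched = pd.DataFrame(vectors, index=df['title'])
df_batched.to_csv('mpst_encoded_no_duplicates.csv')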
Step 5 – Perform vector search with scikit‑learn
import pandas as pd
from sklearn.neighbors import NearestNeighbors
df_movies_encoded = pd.read_csv('mpst_encoded_no_duplicates.csv')
df_movies_encoded.index = df_movies_encoded.pop('title')
nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(df_movies_encoded)
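One design note (my addition, not discussed in the source): ball_tree works with Euclidean distance, while sentence embeddings are usually compared by cosine similarity. L2‑normalizing the rows first makes Euclidean nearest neighbors coincide with cosine nearest neighbors, if that behavior is preferred:
from sklearn.preprocessing import normalize
X = normalize(df_movies_encoded.values)  # unit-length rows: Euclidean ranking matches cosine ranking
nbrs_cosine = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)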
Because the dataset may lack an exact title or contain typos, a helper function uses Levenshtein distance to find the closest existing title before querying the nearest‑neighbor model:
from Levenshtein import distance
def closest_title(title):
    # Return the title in the index with the smallest edit distance to the query
    m = pd.DataFrame(df_movies_encoded.index, columns=['title'])
    m['lev'] = m['title'].apply(lambda x: distance(x, title))
    return m.sort_values('lev').iloc[0]['title']
def find_similar_movies(df, nbrs, title):
    title = closest_title(title)
    distances, indices = nbrs.kneighbors([df.loc[title]])
    # Skip the first neighbor, which is the queried movie itself
    for idx in indices[0][1:]:
        print('title', title, '->', df.iloc[idx].name)
Example usage:
find_similar_movies(df_movies_encoded, nbrs, 'Prince of Egypt')
# Output: The Prince of Egypt -> The Ten Commandments: The Musical
With n_neighbors=2, only the single closest neighbor (besides the queried title itself) is printed; the full system returns the five nearest movies, and the example is trimmed for brevity. The article notes that searching for a title not present in the dataset, such as "The Avengers" when only "Avengers" exists, would fail unless the Levenshtein‑based fallback is applied.
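To actually get the five recommendations mentioned above, refit with a larger n_neighbors; the extra neighbor accounts for the queried movie itself:
nbrs = NearestNeighbors(n_neighbors=6, algorithm='ball_tree').fit(df_movies_encoded)
find_similar_movies(df_movies_encoded, nbrs, 'Prince of Egypt')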
Conclusion
The guide shows a complete pipeline, from data acquisition and cleaning through transformer‑based vector encoding to nearest‑neighbor retrieval, illustrating why vector search has become a core building block of modern recommendation systems.