Introducing DeepMatch: An Open‑Source Library for Deep Retrieval Matching Algorithms
DeepMatch is an open‑source Python library that implements several mainstream deep‑learning based recall‑matching algorithms, provides easy installation via pip, detailed usage examples with code, and supports exporting user and item vectors for ANN search, making it ideal for rapid experimentation and learning in recommendation systems.
The project background explains the two‑stage architecture of modern recommendation and advertising systems (recall followed by ranking) and describes the author’s experience building a recommendation system with vector‑based recall, which motivated the creation of DeepMatch.
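The recall-then-ranking split can be illustrated with a toy sketch. This is not DeepMatch code: the item vectors, the user vector, and the ranking scores are made-up values, and the helper names `recall_stage` and `ranking_stage` are hypothetical. It only shows the shape of the pipeline, assuming recall is done by inner product between a user vector and item vectors.

```python
# Toy two-stage pipeline: recall narrows the catalogue, ranking orders the survivors.
# All vectors, item ids, and ranking scores below are invented for illustration.
import heapq


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def recall_stage(user_vec, item_vecs, k):
    """Return the ids of the k items with the largest inner product with user_vec."""
    return heapq.nlargest(
        k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id])
    )


def ranking_stage(candidates, score_fn):
    """Order the recalled candidates with a (usually heavier) scoring function."""
    return sorted(candidates, key=score_fn, reverse=True)


item_vecs = {
    "a": [1.0, 0.0],
    "b": [0.8, 0.6],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
user_vec = [1.0, 0.2]

# Stage 1: cheap vector recall keeps only the 2 closest items.
candidates = recall_stage(user_vec, item_vecs, k=2)

# Stage 2: hypothetical ranking scores, standing in for a trained ranking model.
ranking_scores = {"a": 0.3, "b": 0.9, "c": 0.5, "d": 0.1}
ranked = ranking_stage(candidates, ranking_scores.get)

print(candidates, ranked)
```

The recall stage only has to be good enough to keep relevant items in the candidate set, which is why it can use a simple vector similarity; the ranking stage then reorders that small set with a more expensive model.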
Installation is straightforward via pip:
    pip install -U deepmatch
Documentation and examples: https://deepmatch.readthedocs.io/en/latest/
Repository: https://github.com/shenweichen/DeepMatch
An example using the YoutubeDNN model demonstrates how to train a recall model, export user and item embeddings, and optionally perform ANN search with Faiss. The full example code is shown below:
import pandas as pd
from deepctr.inputs import SparseFeat, VarLenSparseFeat
from preprocess import gen_data_set, gen_model_input
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.models import Model
from deepmatch.models import *
from deepmatch.utils import sampledsoftmaxloss
# Load data
data = pd.read_csv("./movielens_sample.txt")
# Define sparse features
sparse_features = ["movie_id", "user_id", "gender", "age", "occupation", "zip"]
SEQ_LEN = 50
negsample = 0
# Encode features
features = ['user_id', 'movie_id', 'gender', 'age', 'occupation', 'zip']
feature_max_idx = {}
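for feature in features:
    lbe = LabelEncoder()
    # Shift encoded ids by 1 so that index 0 is never used by a real value
    data[feature] = lbe.fit_transform(data[feature]) + 1
    # Vocabulary size for the embedding table: largest id + 1
    feature_max_idx[feature] = data[feature].max() + 1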
# Build user and item profiles
user_profile = data[["user_id", "gender", "age", "occupation", "zip"]].drop_duplicates('user_id')
item_profile = data[["movie_id"]].drop_duplicates('movie_id')
user_profile.set_index("user_id", inplace=True)
user_item_list = data.groupby("user_id")["movie_id"].apply(list)
# Generate train/test sets
train_set, test_set = gen_data_set(data, negsample)
train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)
test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)
# Feature columns
embedding_dim = 16
user_feature_columns = [
    SparseFeat('user_id', feature_max_idx['user_id'], embedding_dim),
    SparseFeat('gender', feature_max_idx['gender'], embedding_dim),
    SparseFeat('age', feature_max_idx['age'], embedding_dim),
    SparseFeat('occupation', feature_max_idx['occupation'], embedding_dim),
    SparseFeat('zip', feature_max_idx['zip'], embedding_dim),
    VarLenSparseFeat(
        SparseFeat('hist_movie_id', feature_max_idx['movie_id'], embedding_dim, embedding_name='movie_id'),
        SEQ_LEN, 'mean', 'hist_len'),
]
item_feature_columns = [SparseFeat('movie_id', feature_max_idx['movie_id'], embedding_dim)]
# Build and train model
K.set_learning_phase(True)
model = YoutubeDNN(user_feature_columns, item_feature_columns, num_sampled=5, user_dnn_hidden_units=(64, 16))
model.compile(optimizer='adagrad', loss=sampledsoftmaxloss)
model.fit(train_model_input, train_label, batch_size=256, epochs=1, verbose=1)
# Export embeddings
user_embedding_model = Model(inputs=model.user_input, outputs=model.user_embedding)
item_embedding_model = Model(inputs=model.item_input, outputs=model.item_embedding)
user_embs = user_embedding_model.predict(test_model_input, batch_size=2**12)
item_embs = item_embedding_model.predict({"movie_id": item_profile['movie_id'].values, "movie_idx": item_profile['movie_id'].values}, batch_size=2**12)
# Optional ANN search with Faiss
import faiss, numpy as np
index = faiss.IndexFlatIP(embedding_dim)
index.add(item_embs)
D, I = index.search(user_embs, 50)
# Compute recall and hit rate ...
The article also lists contributors, provides contact information, and includes a recruitment notice for positions at Alibaba’s commercial machine intelligence department, inviting interested candidates to apply.
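The example code stops at the comment "Compute recall and hit rate". A minimal, dependency-free sketch of what the search and evaluation steps compute is given here. It is an illustration under stated assumptions, not the library's own evaluation code: the function names `inner_product_search`, `recall_at_n`, and `evaluate` are hypothetical, the vectors are toy values, and each user is assumed to have a known list of held-out items. The exhaustive loop plays the role of an exact inner-product index, returning item row positions per user.

```python
import heapq


def inner_product_search(user_vecs, item_vecs, k):
    """Exhaustive top-k search by inner product.

    Returns, for each user, the row positions of the k best-scoring items.
    """
    results = []
    for u in user_vecs:
        scores = [sum(x * y for x, y in zip(u, v)) for v in item_vecs]
        top = heapq.nlargest(k, range(len(item_vecs)), key=scores.__getitem__)
        results.append(top)
    return results


def recall_at_n(relevant, retrieved, n):
    """Fraction of the relevant items found in the first n retrieved items."""
    return len(set(relevant) & set(retrieved[:n])) / len(relevant)


def evaluate(held_out, retrieved_per_user, n):
    """Average recall@n and hit rate (share of users with at least one hit)."""
    recalls = []
    hits = 0
    for relevant, retrieved in zip(held_out, retrieved_per_user):
        recalls.append(recall_at_n(relevant, retrieved, n))
        hits += int(any(item in retrieved[:n] for item in relevant))
    return sum(recalls) / len(recalls), hits / len(held_out)


# Toy data: 4 items and 2 users in a 2-dimensional embedding space.
item_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
user_vecs = [[1.0, 0.5], [0.1, 1.0]]
# Assumed held-out item positions per user (one each, as in leave-one-out splits).
held_out = [[0], [3]]

retrieved = inner_product_search(user_vecs, item_vecs, k=2)
avg_recall, hit_rate = evaluate(held_out, retrieved, n=2)
print(retrieved, avg_recall, hit_rate)
```

When the search is done with Faiss instead, the indices it returns are row positions in the matrix passed to `index.add`, so they need to be mapped back to the corresponding `movie_id` values before being compared with the held-out items.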
Finally, readers are encouraged to star the GitHub repository, join the DataFunTalk community, and follow additional recommended articles linked at the end of the page.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
