
Introducing DeepMatch: An Open‑Source Library for Deep Retrieval Matching Algorithms

DeepMatch is an open‑source Python library that implements several mainstream deep‑learning based recall‑matching algorithms, provides easy installation via pip, detailed usage examples with code, and supports exporting user and item vectors for ANN search, making it ideal for rapid experimentation and learning in recommendation systems.

DataFunTalk


By way of background, modern recommendation and advertising systems follow a two‑stage architecture: a recall (matching) stage first narrows a catalog of millions of candidates down to a few hundred, and a ranking stage then scores those candidates precisely. The author’s experience building a recommendation system with vector‑based recall motivated the creation of DeepMatch.
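To make the two‑stage idea concrete, here is a minimal, hypothetical sketch (the `recall` and `rank` functions and all names are illustrative, not DeepMatch APIs): a cheap dot‑product recall stage narrows a large catalog to a short candidate list, and a heavier ranking stage re‑scores only those candidates.

```python
import numpy as np

def recall(user_vec, item_vecs, top_n=100):
    """Stage 1: cheap dot-product retrieval over the full catalog."""
    scores = item_vecs @ user_vec
    return np.argsort(-scores)[:top_n]

def rank(user_vec, item_vecs, candidate_ids, top_k=10):
    """Stage 2: a heavier model re-scores only the recalled candidates.

    Here a dot product stands in for the ranking model.
    """
    scores = item_vecs[candidate_ids] @ user_vec
    order = np.argsort(-scores)[:top_k]
    return candidate_ids[order]

rng = np.random.default_rng(0)
item_vecs = rng.normal(size=(10_000, 16)).astype("float32")  # toy catalog
user_vec = rng.normal(size=16).astype("float32")

candidates = recall(user_vec, item_vecs, top_n=100)       # 10,000 -> 100
final = rank(user_vec, item_vecs, candidates, top_k=10)   # 100 -> 10
```

In a production system the recall stage would be an ANN index over precomputed item vectors (as DeepMatch supports) rather than a brute‑force scan.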

Installation is straightforward via pip:

pip install -U deepmatch

Documentation and examples are available at https://deepmatch.readthedocs.io/en/latest/ . The repository can be found at https://github.com/shenweichen/DeepMatch .

An example using the YoutubeDNN model demonstrates how to train a recall model, export user and item embeddings, and optionally perform ANN search with Faiss. The full example code is shown below:

import pandas as pd
from deepctr.inputs import SparseFeat, VarLenSparseFeat
from preprocess import gen_data_set, gen_model_input
from sklearn.preprocessing import LabelEncoder
from tensorflow.python.keras import backend as K
from tensorflow.python.keras.models import Model

from deepmatch.models import *
from deepmatch.utils import sampledsoftmaxloss

# Load data
data = pd.read_csv("./movielens_sample.txt")

# Define sparse features
sparse_features = ["movie_id", "user_id", "gender", "age", "occupation", "zip"]
SEQ_LEN = 50
negsample = 0

# Encode features
features = ['user_id', 'movie_id', 'gender', 'age', 'occupation', 'zip']
feature_max_idx = {}
for feature in features:
    lbe = LabelEncoder()
    data[feature] = lbe.fit_transform(data[feature]) + 1
    feature_max_idx[feature] = data[feature].max() + 1

# Build user and item profiles
user_profile = data[["user_id", "gender", "age", "occupation", "zip"]].drop_duplicates('user_id')
item_profile = data[["movie_id"]].drop_duplicates('movie_id')
user_profile.set_index("user_id", inplace=True)
user_item_list = data.groupby("user_id")["movie_id"].apply(list)

# Generate train/test sets
train_set, test_set = gen_data_set(data, negsample)
train_model_input, train_label = gen_model_input(train_set, user_profile, SEQ_LEN)
test_model_input, test_label = gen_model_input(test_set, user_profile, SEQ_LEN)

# Feature columns
embedding_dim = 16
user_feature_columns = [
    SparseFeat('user_id', feature_max_idx['user_id'], embedding_dim),
    SparseFeat('gender', feature_max_idx['gender'], embedding_dim),
    SparseFeat('age', feature_max_idx['age'], embedding_dim),
    SparseFeat('occupation', feature_max_idx['occupation'], embedding_dim),
    SparseFeat('zip', feature_max_idx['zip'], embedding_dim),
    VarLenSparseFeat(SparseFeat('hist_movie_id', feature_max_idx['movie_id'], embedding_dim,
                                embedding_name='movie_id'), SEQ_LEN, 'mean', 'hist_len'),
]
item_feature_columns = [SparseFeat('movie_id', feature_max_idx['movie_id'], embedding_dim)]

# Build and train the model
K.set_learning_phase(True)
model = YoutubeDNN(user_feature_columns, item_feature_columns, num_sampled=5,
                   user_dnn_hidden_units=(64, 16))
model.compile(optimizer='adagrad', loss=sampledsoftmaxloss)
model.fit(train_model_input, train_label, batch_size=256, epochs=1, verbose=1)

# Export user and item embeddings
user_embedding_model = Model(inputs=model.user_input, outputs=model.user_embedding)
item_embedding_model = Model(inputs=model.item_input, outputs=model.item_embedding)
all_item_model_input = {"movie_id": item_profile['movie_id'].values}
user_embs = user_embedding_model.predict(test_model_input, batch_size=2 ** 12)
item_embs = item_embedding_model.predict(all_item_model_input, batch_size=2 ** 12)

# Optional ANN search with Faiss
import faiss
import numpy as np

index = faiss.IndexFlatIP(embedding_dim)
index.add(item_embs)
D, I = index.search(np.ascontiguousarray(user_embs), 50)
# Compute recall and hit rate ...
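The example above ends by computing recall and hit rate from the Faiss results. One common way to do this (a self‑contained sketch, not code from the article: `hit_rate_at_k` and the toy data are illustrative) is to check, for each user, whether the held‑out item appears among the top‑k retrieved item indices in `I`:

```python
import numpy as np

def hit_rate_at_k(I, ground_truth, k=50):
    """Fraction of users whose held-out item appears in their top-k list.

    I            : (n_users, >=k) matrix of retrieved item indices (e.g. from Faiss).
    ground_truth : length-n_users array with each user's held-out item index.
    """
    hits = [gt in row[:k] for row, gt in zip(I, ground_truth)]
    return float(np.mean(hits))

# Toy illustration: 4 users, top-3 retrieved items each.
I = np.array([[5, 2, 9],
              [1, 4, 7],
              [3, 8, 0],
              [6, 2, 5]])
ground_truth = np.array([2, 7, 9, 6])  # one held-out item per user
print(hit_rate_at_k(I, ground_truth, k=3))  # 3 of 4 users hit -> 0.75
```

Note that Faiss returns positions in the index, so in the full example the rows of `I` must be mapped back to movie IDs via `item_profile` before comparing against held‑out items.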

The article also lists contributors, provides contact information, and includes a recruitment notice for positions at Alibaba’s commercial machine intelligence department, inviting interested candidates to apply.

Finally, readers are encouraged to star the GitHub repository, join the DataFunTalk community, and follow additional recommended articles linked at the end of the page.

Tags: Python, Deep Learning, Open Source, Recommendation Systems, Vector Retrieval, ANN
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
