Building an Offline Recommendation System with Mahout: Practical Steps and Tips
This article walks through the end‑to‑end process of building an offline recommendation system with Mahout: data collection, filtering, storage, collaborative‑filtering algorithms, similarity measures, evaluation metrics, parameter tuning, A/B testing, and spam‑fighting strategies.
Data Preparation
Typical recommendation data comes from either real‑time user actions or batch logs stored in databases. The author collects logs from anti‑fraud servers via a PHP script, merges them daily, and notes that some records are pre‑filtered to remove obvious spam.
Data Filtering
A Python module implements a chain‑of‑responsibility filter that removes items with only one rating, blacklisted users or items, and other unwanted data before feeding it to the model.
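The article does not show the filter's code; as a rough illustration of the same chain‑of‑responsibility idea, here is a minimal Java sketch (class names, rule names, and the record fields are hypothetical, not from the article):

```java
import java.util.*;
import java.util.function.Predicate;

// Each link in the chain decides whether a rating record survives.
// A record here is (user, item, value); all names are illustrative.
public class RatingFilterChain {
    public record Rating(String user, String item, double value) {}

    private final List<Predicate<Rating>> rules = new ArrayList<>();

    // Add a rule; a rating must pass every rule to be kept.
    public RatingFilterChain add(Predicate<Rating> rule) {
        rules.add(rule);
        return this;
    }

    public List<Rating> apply(List<Rating> input) {
        // Count ratings per item so items with only one rating can be dropped.
        Map<String, Long> counts = new HashMap<>();
        for (Rating r : input) counts.merge(r.item(), 1L, Long::sum);

        List<Rating> out = new ArrayList<>();
        for (Rating r : input) {
            if (counts.get(r.item()) < 2) continue;          // single-rating items
            if (rules.stream().allMatch(p -> p.test(r))) out.add(r);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> blacklistedUsers = Set.of("spammer");
        List<Rating> kept = new RatingFilterChain()
            .add(r -> !blacklistedUsers.contains(r.user()))  // blacklist rule
            .apply(List.of(
                new Rating("alice", "i1", 4.0),
                new Rating("bob",   "i1", 5.0),
                new Rating("carol", "i2", 3.0),              // i2 rated once -> dropped
                new Rating("spammer", "i1", 1.0)));          // blacklisted -> dropped
        System.out.println(kept.size());                     // 2 survive
    }
}
```

New rules (e.g., more blacklists) can be chained with further add() calls without touching the apply logic.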
Data Storage
Storage strategy depends on the chosen algorithm and infrastructure; the author uses incremental daily processing with weekly rollbacks, grouping 40 items per user in a bitmap‑like structure.
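The article gives no code for this structure; a rough sketch of a bitmap‑like per‑user store, assuming items are numbered and capped at 40 per user as described (everything else is illustrative):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Bitmap-like storage: one BitSet per user, where bit i set = user has item i.
// The 40-item cap mirrors the article's grouping; the rest is a sketch.
public class UserItemBitmap {
    private static final int MAX_ITEMS_PER_USER = 40;
    private final Map<String, BitSet> store = new HashMap<>();

    // Record that a user interacted with an item; refuse once the cap is hit.
    public boolean add(String user, int itemId) {
        BitSet bits = store.computeIfAbsent(user, u -> new BitSet());
        if (bits.cardinality() >= MAX_ITEMS_PER_USER) return false;
        bits.set(itemId);
        return true;
    }

    public boolean has(String user, int itemId) {
        BitSet bits = store.get(user);
        return bits != null && bits.get(itemId);
    }

    public int count(String user) {
        BitSet bits = store.get(user);
        return bits == null ? 0 : bits.cardinality();
    }
}
```

A BitSet keeps membership checks O(1) and the per-user footprint small, which suits incremental daily merges.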
Recommendation Algorithms (Mahout)
The core algorithms are implemented with Mahout. User‑based collaborative filtering first computes a user similarity:

```java
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
```

then defines a neighborhood around each user:

```java
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
```

The recommender is built from these pieces:

```java
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
```

and evaluated with an evaluator that splits the data into 95% training and 5% test.
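The evaluator reports, for instance, the average absolute difference between predicted and held‑out actual ratings (Mahout's AverageAbsoluteDifferenceRecommenderEvaluator works this way). A dependency‑free sketch of that metric:

```java
// Mean absolute error between predicted ratings and the held-out actual
// ratings from the 5% test split; smaller is better.
public class MaeSketch {
    public static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0)
            throw new IllegalArgumentException("need equal-length, non-empty arrays");
        double sum = 0;
        for (int i = 0; i < predicted.length; i++)
            sum += Math.abs(predicted[i] - actual[i]);
        return sum / predicted.length;
    }
}
```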
Neighborhood Types
Fixed‑size neighborhoods (e.g., 2 nearest neighbors)
Threshold‑based neighborhoods (e.g., new ThresholdUserNeighborhood(0.7, similarity, model))
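The difference between the two neighborhood types can be shown without Mahout; a sketch over one user's precomputed similarity scores (the helper names and values are made up):

```java
import java.util.*;

// Two ways to pick a neighborhood from one user's similarity scores:
// the N most similar users, or every user above a similarity threshold.
public class NeighborhoodSketch {
    public static List<String> nearestN(Map<String, Double> sims, int n) {
        return sims.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .toList();
    }

    public static List<String> aboveThreshold(Map<String, Double> sims, double threshold) {
        return sims.entrySet().stream()
            .filter(e -> e.getValue() >= threshold)
            .map(Map.Entry::getKey)
            .sorted()                 // sort for deterministic output
            .toList();
    }
}
```

Fixed-size neighborhoods guarantee N neighbors even when they are weakly similar; threshold neighborhoods guarantee quality but may return very few (or very many) users.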
Algorithm Variants
Item‑based collaborative filtering (fast because there are typically far fewer items than users)
Slope‑One (new SlopeOneRecommender(model)) with count weighting or standard‑deviation weighting
Singular Value Decomposition (SVD) via new SVDRecommender(model, new ALSWRFactorizer(model, 10, 0.05, 10))
K‑Nearest‑Neighbor item‑based (new KnnItemBasedRecommender(model, similarity, optimizer, 10))
Cluster‑based recommendation using TreeClusteringRecommender with FarthestNeighborClusterSimilarity
Similarity Measures
Mahout provides several similarity implementations:
PearsonCorrelationSimilarity (range –1 to 1, works on centered data)
EuclideanDistanceSimilarity (returns 1/(1+d))
CosineSimilarity (equivalent to Pearson on zero‑mean data)
SpearmanCorrelationSimilarity (rank‑based, slower)
TanimotoCoefficientSimilarity (ignores preference values; appropriate when the values are more noise than signal)
LogLikelihoodSimilarity (measures how unlikely co‑occurrences are by chance)
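As a rough illustration of two of these formulas (not Mahout's actual implementations), Pearson correlation and the 1/(1+d) Euclidean mapping can be written as:

```java
// Plain-Java versions of two similarity formulas listed above:
// Pearson correlation over co-rated items, and Euclidean distance mapped to 1/(1+d).
public class SimilaritySketch {
    // Pearson correlation: covariance over the product of standard deviations, in [-1, 1].
    public static double pearson(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n; meanB /= n;
        double num = 0, denA = 0, denB = 0;
        for (int i = 0; i < n; i++) {
            double da = a[i] - meanA, db = b[i] - meanB;
            num += da * db; denA += da * da; denB += db * db;
        }
        return num / Math.sqrt(denA * denB);
    }

    // Euclidean similarity: 1 for identical vectors, approaching 0 as distance grows.
    public static double euclideanSimilarity(double[] a, double[] b) {
        double d2 = 0;
        for (int i = 0; i < a.length; i++) d2 += (a[i] - b[i]) * (a[i] - b[i]);
        return 1.0 / (1.0 + Math.sqrt(d2));
    }
}
```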
Preference Inference
For sparse data, Mahout offers AveragingPreferenceInferrer via setPreferenceInferrer(), though it rarely improves results in practice.
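The idea behind an averaging inferrer is simple enough to sketch without Mahout: a missing preference is estimated as the mean of the user's known preferences (this mirrors the class name, not necessarily its exact code):

```java
// Estimate an unknown preference as the average of the user's known ratings,
// which is the idea suggested by the name AveragingPreferenceInferrer.
public class AveragingInferrerSketch {
    public static double inferPreference(double[] knownRatings) {
        if (knownRatings.length == 0)
            throw new IllegalArgumentException("user has no ratings to average");
        double sum = 0;
        for (double r : knownRatings) sum += r;
        return sum / knownRatings.length;
    }
}
```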
Evaluation
Algorithms are compared using precision, recall, and coverage. Offline tests split the data into training and test sets; results are visualized with tools like Highcharts.
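These metrics can be computed from a recommended set and a held‑out relevant set; a minimal sketch using the standard definitions (the code itself is illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Standard top-N evaluation metrics: precision = hits / recommended,
// recall = hits / relevant, coverage = distinct items ever recommended / catalog size.
public class TopNMetrics {
    public static double precision(Set<String> recommended, Set<String> relevant) {
        return (double) hits(recommended, relevant) / recommended.size();
    }

    public static double recall(Set<String> recommended, Set<String> relevant) {
        return (double) hits(recommended, relevant) / relevant.size();
    }

    public static double coverage(Set<String> everRecommended, int catalogSize) {
        return (double) everRecommended.size() / catalogSize;
    }

    private static int hits(Set<String> recommended, Set<String> relevant) {
        Set<String> hit = new HashSet<>(recommended);
        hit.retainAll(relevant);      // intersection = correctly recommended items
        return hit.size();
    }
}
```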
Parameter Tuning & Deployment
Before going live, run both offline tests and A/B tests. Offline testing evaluates metrics on sampled data, while A/B testing measures real‑world impact, such as PV/UV conversion rates, by routing users to different algorithm versions.
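Routing users to algorithm versions is commonly done by hashing the user ID into a stable bucket, so each user always sees the same variant; a minimal sketch (the bucketing scheme is an assumption, not from the article):

```java
// Deterministic A/B assignment: hash the user ID into one of 100 buckets,
// then assign variant "B" to the first percentInB buckets.
public class AbBucketing {
    public static String variant(String userId, int percentInB) {
        int bucket = Math.floorMod(userId.hashCode(), 100);  // stable value in 0..99
        return bucket < percentInB ? "B" : "A";
    }
}
```

Because the assignment depends only on the user ID, conversion metrics for each group can be aggregated over any time window without session state.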
Spam Fighting & Rule Engine
Spam attacks are categorized as average, random, or nuke. Robustness improvements and rule‑based detection (e.g., IP checks, account‑similarity checks, abnormal page views) are applied both during data collection and after recommendation generation.
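A rule engine of the kind described can be sketched as a list of independently evaluated rules, flagging an event when any rule fires; the thresholds, IP range, and event fields below are illustrative, not from the article:

```java
import java.util.List;
import java.util.function.Predicate;

// Rule-based spam detection: an event is suspicious if any rule fires.
// All thresholds and field names here are made-up examples.
public class SpamRules {
    public record Event(String ip, int pageViewsPerHour, int ratingsPerHour) {}

    private static final List<Predicate<Event>> RULES = List.of(
        e -> e.ip().startsWith("10.66."),        // example blacklisted IP range
        e -> e.pageViewsPerHour() > 1000,        // abnormal PV volume
        e -> e.ratingsPerHour() > 200            // implausible rating rate
    );

    public static boolean isSuspicious(Event e) {
        return RULES.stream().anyMatch(rule -> rule.test(e));
    }
}
```

Running the same rules at collection time and again after recommendation generation, as the article suggests, catches both raw-log spam and attacks that only become visible in the recommender's output.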
Visualization
Parameter effects are plotted using Highcharts (http://www.highcharts.com/), allowing quick comparison of metrics across algorithm variants.
Source: Douban
