Building an Offline Recommendation System with Mahout: Practical Steps and Tips
This article walks through the end‑to‑end process of building an offline recommendation system with Mahout: data collection, filtering, storage, collaborative‑filtering algorithms, similarity measures, evaluation metrics, parameter tuning, A/B testing, and spam‑fighting strategies.
Data Preparation
Typical recommendation data comes from either real‑time user actions or batch logs stored in databases. The author collects logs from anti‑fraud servers via a PHP script, merges them daily, and notes that some records are pre‑filtered to remove obvious spam.
Data Filtering
A Python module implements a chain‑of‑responsibility filter that removes items with only one rating, blacklisted users or items, and other unwanted data before feeding it to the model.
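The article does not show the filter's code; as a rough illustration of the same chain‑of‑responsibility idea, here is a minimal Java sketch (class names, rule names, and the record fields are hypothetical, not from the article):

```java
import java.util.*;
import java.util.function.Predicate;

// Each link in the chain decides whether a rating record survives.
// A record here is (user, item, value); all names are illustrative.
public class RatingFilterChain {
    public record Rating(String user, String item, double value) {}

    private final List<Predicate<Rating>> rules = new ArrayList<>();

    // Add a rule; a rating must pass every rule to be kept.
    public RatingFilterChain add(Predicate<Rating> rule) {
        rules.add(rule);
        return this;
    }

    public List<Rating> apply(List<Rating> input) {
        // Count ratings per item so items with only one rating can be dropped.
        Map<String, Long> counts = new HashMap<>();
        for (Rating r : input) counts.merge(r.item(), 1L, Long::sum);

        List<Rating> out = new ArrayList<>();
        for (Rating r : input) {
            if (counts.get(r.item()) < 2) continue;          // single-rating items
            if (rules.stream().allMatch(p -> p.test(r))) out.add(r);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> blacklistedUsers = Set.of("spammer");
        List<Rating> kept = new RatingFilterChain()
            .add(r -> !blacklistedUsers.contains(r.user()))  // blacklist rule
            .apply(List.of(
                new Rating("alice", "i1", 4.0),
                new Rating("bob",   "i1", 5.0),
                new Rating("carol", "i2", 3.0),              // i2 rated once -> dropped
                new Rating("spammer", "i1", 1.0)));          // blacklisted -> dropped
        System.out.println(kept.size());                     // 2 survive
    }
}
```

New rules (e.g., more blacklists) can be chained with further add() calls without touching the apply logic.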
Data Storage
Storage strategy depends on the chosen algorithm and infrastructure; the author uses incremental daily processing with weekly rollbacks, grouping 40 items per user in a bitmap‑like structure.
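The article gives no code for this structure; a rough sketch of a bitmap‑like per‑user store, assuming items are numbered and capped at 40 per user as described (everything else is illustrative):

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

// Bitmap-like storage: one BitSet per user, where bit i set = user has item i.
// The 40-item cap mirrors the article's grouping; the rest is a sketch.
public class UserItemBitmap {
    private static final int MAX_ITEMS_PER_USER = 40;
    private final Map<String, BitSet> store = new HashMap<>();

    // Record that a user interacted with an item; refuse once the cap is hit.
    public boolean add(String user, int itemId) {
        BitSet bits = store.computeIfAbsent(user, u -> new BitSet());
        if (bits.cardinality() >= MAX_ITEMS_PER_USER) return false;
        bits.set(itemId);
        return true;
    }

    public boolean has(String user, int itemId) {
        BitSet bits = store.get(user);
        return bits != null && bits.get(itemId);
    }

    public int count(String user) {
        BitSet bits = store.get(user);
        return bits == null ? 0 : bits.cardinality();
    }
}
```

A BitSet keeps membership checks O(1) and the per-user footprint small, which suits incremental daily merges.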
Recommendation Algorithms (Mahout)
The core algorithms are implemented with Mahout. User‑based collaborative filtering first computes a user similarity:

```java
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
```

then defines a neighborhood around each user:

```java
UserNeighborhood neighborhood = new NearestNUserNeighborhood(2, similarity, model);
```

The recommender is built from these pieces:

```java
Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
```

and evaluated with an evaluator that splits the data into 95% training and 5% test.
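The evaluator reports, for instance, the average absolute difference between predicted and held‑out actual ratings (Mahout's AverageAbsoluteDifferenceRecommenderEvaluator works this way). A dependency‑free sketch of that metric:

```java
// Mean absolute error between predicted ratings and the held-out actual
// ratings from the 5% test split; smaller is better.
public class MaeSketch {
    public static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0)
            throw new IllegalArgumentException("need equal-length, non-empty arrays");
        double sum = 0;
        for (int i = 0; i < predicted.length; i++)
            sum += Math.abs(predicted[i] - actual[i]);
        return sum / predicted.length;
    }
}
```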
Neighborhood Types
Fixed‑size neighborhoods (e.g., 2 nearest neighbors)
Threshold‑based neighborhoods (e.g., new ThresholdUserNeighborhood(0.7, similarity, model))
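The difference between the two neighborhood types can be shown without Mahout; a sketch over one user's precomputed similarity scores (the helper names and values are made up):

```java
import java.util.*;

// Two ways to pick a neighborhood from one user's similarity scores:
// the N most similar users, or every user above a similarity threshold.
public class NeighborhoodSketch {
    public static List<String> nearestN(Map<String, Double> sims, int n) {
        return sims.entrySet().stream()
            .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .toList();
    }

    public static List<String> aboveThreshold(Map<String, Double> sims, double threshold) {
        return sims.entrySet().stream()
            .filter(e -> e.getValue() >= threshold)
            .map(Map.Entry::getKey)
            .sorted()                 // sort for deterministic output
            .toList();
    }
}
```

Fixed-size neighborhoods guarantee N neighbors even when they are weakly similar; threshold neighborhoods guarantee quality but may return very few (or very many) users.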
Algorithm Variants
Item‑based collaborative filtering (fast because there are typically far fewer items than users)
Slope‑One (new SlopeOneRecommender(model)) with count weighting or standard‑deviation weighting
Singular Value Decomposition (SVD) via new SVDRecommender(model, new ALSWRFactorizer(model, 10, 0.05, 10))
K‑Nearest‑Neighbor item‑based (new KnnItemBasedRecommender(model, similarity, optimizer, 10))
Cluster‑based recommendation using TreeClusteringRecommender with FarthestNeighborClusterSimilarity
Similarity Measures
Mahout provides several similarity implementations:
PearsonCorrelationSimilarity (range –1 to 1, works on centered data)
EuclideanDistanceSimilarity (returns 1/(1+d))
CosineSimilarity (equivalent to Pearson on zero‑mean data)
SpearmanCorrelationSimilarity (rank‑based, slower)
TanimotoCoefficientSimilarity (ignores preference values; appropriate when the values are more noise than signal)
LogLikelihoodSimilarity (measures how unlikely co‑occurrences are by chance)
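As a rough illustration of two of these formulas (not Mahout's actual implementations), Pearson correlation and the 1/(1+d) Euclidean mapping can be written as:

```java
// Plain-Java versions of two similarity formulas listed above:
// Pearson correlation over co-rated items, and Euclidean distance mapped to 1/(1+d).
public class SimilaritySketch {
    // Pearson correlation: covariance over the product of standard deviations, in [-1, 1].
    public static double pearson(double[] a, double[] b) {
        int n = a.length;
        double meanA = 0, meanB = 0;
        for (int i = 0; i < n; i++) { meanA += a[i]; meanB += b[i]; }
        meanA /= n; meanB /= n;
        double num = 0, denA = 0, denB = 0;
        for (int i = 0; i < n; i++) {
            double da = a[i] - meanA, db = b[i] - meanB;
            num += da * db; denA += da * da; denB += db * db;
        }
        return num / Math.sqrt(denA * denB);
    }

    // Euclidean similarity: 1 for identical vectors, approaching 0 as distance grows.
    public static double euclideanSimilarity(double[] a, double[] b) {
        double d2 = 0;
        for (int i = 0; i < a.length; i++) d2 += (a[i] - b[i]) * (a[i] - b[i]);
        return 1.0 / (1.0 + Math.sqrt(d2));
    }
}
```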
Preference Inference
For sparse data, Mahout offers AveragingPreferenceInferrer via setPreferenceInferrer(), though it rarely improves results in practice.
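The idea behind an averaging inferrer is simple enough to sketch without Mahout: a missing preference is estimated as the mean of the user's known preferences (this mirrors the class name, not necessarily its exact code):

```java
// Estimate an unknown preference as the average of the user's known ratings,
// which is the idea suggested by the name AveragingPreferenceInferrer.
public class AveragingInferrerSketch {
    public static double inferPreference(double[] knownRatings) {
        if (knownRatings.length == 0)
            throw new IllegalArgumentException("user has no ratings to average");
        double sum = 0;
        for (double r : knownRatings) sum += r;
        return sum / knownRatings.length;
    }
}
```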
Evaluation
Algorithms are compared using precision, recall, and coverage. Offline tests split the data into training and test sets; results are visualized with tools like Highcharts.
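These metrics can be computed from a recommended set and a held‑out relevant set; a minimal sketch using the standard definitions (the code itself is illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Standard top-N evaluation metrics: precision = hits / recommended,
// recall = hits / relevant, coverage = distinct items ever recommended / catalog size.
public class TopNMetrics {
    public static double precision(Set<String> recommended, Set<String> relevant) {
        return (double) hits(recommended, relevant) / recommended.size();
    }

    public static double recall(Set<String> recommended, Set<String> relevant) {
        return (double) hits(recommended, relevant) / relevant.size();
    }

    public static double coverage(Set<String> everRecommended, int catalogSize) {
        return (double) everRecommended.size() / catalogSize;
    }

    private static int hits(Set<String> recommended, Set<String> relevant) {
        Set<String> hit = new HashSet<>(recommended);
        hit.retainAll(relevant);      // intersection = correctly recommended items
        return hit.size();
    }
}
```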
Parameter Tuning & Deployment
Before going live, run both offline tests and A/B tests. Offline testing evaluates metrics on sampled data, while A/B testing measures real‑world impact, such as PV/UV conversion rates, by routing users to different algorithm versions.
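Routing users to algorithm versions is commonly done by hashing the user ID into a stable bucket, so each user always sees the same variant; a minimal sketch (the bucketing scheme is an assumption, not from the article):

```java
// Deterministic A/B assignment: hash the user ID into one of 100 buckets,
// then assign variant "B" to the first percentInB buckets.
public class AbBucketing {
    public static String variant(String userId, int percentInB) {
        int bucket = Math.floorMod(userId.hashCode(), 100);  // stable value in 0..99
        return bucket < percentInB ? "B" : "A";
    }
}
```

Because the assignment depends only on the user ID, conversion metrics for each group can be aggregated over any time window without session state.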
Spam Fighting & Rule Engine
Spam attacks are categorized as average, random, or nuke. Robustness improvements and rule‑based detection (e.g., IP checks, account‑similarity checks, abnormal page views) are applied both during data collection and after recommendation generation.
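A rule engine of the kind described can be sketched as a list of independently evaluated rules, flagging an event when any rule fires; the thresholds, IP range, and event fields below are illustrative, not from the article:

```java
import java.util.List;
import java.util.function.Predicate;

// Rule-based spam detection: an event is suspicious if any rule fires.
// All thresholds and field names here are made-up examples.
public class SpamRules {
    public record Event(String ip, int pageViewsPerHour, int ratingsPerHour) {}

    private static final List<Predicate<Event>> RULES = List.of(
        e -> e.ip().startsWith("10.66."),        // example blacklisted IP range
        e -> e.pageViewsPerHour() > 1000,        // abnormal PV volume
        e -> e.ratingsPerHour() > 200            // implausible rating rate
    );

    public static boolean isSuspicious(Event e) {
        return RULES.stream().anyMatch(rule -> rule.test(e));
    }
}
```

Running the same rules at collection time and again after recommendation generation, as the article suggests, catches both raw-log spam and attacks that only become visible in the recommender's output.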
Visualization
Parameter effects are plotted using Highcharts (http://www.highcharts.com/), allowing quick comparison of metrics across algorithm variants.
Source: Douban
