How Facebook Scales Recommendations with Distributed Machine Learning and Giraph

This article explains how Facebook tackles massive recommendation data—over 100 billion ratings—by using distributed collaborative filtering, matrix factorization, SGD/ALS hybrid algorithms, and a novel work‑to‑work communication scheme built on Apache Giraph to achieve high performance and scalability.

21CTO
21CTO
21CTO
How Facebook Scales Recommendations with Distributed Machine Learning and Giraph

To ensure user experience, recommendation systems traditionally train on complete datasets, but the rapid growth of input data makes centralized machine‑learning algorithms insufficient, prompting the adoption of distributed machine‑learning approaches. Facebook, as the world’s largest social platform, therefore employs a distributed recommendation system.

Facebook’s recommendation data comprises roughly 100 billion ratings, over 1 billion users, and millions of items—far exceeding the scale of the Netflix Prize. To handle this, Facebook built a new system on top of Apache Giraph, a distributed iterative graph‑processing platform suited for large‑scale data.

The system relies on collaborative filtering (CF), predicting a user’s preference by analyzing ratings from similar users. Mathematically, this involves matrix factorization (MF), representing the rating matrix R as the product of a user matrix and an item matrix, and minimizing the distance between R and the reconstructed matrix R′.

Because solving MF on massive data is computationally intensive, iterative algorithms that start from random feature vectors are used to reduce time and space complexity. Stochastic Gradient Descent (SGD) traverses the training set randomly, updating user and item feature vectors to reduce prediction error, while Alternating Least Squares (ALS) alternately fixes one set of vectors and solves for the other, seeking a local optimum.

In Giraph’s original model, users and items are vertices and known ratings are edges, so SGD or ALS would require traversing every edge each iteration. This leads to enormous network traffic (e.g., 80 TB per iteration for 100 billion ratings with 100‑dimensional vectors) and uneven node degrees that cause memory bottlenecks. Moreover, Giraph’s implementation does not always use the latest global feature vectors during updates.

To overcome these issues, Facebook devised a work‑to‑work message‑passing scheme. The graph is partitioned into a ring of workers, each containing a subset of items and users. In each step, a worker sends its item‑update information clockwise to the next worker, processing only internal ratings per step. After a number of steps equal to the worker count, all ratings are processed, achieving communication volume independent of the total number of ratings and eliminating degree‑imbalance problems.

Facebook further merged SGD and ALS into a rotational hybrid solving method to boost algorithmic performance.

Performance was evaluated through A/B testing: the system fine‑tuned parameters on a fixed training set, measured prediction accuracy on test sets using metrics such as average item rating, precision for top‑1/10/100 items, overall precision, and Root Mean Squared Error (RMSE).

Even with distributed computation, checking every user‑item pair remains impractical. Facebook explored faster top‑K recommendation techniques, such as using a ball‑tree structure to accelerate vector searches (10‑100× speedup) or clustering items to first select promising groups and then the highest‑scoring items within those groups, trading some accuracy for speed.

Experimental results released in July 2014 showed that Facebook’s hybrid method on Spark MLlib outperformed a standard ALS implementation by roughly tenfold, comfortably handling over 100 billion ratings.

The approach is now deployed across multiple Facebook applications, including page and group recommendations. To reduce system load, only pages/groups with degree > 100 are considered candidates. The initial iteration incorporates both liked and disliked pages/groups, and indirect feedback is captured via ALS. Future work includes leveraging the social graph for better recommendation sets, automating parameter tuning, and exploring improved partitioning strategies.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

collaborative filteringRecommendation SystemsFacebookdistributed machine learningALSSGDApache Giraph
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.