Big Data 16 min read

How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains

Facebook replaced a multi‑stage Hive pipeline for real‑time entity ranking with a single Spark job, applying extensive reliability fixes and performance tweaks that reduced CPU usage by up to six times, cut latency fivefold, and demonstrated the feasibility of shuffling over 90 TB of data in production.

ITPUB

Dec 27, 2019

How Facebook Scaled Entity Ranking from Hive to Spark: Lessons and Performance Gains

Background and Motivation

Facebook processes tens of terabytes per query in its analytics engine, historically using Hive and an internal MapReduce system (Corona). To support growing data volumes and improve maintainability, Facebook explored replacing Hive jobs with Apache Spark, which can handle batch, streaming, graph, and machine‑learning workloads on a unified API.

Use Case: Entity Ranking Feature Preparation

Real‑time entity ranking relies on features generated offline by Hive jobs and loaded into online query systems. The original Hive pipeline consisted of hundreds of small jobs across three logical stages, taking about three days to run and being difficult to monitor.

Previous Hive Implementation

The three stages were:

Filter out irrelevant features and noise.

Aggregate per (entity_id, target_id) pair.

Shard the result and generate custom index files via a UDF for online lookup.

This approach required massive parallelism, produced many output files, and lacked a clear way to track overall progress or ETA.

Spark Implementation

Rather than rewriting the entire pipeline, the team first targeted the most resource‑intensive second stage. Starting with a 50 GB compressed sample, they scaled up to 20 TB, discovering that the job generated an excessive number of ~100 MB output files, causing three hours of a ten‑hour run to be spent moving files in HDFS.

They introduced three key changes:

Made PipedRDD tolerant to node restarts (SPARK‑13793).

Made the maximum fetch‑failure count configurable, increasing it from 4 to 20 (SPARK‑13369).

Implemented a less‑disruptive cluster restart by preserving shuffle files and pausing task scheduling during driver restarts.

Additional reliability fixes addressed driver unresponsiveness, excessive speculation, a TimSort integer‑overflow bug, shuffle service connection limits, executor OOM issues, and more.

Performance Optimizations

After stabilizing the job, the team focused on speed:

Fixed a memory‑leak in the sorter (SPARK‑14363) – 30% CPU gain.

Replaced JNI‑based Snappy compression with a pure‑Java copy (SPARK‑14277) – 10% gain.

Reduced shuffle‑write latency by avoiding repeated file open/close (SPARK‑5581) – ~50% gain.

Eliminated duplicate task runs on fetch failures (SPARK‑14649).

Made PipedRDD buffer size configurable (SPARK‑14542) – ~10% gain.

Cached shuffle index files (SPARK‑15074) – 50% reduction in fetch time.

Reduced overhead of shuffle‑bytes‑written metrics (SPARK‑15569) – ~20% gain.

Increased initial sorter buffer size (SPARK‑15958) – ~5% gain.

Adjusted input split size to limit task count, improving overall throughput.

Reliability Fixes

Key reliability improvements included making PipedRDD tolerant to node restarts, configuring fetch‑failure retries, and enhancing the shuffle service to survive cluster restarts.

Performance Comparison

The Spark job processed 60 TB of compressed input, performed 90 TB of shuffle and sort, and ran 250,000 tasks in about 10 hours. Compared with the original Hive pipeline, Spark achieved 4.5–6× CPU performance improvement, 3–4× resource savings, and roughly 5× lower latency.

Conclusion and Future Work

By consolidating hundreds of Hive jobs into a single, well‑tuned Spark job, Facebook demonstrated that Spark can reliably shuffle and sort over 90 TB of intermediate data, run hundreds of thousands of tasks, and deliver substantial performance and resource efficiency gains. The reliability and performance improvements are being contributed back to the open‑source Spark project, and the approach is now used for other large‑scale analytics workloads.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Performance Optimization Big Data Hive Reliability Spark

Written by

ITPUB

Official ITPUB account sharing technical insights, community news, and exciting events.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.