How Facebook Replaced Hundreds of Hive Jobs with a Single Spark Pipeline
Facebook migrated a massive, multi‑stage Hive‑based entity ranking pipeline to a single Spark job, detailing the challenges of scaling to 20 TB inputs, the reliability fixes, performance optimizations, and the resulting 4‑6× CPU speedup and reduced latency.
Background
Facebook relies on data‑driven analysis for product decisions, processing tens of terabytes per query. Historically, many batch analytics ran on Hive and an internal MapReduce system called Corona. To support a growing set of use cases—including graph computation, machine learning (Giraph), and stream processing (Puma, Swift, Stylus)—Facebook expanded its Presto ANSI‑SQL coverage and began evaluating Apache Spark.
Why Move from Hive to Spark
Real‑time entity ranking pipelines used Hive to generate feature values offline, then loaded them into online query systems. The Hive jobs consisted of hundreds of small tasks split by entity_id, consuming large resources and being hard to maintain. Running the full Hive pipeline took about three days and offered little visibility into progress or ETA.
Original Hive Implementation
The Hive workflow had three logical stages, each broken into hundreds of sub‑jobs:
Filter out non‑production features and noise.
Aggregate per (entity_id, target_id) pair.
Shard the table into N partitions and generate a custom index file for online queries via a user‑defined function.
These stages required three days of execution and were difficult to monitor because of the sheer number of fragment jobs.
Transition to Spark
Rather than rewrite the entire pipeline at once, the team focused on the most resource‑intensive second stage. Starting with a 50 GB compressed sample, they scaled to 20 TB, encountering performance and stability issues. At 20 TB, the job generated an excessive number of output files (~100 MB each), and three of ten hours were spent moving files from a staging directory to HDFS.
Two initial solutions were considered: improving bulk rename in HDFS or configuring Spark to emit fewer files. The final approach combined both ideas by collapsing the three Hive stages into a single Spark job that reads 60 TB of compressed data, performs 90 TB of shuffle and sort, and writes the final index.
Reliability Fixes
Made PipedRDD tolerant to node restarts (SPARK‑13793) by redesigning its failure handling.
Made the maximum fetch failure count configurable (SPARK‑13369), increasing it from 4 to 20.
Implemented a less disruptive cluster restart by preserving shuffle files and adding driver‑side task‑scheduling pause.
Removed an O(N²) driver operation that caused driver hangs (SPARK‑13279).
Disabled excessive driver speculation for large jobs.
Fixed a TimSort integer‑overflow bug (SPARK‑13850).
Increased shuffle service threads and backlog to handle many connections.
Fixed executor OOM caused by an unbounded pointer array in the sorter (SPARK‑13958), allowing 24 tasks per node.
Performance Optimizations
Fixed a memory leak in the sorter (SPARK‑14363), gaining ~30% CPU performance.
Optimized Snappy compression by replacing JNI calls with System.ArrayCopy, improving CPU performance by ~10% (SPARK‑14277).
Reduced shuffle write latency by avoiding repeated file open/close (SPARK‑5581), yielding ~50% speedup.
Prevented duplicate task runs after fetch failures (SPARK‑14649), improving stability and performance.
Made PipedRDD buffer size configurable (SPARK‑14542), gaining ~10% speedup.
Cached shuffle index files (SPARK‑15074), cutting shuffle fetch time by ~50%.
Reduced overhead of shuffle‑bytes‑written metrics (SPARK‑15569), improving CPU usage by ~20%.
Increased initial sorter buffer size (SPARK‑15958) to 64 MB, achieving ~5% speedup.
Adjusted number of tasks by configuring input split size to 2 GB, reducing task count eightfold (SPARK‑???).
Performance Comparison
The team measured CPU time, CPU reservation time, and latency for both Hive and Spark runs. CPU time reflects actual CPU usage, while CPU reservation time shows the amount of CPU resources reserved by the cluster scheduler. The Spark pipeline achieved 4.5–6× CPU performance improvement, used 3–4× fewer resources, and reduced latency by roughly fivefold.
Conclusion and Future Work
Facebook’s experience shows that a well‑engineered Spark job can reliably shuffle and sort over 90 TB of intermediate data, run 250 000 tasks in a single job, and deliver substantial performance gains over a legacy Hive pipeline. The reliability and performance improvements made for this entity‑ranking use case are applicable to other large‑scale data‑processing workloads, and the changes have been contributed back to the open‑source Apache Spark project.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
