How Facebook Replaced Hundreds of Hive Jobs with a Single Spark Pipeline

Facebook migrated a large, multi‑stage Hive‑based entity ranking pipeline to a single Spark job. This article covers the challenges of scaling to 20 TB inputs and beyond, the reliability fixes, the performance optimizations, and the resulting 4.5–6× CPU improvement and roughly 5× lower latency.

dbaplus Community

Background

Facebook relies on data‑driven analysis for product decisions, and a single query can process tens of terabytes. Historically, many batch analytics jobs ran on Hive and on Corona, an internal MapReduce implementation. Alongside these, Facebook expanded its use of Presto for ANSI‑SQL queries, ran graph computation and machine learning on Giraph, and handled stream processing with Puma, Swift, and Stylus. To support a growing set of use cases, it also began evaluating Apache Spark.

Why Move from Hive to Spark

Real‑time entity ranking relied on Hive to generate feature values offline, which were then loaded into online query systems. The pipeline was split by entity_id into hundreds of small Hive jobs, which made it resource‑intensive and hard to maintain. A full run took about three days and offered little visibility into progress or ETA.

Original Hive Implementation

The Hive workflow had three logical stages, each broken into hundreds of sub‑jobs:

Filter out non‑production features and noise.

Aggregate per (entity_id, target_id) pair.

Shard the table into N partitions and generate a custom index file for online queries via a user‑defined function.

Together, these stages took about three days to run and were difficult to monitor because of the sheer number of sub‑jobs.
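The three logical stages can be sketched in plain Python on a handful of in-memory rows. This is only an illustration of the data flow, not Facebook's implementation: the row layout, the function names, the use of a sum as the aggregate, and hashing entity_id to pick a shard are all assumptions, and the custom index file produced by the user-defined function is left out.

```python
from collections import defaultdict
from zlib import crc32


def filter_rows(rows, production_features):
    """Stage 1: keep only rows whose feature is a production feature."""
    return [r for r in rows if r["feature"] in production_features]


def aggregate(rows):
    """Stage 2: combine values per (entity_id, target_id) pair.

    A sum is assumed here; the source does not say which aggregate is used.
    """
    totals = defaultdict(float)
    for r in rows:
        totals[(r["entity_id"], r["target_id"])] += r["value"]
    return dict(totals)


def shard(totals, num_shards):
    """Stage 3: place each aggregated pair into one of N shards.

    Sharding by a hash of entity_id is an assumption. crc32 is used because
    it is stable across runs, unlike Python's salted hash() for strings.
    """
    shards = [[] for _ in range(num_shards)]
    for (entity_id, target_id), value in sorted(totals.items()):
        index = crc32(str(entity_id).encode()) % num_shards
        shards[index].append((entity_id, target_id, value))
    return shards
```

Calling shard(aggregate(filter_rows(rows, features)), n) runs all three stages in one pass over the data, with no intermediate table written between them.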

Hive pipeline diagram

Transition to Spark

Rather than rewrite the entire pipeline at once, the team first migrated the second stage, which was the most resource‑intensive. Starting with a 50 GB compressed sample, they scaled up to 20 TB and ran into performance and stability issues along the way. At 20 TB, the job produced too many output files (about 100 MB each), and three of the ten hours of runtime were spent moving files from the staging directory to the final directory in HDFS.

Two solutions were considered initially: improving bulk rename in HDFS, or configuring Spark to emit fewer files. The team ultimately took a third route and collapsed the three Hive stages into a single Spark job, which removes the intermediate tables between stages. That job reads 60 TB of compressed data, performs a 90 TB shuffle and sort, and writes the final index.

Spark job diagram

Reliability Fixes

Made PipedRDD robust to fetch failures caused by node restarts (SPARK‑13793) by redesigning its failure handling.

Made the maximum fetch failure count configurable (SPARK‑13369), increasing it from 4 to 20.

Made cluster restarts less disruptive by preserving shuffle files across restarts and by letting the driver pause task scheduling.

Removed an O(N²) driver operation that caused driver hangs (SPARK‑13279).

Disabled speculation for this job, because the driver spent excessive time on speculation when managing a large number of tasks.

Fixed a TimSort integer‑overflow bug (SPARK‑13850).

Increased shuffle service threads and backlog to handle many connections.

Fixed executor OOM caused by an unbounded pointer array in the sorter (SPARK‑13958), allowing 24 tasks per node.
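Several of the fixes above surface as ordinary Spark settings. The sketch below renders a settings dictionary as spark-submit arguments. The property names are, to the best of our knowledge, the ones current Spark releases use (spark.stage.maxConsecutiveAttempts for the consecutive fetch-failure limit, spark.speculation, spark.shuffle.io.serverThreads, spark.shuffle.io.backLog), but they should be checked against the documentation for the Spark version in use. The value 20 comes from the text; the thread and backlog values are placeholders, not figures from the source, and the helper name is hypothetical.

```python
def spark_conf_flags(settings):
    """Render a dict of Spark properties as spark-submit --conf arguments."""
    flags = []
    for key, value in sorted(settings.items()):
        flags.extend(["--conf", f"{key}={value}"])
    return flags


RELIABILITY_SETTINGS = {
    # Raised from the default of 4 to 20, as described above.
    "spark.stage.maxConsecutiveAttempts": 20,
    # Speculation turned off for this very large job.
    "spark.speculation": "false",
    # Placeholder values: the source gives no numbers for these two.
    "spark.shuffle.io.serverThreads": 128,
    "spark.shuffle.io.backLog": 8192,
}

if __name__ == "__main__":
    print(" ".join(spark_conf_flags(RELIABILITY_SETTINGS)))
```

Keeping the settings in one dictionary makes it easy to review what differs from the defaults before submitting a job of this size.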

Performance Optimizations

Fixed a memory leak in the sorter (SPARK‑14363), gaining ~30% CPU performance.

Optimized Snappy compression by replacing a per‑row JNI array‑copy call with the non‑JNI System.arraycopy, improving CPU performance by ~10% (SPARK‑14277).

Reduced shuffle write latency by no longer opening and closing the same file for each partition on the map side (SPARK‑5581), yielding up to ~50% speedup.

Prevented duplicate task runs after fetch failures (SPARK‑14649), improving stability and performance.

Made PipedRDD buffer size configurable (SPARK‑14542), gaining ~10% speedup.

Cached shuffle index files (SPARK‑15074), cutting shuffle fetch time by ~50%.

Reduced the update frequency of the shuffle‑bytes‑written metric (SPARK‑15569), yielding up to ~20% speedup.

Made the sorter's initial buffer size configurable (SPARK‑15958) and raised it to 64 MB, achieving up to ~5% speedup.

Adjusted number of tasks by configuring input split size to 2 GB, reducing task count eightfold (SPARK‑???).
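The eightfold reduction in task count follows from simple arithmetic on the split size. The sketch below assumes one input task per split, the 60 TB input mentioned earlier, and an original split size of 256 MB. The 256 MB figure is an assumption inferred from the eightfold ratio, not a number stated in this article.

```python
TB = 1024 ** 4
GB = 1024 ** 3
MB = 1024 ** 2


def task_count(input_bytes, split_bytes):
    """Number of input tasks when each task reads exactly one split."""
    # Integer ceiling division avoids floating-point rounding on large sizes.
    return -(-input_bytes // split_bytes)


input_size = 60 * TB

tasks_before = task_count(input_size, 256 * MB)  # assumed original split size
tasks_after = task_count(input_size, 2 * GB)     # split size from the text

print(tasks_before)                  # 245760, close to the 250,000 tasks cited
print(tasks_after)                   # 30720
print(tasks_before // tasks_after)   # 8
```

Fewer, larger tasks mean less scheduling work for the driver and fewer output files, at the cost of more data per task.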

Performance Comparison

The team measured CPU time, CPU reservation time, and latency for both the Hive and Spark runs. CPU time is the CPU actually consumed, while CPU reservation time is the amount of CPU reserved by the cluster scheduler, whether or not it is used. The Spark pipeline improved CPU time by 4.5–6×, reduced CPU reservation by 3–4×, and cut latency roughly fivefold.
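The difference between the two CPU metrics is easiest to see with a small calculation. The numbers below are purely illustrative (a 32-core machine reserved for 10 seconds at 50% utilization), not measurements from the Hive or Spark runs, and the function names are hypothetical.

```python
def cpu_reservation_seconds(cores_reserved, wall_seconds):
    """CPU reserved by the scheduler, regardless of how busy the cores are."""
    return cores_reserved * wall_seconds


def cpu_seconds(cores_reserved, wall_seconds, utilization):
    """CPU actually consumed, given average utilization between 0 and 1."""
    return cores_reserved * wall_seconds * utilization


def improvement(baseline, candidate):
    """How many times smaller the candidate metric is than the baseline."""
    return baseline / candidate


reserved = cpu_reservation_seconds(32, 10)  # 320 CPU-seconds reserved
used = cpu_seconds(32, 10, 0.5)             # 160.0 CPU-seconds consumed
print(used / reserved)                      # 0.5, the share of reserved CPU in use
```

The ratio of CPU time to CPU reservation time indicates how well a job uses what it reserves, which is why the two improvement factors reported above can differ.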

CPU time vs reservation
Latency comparison

Conclusion and Future Work

Facebook’s experience shows that a well‑engineered Spark job can reliably shuffle and sort over 90 TB of intermediate data, run 250 000 tasks in a single job, and deliver substantial performance gains over a legacy Hive pipeline. The reliability and performance improvements made for this entity‑ranking use case are applicable to other large‑scale data‑processing workloads, and the changes have been contributed back to the open‑source Apache Spark project.
