How Uber Upgraded Over 2 Million Spark Jobs from 2.4 to 3.3
Uber migrated more than two million daily Spark applications from version 2.4 to 3.3, detailing the motivations, architecture, four-step migration process, custom tools like Polyglot Piranha and Iron Dome, and the resulting performance, cost, and productivity gains.
Introduction
Uber runs over 2 million Spark applications each day, making it one of the largest Spark deployments worldwide. All workloads originally used Spark 2.4. This article explains how Uber migrated to Spark 3.3 and the automation built to support the move.
Why Upgrade to Spark 3.3
Leverage Spark 3.3 optimizations such as Adaptive Query Execution and Dynamic Partition Pruning to improve efficiency and cut costs.
Address known CVEs and improve security.
Benefit from Koalas™ (pandas on PySpark) to boost developer productivity.
Adopt additional optimizations like Apache Gluten™ and Velox.
Stay aligned with the latest open‑source contributions.
Architecture Overview
Spark jobs at Uber are written in Java®, Scala, or Python and are scheduled on either Apache Hadoop® YARN or Kubernetes®. Jobs run either as part of scheduled Piper pipelines or via ad‑hoc CLI/script execution. Code resides in Uber’s monorepo or micro‑repos.
Submission is handled by Drogon, an Apache Livy™ proxy that integrates with Jupyter® notebooks, PySpark, Java applications, and Spark Shell. Jobs are submitted to YARN or Kubernetes based on user preference. Uber also uses an internal shuffle manager, Zeus, with fallback to an external shuffle service (ESS).
Four‑Step Migration Process
Prepare Binaries – Synchronize Uber’s custom Spark branch with the upstream open‑source version and incorporate internal extensions such as task‑level event listeners, column‑level data lineage, and the Zeus shuffle manager.
Achieve Ecosystem Compatibility – Resolve dependency gaps in the Spark ecosystem (e.g., libraries, connectors) before moving to the new version.
Adapt Source Code – Use Uber’s open‑source tool Polyglot Piranha to parse source code into an abstract syntax tree (AST), apply structured search rules, and automatically rewrite patterns that need changes for Spark 3.
Data Validation & Shadow Testing – Run extensive validation and shadow tests to compare behavior between Spark 2.4 and 3.3.
Polyglot Piranha Workflow
Parse source code into an AST and locate target patterns with structured search rules.
Apply a predefined set of transformation rules once patterns are detected.
Define additional rules to adapt code to Spark 3 semantics.
Example: the tool automatically adds legacy configuration required by Spark 3 during SparkConf initialization.
Validation Challenges
More than 40 000 Spark applications make decentralized validation infeasible.
No pre‑production environment; running “dry‑run” jobs in production could affect live data.
Lack of existing test cases and baseline data for validation.
Iron Dome Framework
Because open‑source tools could not meet Uber’s validation needs, the team built Iron Dome , which provides a safe sandbox for shadow testing.
Secure Rewrite – Intercept Spark’s Catalog API and Hadoop’s File Output Committer to rewrite production paths (e.g., /db/tbl → /stgdb/tbl) during testing.
Protection Mechanism – Add safeguards at the Hadoop FileSystem layer to prevent accidental writes to production locations.
Verification Infrastructure – Capture tables and paths accessed by a job and emit telemetry to a message queue for comparison with production results.
Task Orchestration – Use Uber’s Cadence workflow engine to automate shadow testing, data verification, and migration tagging for all jobs.
Results and Impact
85% migration rate – Over 20 000 tasks were migrated to Spark 3 within six months using Iron Dome and automated orchestration.
Performance gains – More than 60% of tasks saw efficiency improvements exceeding 10%, saving millions of dollars in compute costs.
Productivity boost – Automation eliminated manual migration effort, saving thousands of engineer hours.
The migration also opened the path for future upgrades, such as support for Kubernetes, JDK 17, and the upcoming Spark 4 release. Uber has open‑sourced several patches and plans to apply the same automation framework to other infrastructure upgrades.
Conclusion
Uber successfully upgraded 100% of its Spark applications to version 3.3, cutting overall runtime and resource consumption by 50%. The custom tools and processes developed for this effort—Polyglot Piranha, Iron Dome, and Cadence‑driven orchestration—provide a reusable blueprint for large‑scale Spark version upgrades.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
Past Memory Big Data
A popular big-data architecture channel with over 100,000 developers. Publishes articles on Spark, Hadoop, Flink, Kafka and more. Visit the Past Memory Big Data blog at https://www.iteblog.com. Search "Past Memory" on Google or Baidu.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
