How Didi Boosted SQL Performance by 40%: Migrating 10k Hive Jobs to Spark
Didi migrated over 10,000 Hive SQL tasks to Spark SQL, achieving 85% Spark task share, cutting execution time by 40%, and reducing CPU and memory usage by 21% and 49% respectively, through a systematic migration process that addressed syntax, UDF, performance, and functional differences between the two engines.
Background and Motivation
Didi has been using Hive SQL for its data warehouse workloads while Spark has been adopted mainly for data mining and machine learning. As Hive SQL tasks grew to dominate the offline workload, several problems emerged: slow execution (average 20 minutes per task), unstable HiveServer2 processes, and duplicated effort because two engines required separate manpower.
To achieve faster, more stable SQL processing and consolidate engineering resources, Didi decided to migrate Hive SQL jobs to Spark SQL.
Migration Design Overview
The migration had three core requirements: (1) data consistency – identical results from Hive and Spark for the same query; (2) resource savings – Spark should reduce execution time, CPU, and memory usage; (3) transparency – the migration should be invisible to end users.
Two implementation options were evaluated:
Option 1: Deploy a parallel SQL execution system isolated from production, synchronizing metadata and converting Hive jobs to Spark jobs.
Option 2: Build a “SQL dual‑run” tool that collects, rewrites, and executes each SQL on both Hive and Spark for comparison.
After weighing pros and cons (resource overhead and deployment complexity for Option 1 versus lightweight development for Option 2), the team chose Option 2.
Detailed Migration Design
1. Hive SQL Extraction
Implemented a custom HiveHistoryImpl to record every SQL statement executed in a Hive session into daily local files on the HiveServer2 host. A periodic job uploads these files to HDFS. A HiveHistoryParser then parses the files, deduplicates SQLs, merges sessions belonging to the same shell task, and produces raw SQL files together with metadata (task name, execution ID, user, etc.).
2. SQL Rewriting & Dual‑Run
The rewriting step scans each raw SQL file with Spark’s SessionState to detect INSERT OVERWRITE or CREATE TABLE AS SELECT statements, extracts target table names, and creates two identical test tables (one for Hive, one for Spark) in a dedicated test database. The original table names are replaced with the test tables, producing two rewritten SQL files per original query.
Both rewritten queries are then executed concurrently on Hive and Spark. Execution metadata – application IDs, start/end timestamps, and resource consumption – is recorded.
3. Result Comparison
For each dual‑run record the system compares:
Row counts of the output tables.
Number and size of generated HDFS files.
Column‑wise data equality; for numeric columns it computes the maximum precision difference.
vCore and memory usage of the Spark and Hive applications.
The comparison outcome classifies each task into categories such as “Migratable”, “Experience‑Migratable” (differences are harmless, e.g., order of collect_set), “Data Inconsistent”, or performance‑related flags like Time_High, Cpu_High, Memory_High, Files_High, Syntax_Incompatible, and Runtime_Error.
4. Migration Execution
The migration workflow consists of:
Collecting the list of tasks to be migrated and their configuration.
Calling DataStudio APIs to change the task type from Hive to Spark.
Re‑running the tasks.
5. Issue Diagnosis & Fixes
Tasks classified as “Migratable” or “Experience‑Migratable” are migrated directly. Others require investigation – the most time‑consuming part of the migration. After fixing issues, the extraction‑rewrite‑dual‑run cycle is repeated until the task meets migration criteria.
Engine Differences
1. Syntax Gaps
Several Hive constructs fail to parse in Spark (e.g., certain CREATE TABLE variants). Some have been fixed upstream; others remain open issues.
2. UDF Behaviour
Differences arise from input order, handling of exceptional values, and execution environment. Examples include nondeterministic ordering of collect_set and collect_list, floating‑point precision variations in sum(double), and null returns for invalid dates in Spark versus Hive.
These nondeterministic differences are considered harmless and classified as “Experience‑Migratable”. For deterministic mismatches (e.g., datediff on illegal dates), Spark’s null result is more correct, but Didi provides a compatibility layer allowing users to choose Hive‑style behaviour.
3. UDF Execution Model
Hive runs a UDF in a single‑threaded MR task, while Spark may execute the same UDF concurrently across many executors, exposing thread‑safety issues. The team demonstrated a non‑thread‑safe UDF that shares a static variable and provided two remedies: adding explicit locks or removing static state.
4. Performance & Feature Gaps
Small‑file merging: Hive can merge small output files via a secondary MR job; Spark lacked this feature, leading to many more output files. Didi contributed a small‑file‑merge implementation to Spark’s InsertIntoHiveTable.
Cluster mode support: Hive jobs run via beeline (client‑only), while Spark originally only supported client mode, causing driver resource hotspots. Didi added a non‑interactive cluster mode for Spark SQL.
Partition pruning: Hive can prune partitions using complex expressions (e.g., concat, substr). Spark’s pruning was limited to simple predicates, causing full‑partition scans and MySQL metadata pressure. Didi extended Spark to support combined concat/concat_ws and substr pruning, covering >90% of production cases.
Migration Results
After more than six months, over 10,000 Hive SQL tasks were migrated to Spark. Execution time per task dropped from >1000 seconds to ~600 seconds. Spark now accounts for 85% of SQL workload, with overall SQL runtime reduced by 40%, CPU usage by 21%, and memory usage by 49%.
Future Plans
The next steps include migrating shell‑type tasks that still invoke Hive, optimizing Spark’s External Shuffle Service to reduce fetch retries, and upgrading the production stack to Spark 3.x to benefit from the latest performance and feature improvements.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
