Migrating Hive SQL to Spark SQL: Design, Implementation, and Performance Evaluation at DiDi

DiDi migrated more than 10,000 Hive SQL tasks to Spark SQL using a lightweight dual‑run pipeline that extracts, rewrites, compares, and switches tasks. Along the way the team fixed syntax and UDF differences and added features such as small‑file merging and enhanced partition pruning. Spark SQL now handles 85% of SQL workloads, with 40% shorter execution time, 21% lower CPU usage, and 49% lower memory usage.

Didi Tech

Jan 25, 2021

DiDi began using Spark in 2015 mainly for data mining and machine learning, while Hive remained the primary engine for data warehouse SQL. To improve performance and stability, the team decided to migrate Hive SQL tasks to Spark SQL.

Three problems motivated the migration: slow Hive SQL execution (20 minutes per task on average), unstable HiveServer2 processes, and the duplicated effort of maintaining two engines. The goals were to increase the share of tasks running on Spark SQL, reduce execution time, and save compute resources.

Two migration approaches were evaluated: (1) deploying a separate SQL execution system for Spark, isolated from production, and (2) building a dual‑run tool that executes the same SQL on both Hive and Spark. The second approach was chosen because it is lightweight and does not require additional physical resources.

The migration pipeline consists of four stages:

Hive SQL extraction: modify HiveHistoryImpl to store all SQL statements per session, upload the daily logs to HDFS, and parse them with a HiveHistoryParser that deduplicates and merges the SQL files.

SQL rewrite & dual‑run: analyze each SQL statement with Spark’s SessionState to detect INSERT OVERWRITE or CREATE TABLE AS SELECT, create corresponding test tables, rewrite the target table names, and generate two versions of each statement (one for Hive, one for Spark).

Result comparison: run both versions concurrently, record application IDs, execution times, and resource usage, then compare the output tables on row count, file count and size, and column values. Each result is classified into a category such as “Migratable”, “Experience‑migratable”, “Data inconsistent”, “Time‑high”, “CPU‑high”, “Memory‑high”, “Files‑high”, syntax incompatibility, or runtime exception.

Migration: for tasks classified as migratable, change the task type to SparkSQL through the DataStudio API and re‑run them.
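The rewrite and comparison stages can be illustrated with a minimal Python sketch. It is not DiDi's tool: a regular expression stands in for the SessionState‑based analysis, and the `dualrun` database name, the `RunStats` fields, the "Spark must not exceed Hive on any metric" rule, and the order of the checks are all assumptions made for illustration. Only the category names come from the article.

```python
import re
from dataclasses import dataclass
from typing import Optional, Tuple

# Regex stand-in (assumption) for parser-based detection: captures the write
# target of INSERT OVERWRITE TABLE or CREATE TABLE ... AS SELECT.
_WRITE_TARGET = re.compile(
    r"^(\s*(?:INSERT\s+OVERWRITE\s+TABLE|CREATE\s+TABLE(?:\s+IF\s+NOT\s+EXISTS)?)\s+)([\w.]+)",
    re.IGNORECASE,
)
_AS_SELECT = re.compile(r"\bAS\s+SELECT\b", re.IGNORECASE)


def rewrite_for_dual_run(sql: str, test_db: str = "dualrun") -> Optional[Tuple[str, str]]:
    """Return (hive_sql, spark_sql) that write to per-engine test tables.

    Returns None for statements that do not write a table.
    """
    match = _WRITE_TARGET.match(sql)
    if match is None:
        return None
    is_create = match.group(1).strip().upper().startswith("CREATE")
    if is_create and not _AS_SELECT.search(sql):
        return None  # plain CREATE TABLE: no query output to compare
    prefix = match.group(1)
    table = match.group(2).replace(".", "__")  # hypothetical naming scheme
    rest = sql[match.end():]

    def redirect(engine: str) -> str:
        return f"{prefix}{test_db}.{table}_{engine}{rest}"

    return redirect("hive"), redirect("spark")


@dataclass
class RunStats:
    rows: int
    checksum: str       # digest of the output column values, computed elsewhere
    seconds: float
    cpu_seconds: float
    memory_gb: float
    files: int


def classify(hive: RunStats, spark: RunStats) -> str:
    """Map one dual-run result to a category name from the article."""
    if (hive.rows, hive.checksum) != (spark.rows, spark.checksum):
        return "Data inconsistent"
    # Assumed rule: Spark must not exceed Hive on any metric.
    checks = [
        ("Time-high", spark.seconds, hive.seconds),
        ("CPU-high", spark.cpu_seconds, hive.cpu_seconds),
        ("Memory-high", spark.memory_gb, hive.memory_gb),
        ("Files-high", spark.files, hive.files),
    ]
    for label, spark_value, hive_value in checks:
        if spark_value > hive_value:
            return label
    return "Migratable"
```

Writing each engine's output to its own test table keeps the dual run away from production data, and comparing a digest rather than individual rows keeps the check cheap. A real tool would also need to handle forms this regex misses, such as multi‑insert statements.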

During the process, numerous engine differences were discovered:

Syntax differences: certain Hive constructs cause parsing errors in Spark; many have been fixed upstream.

UDF differences: functions like collect_set, collect_list, date/time functions, and floating‑point aggregations produce different results due to ordering or null handling. The team added compatibility layers and configuration switches.

UDF execution environment: Hive runs UDFs in a single‑process task, while Spark may run them concurrently across executors, requiring thread‑safe implementations (e.g., adding locks or removing static state).

Performance & feature gaps: Hive supports small‑file merging and client‑mode execution, while Spark lacked these. The team implemented a small‑file merge feature in Spark and added a cluster‑mode driver to avoid resource hotspots.

Partition pruning: Hive can prune partitions using complex expressions (concat, substr, etc.), whereas Spark originally only supported simple predicates. The team extended Spark’s partition pruning to handle these cases, covering more than 90% of production workloads.
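The idea behind pruning on complex expressions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Spark's optimizer code: the expression helpers (`col`, `lit`, `concat`, `substr`, `equals`) and the dictionary representation of a partition are hypothetical. The point is that a predicate built only from partition columns can be evaluated against the partition values themselves, before any data file is read.

```python
from typing import Callable, Dict, List

Partition = Dict[str, str]             # partition column name -> value
Expr = Callable[[Partition], str]      # expression evaluated against one partition
Predicate = Callable[[Partition], bool]


def col(name: str) -> Expr:
    return lambda p: p[name]


def lit(value: str) -> Expr:
    return lambda p: value


def concat(*parts: Expr) -> Expr:
    return lambda p: "".join(part(p) for part in parts)


def substr(expr: Expr, start: int, length: int) -> Expr:
    # SQL substr positions are 1-based, Python slices are 0-based.
    return lambda p: expr(p)[start - 1:start - 1 + length]


def equals(left: Expr, right: Expr) -> Predicate:
    return lambda p: left(p) == right(p)


def prune(partitions: List[Partition], predicate: Predicate) -> List[Partition]:
    """Keep only the partitions whose column values satisfy the predicate."""
    return [p for p in partitions if predicate(p)]


partitions = [
    {"year": "2021", "month": "01", "day": "24"},
    {"year": "2021", "month": "01", "day": "25"},
    {"year": "2021", "month": "02", "day": "01"},
]

# WHERE concat(year, '-', month) = '2021-01'
january = prune(partitions, equals(concat(col("year"), lit("-"), col("month")), lit("2021-01")))
```

Here `january` holds only the two January partitions, so a scan would skip the February directory entirely. With simple‑predicate pruning only, a filter of this shape would fall back to scanning every partition.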

After six months, more than 10,000 Hive tasks had been migrated. Spark SQL now accounts for 85% of the SQL workload, with a 40% reduction in execution time, 21% less CPU, and 49% less memory usage. The team plans to migrate shell‑type Hive tasks, improve the Spark External Shuffle Service, and upgrade to Spark 3.x.

The article also includes a recruitment call for big‑data engineers (Flink, ClickHouse, ElasticSearch, HDFS, Presto, etc.) at DiDi.
