Accelerating iQIYI Big Data Platform: Migrating from Hive to Spark SQL
iQIYI accelerated its big-data platform by migrating the OLAP layer from Hive to Spark SQL, achieving a 67% speedup, a 50% reduction in CPU usage and 44% memory savings, while automatically converting tens of thousands of tasks and delivering faster analytics for advertising, BI, membership and user-growth services.
Since 2012, iQIYI has built a comprehensive big-data platform covering data collection, processing, analysis and application, supporting the company's operational decisions and a range of data-intelligence services. As data volumes grow and computations become more complex, quickly extracting the value hidden in data has become a major challenge.
Starting in 2020, the big-data team ran an acceleration project to meet real-time analysis demands. By switching the OLAP layer from Hive to Spark SQL, task execution speed increased by 67% and resource consumption dropped by 50%, bringing efficiency gains to the BI, advertising, membership and user-growth business lines.
Background
In the early stage the platform was built on the Hadoop ecosystem, using Hive as the primary offline analysis engine. Hive provides a SQL interface on top of HDFS, but its performance is relatively slow, especially for large‑scale, complex queries. As new latency‑sensitive services such as ad‑bid optimization, feed recommendation, real‑time membership operations and user‑growth analytics emerged, Hive‑only offline analysis could no longer meet timeliness requirements. Although faster OLAP engines such as Trino and ClickHouse were introduced, they still depend on the Hive‑based data warehouse and upstream cleaning processes, making it essential to improve Hive’s processing performance.
Solution Selection
The team evaluated several alternatives—Hive on Tez, Hive on Spark and Spark SQL—across compatibility, performance, stability and migration cost. The final choice was Spark SQL.
Hive on Tez: retains Hive SQL syntax and offers a pluggable execution engine that reduces intermediate-data I/O, but it suffers from poor parallelism on large datasets, limited community activity and high operational overhead when failures occur.
Hive on Spark: also allows a seamless switch of the execution engine, yet it supports only Spark 2.3 and earlier, delivers sub-optimal performance because SQL is still parsed and optimized by Hive's Calcite-based planner, and faces community inactivity and inflexible resource allocation.
Spark SQL: provides Hive-compatible SQL, shares the Hive Metastore, and runs queries as native Spark jobs with in-memory computation, resulting in far less disk I/O and higher execution efficiency. It can smoothly replace most existing Hive tasks.
The comparison table (shown in the original article) demonstrates that Spark SQL best fits the current scenario.
Technical Transformation
Migrating from Hive to Spark SQL involves several challenges: compatibility adjustments, SQL syntax changes, data‑consistency guarantees, system integration and dependency refactoring.
Spark compatibility modifications:
UDF thread safety – wrapped SimpleDateFormat usage in ThreadLocal, since the class is not thread-safe and Spark runs multiple task threads per executor.
grouping_id support – added automatic translation from Hive's grouping_id to Spark's grouping_id().
Parameter mapping – aligned Hive-specific parameters with their Spark equivalents.
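The SimpleDateFormat fix follows a standard thread-local pattern: the actual change was in Java UDFs, wrapping java.text.SimpleDateFormat in java.lang.ThreadLocal. A minimal Python sketch of the same idea (the StatefulFormatter class and function names here are hypothetical stand-ins):

```python
import threading
from datetime import datetime

class StatefulFormatter:
    """Stand-in for a stateful, non-thread-safe formatter like
    Java's SimpleDateFormat."""
    def __init__(self, pattern="%Y-%m-%d"):
        self.pattern = pattern

    def format(self, ts):
        return ts.strftime(self.pattern)

# threading.local() plays the role of java.lang.ThreadLocal: each
# worker thread lazily builds its own private formatter instance,
# so concurrent tasks never share mutable formatter state.
_local = threading.local()

def format_date(ts):
    if not hasattr(_local, "fmt"):
        _local.fmt = StatefulFormatter()
    return _local.fmt.format(ts)
```

The same pattern applies wherever a per-call object is too expensive to allocate but a shared one is unsafe under Spark's multi-threaded executors.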
Enabling new Spark features and configuration optimizations:
Dynamic Resource Allocation (DRA) – executors are requested and released automatically based on runtime demand, reducing idle resources.
Adaptive Query Execution (AQE) – collects runtime statistics to dynamically merge small shuffle partitions, choose optimal join strategies and handle data skew.
Automatic small-file merging – inserts a Rebalance operator that, combined with AQE, coalesces small partitions and splits large ones.
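These features map to a handful of standard Spark 3.x settings. A sketch of what a spark-defaults.conf enabling DRA and AQE might look like (values are illustrative, not iQIYI's production tuning):

```properties
# Dynamic Resource Allocation: scale executors with runtime demand
spark.dynamicAllocation.enabled                   true
spark.dynamicAllocation.shuffleTracking.enabled   true
spark.dynamicAllocation.minExecutors              1
spark.dynamicAllocation.maxExecutors              200
spark.dynamicAllocation.executorIdleTimeout       60s

# Adaptive Query Execution: runtime statistics drive partition
# coalescing, join-strategy selection and skew-join splitting
spark.sql.adaptive.enabled                        true
spark.sql.adaptive.coalescePartitions.enabled     true
spark.sql.adaptive.skewJoin.enabled               true
```

The min/max executor bounds and idle timeout are workload-dependent knobs; shuffle tracking lets DRA release executors without an external shuffle service.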
Spark architecture improvements:
Replaced the single-user Spark ThriftServer with Apache Kyuubi, giving each user an independent SparkSession with multi-tenant isolation, queue and resource isolation, and service-oriented capabilities.
Label-based configuration – predefined tags trigger engine settings tailored to ad-hoc queries, ETL jobs and other workloads.
Concurrency limits – per-user and per-IP connection throttling protects the service.
Event collection – execution events are exposed for SQL auditing, anomaly analysis and optimization feedback.
Automated Migration Tool
To avoid manually converting tens of thousands of Hive tasks, the team built an automated migration engine on top of the internally developed Pilot SQL platform. The workflow includes:
Collect Hive task metadata (SQL, queue, workflow name) via Pilot.
Parse Hive SQL with SparkParser to extract input/output tables.
Create mapping tables for dual‑run outputs, keeping them isolated from production tables.
Replace the execution engine of the dual‑run tasks with Spark SQL.
Run both Hive and Spark versions of the task and write results to the mapping tables.
Perform consistency checks by comparing row counts and CRC32 checksums (including handling of collection‑type fields and floating‑point precision).
If Spark SQL fails, automatically fall back to Hive to guarantee task completion.
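The consistency check in step 6 can be sketched as follows — a minimal illustration, not the Pilot platform's implementation, and all function names are hypothetical. Summing per-row CRC32 values makes the comparison order-independent, while floats and collection-type fields are normalized before hashing:

```python
import zlib

def row_fingerprint(row, float_digits=4):
    # Normalize values so Hive and Spark outputs hash identically:
    # round floats (engines may differ in low-order bits) and sort
    # collection-type fields (element order is not guaranteed).
    parts = []
    for v in row:
        if isinstance(v, float):
            v = round(v, float_digits)
        elif isinstance(v, (list, set)):
            v = sorted(v)
        parts.append(str(v))
    return zlib.crc32("\x01".join(parts).encode("utf-8"))

def table_checksum(rows):
    # Summing per-row CRC32s makes the aggregate independent of row
    # order, so the two result sets need not be sorted before comparing.
    return (len(rows), sum(row_fingerprint(r) for r in rows))

def consistent(hive_rows, spark_rows):
    return table_checksum(hive_rows) == table_checksum(spark_rows)
```

In production this aggregation would run inside the engines themselves (e.g. as a SQL aggregate over the dual-run output tables) rather than by pulling rows to a driver.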
Images in the original article illustrate the dual‑run architecture and checksum calculation.
Migration Effects
About 90% of Hive tasks have now been smoothly migrated to Spark SQL, yielding a 67% performance boost, a 50% reduction in CPU usage and a 44% reduction in memory consumption. Business-level impacts include:
Advertising: offline task performance up ~38%, resources saved 30%, efficiency up 20%.
BI: total execution time down 79%, resources saved 43%, P0 reports delivered 30-60 minutes earlier.
User growth: data production 2 hours earlier, enabling core reports before 10 am.
Membership: order data produced 8 hours earlier, analysis speed up more than 10×.
iQIYI account: average execution time reduced 40%, saving ~100 hours per day.
Future Plans
Upgrade the migration tool: add automatic error extraction, root-cause tagging and automatic rewriting of incompatible Hive SQL.
Engine optimizations: address storage bloat from small-file repartitioning, mitigate slow parsing induced by dynamic partition pruning (DPP), and enrich Spark execution metrics (shuffle size, data skew, data expansion, etc.).
Simulation testing engine: evolve the dual-run capability into a standalone service for generic data-replay testing during version upgrades, parameter tuning and cluster migrations.
iQIYI Technical Product Team