Applying Spark SQL at Ping An Insurance: Business Background, Deployment Choices, Migration Process, and Lessons Learned
This article details how Ping An Insurance migrated its offline Hive SQL workloads to Spark SQL, covering business background, deployment mode selection, migration workflow, typical challenges, optimization measures, and the resulting performance and resource utilization improvements.
In this technical sharing, Li Weixuan, a big data development engineer at China Ping An, presents the application of Spark SQL within Ping An Property & Casualty insurance.
Business background: During peak periods the offline cluster suffers from CPU saturation while memory usage remains low, leading to resource inefficiency for high‑concurrency Hive SQL tasks that process hundreds of gigabytes to tens of billions of rows.
Business requirements: Reduce cluster resource consumption, improve resource utilization, and accelerate job execution while maintaining data freshness (T+1).
Engine comparison: Hive offers high stability but poor timeliness and low memory utilization; Spark SQL improves timeliness by eliminating redundant HDFS reads and MapReduce stages, achieving 70‑80% performance gains with only 30‑40% of Hive's CPU resources, though it is less stable for very large jobs; Presto provides low‑latency, in‑memory queries but demands high memory and is best suited to interactive workloads.
Implementation plan: Switch eligible Hive SQL tasks (data size < 60 billion rows / < 500 GB, simple SQL, low stability requirements) to Spark SQL, and retain all other tasks in Hive.
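The routing rule above can be sketched as a small predicate. This is an illustrative reconstruction of the criteria stated in the article; the `TaskProfile` fields and function names are hypothetical, not Ping An's actual tooling.

```python
# Hypothetical sketch of the engine-routing rule described above.
from dataclasses import dataclass

@dataclass
class TaskProfile:
    rows: int                 # estimated input rows
    size_gb: float            # estimated input size in GB
    simple_sql: bool          # no Hive-only syntax, no exotic features
    stability_critical: bool  # job must never fail, even at large scale

def target_engine(t: TaskProfile) -> str:
    """Route a Hive SQL task per the article's criteria:
    < 60 billion rows, < 500 GB, simple SQL, low stability requirement."""
    if (t.rows < 60_000_000_000 and t.size_gb < 500
            and t.simple_sql and not t.stability_critical):
        return "spark-sql"
    return "hive"

print(target_engine(TaskProfile(2_000_000_000, 120.0, True, False)))   # spark-sql
print(target_engine(TaskProfile(80_000_000_000, 900.0, True, False)))  # hive
```

In practice the profile values would come from historical job statistics rather than being declared by hand.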
Deployment mode selection:
ThriftServer mode – a single long‑running application shared by all SQL statements; high resource utilization but weak isolation and no HA.
Spark SQL mode – each SQL statement launches its own application, offering isolation and higher stability at the cost of lower utilization and per‑job startup overhead.
Kyuubi mode – an HA‑enabled ThriftServer variant with multi‑tenant support, providing better stability and isolation.
Ping An adopts Spark SQL mode for large batch jobs and Kyuubi mode for high‑frequency reporting and development queries.
Migration workflow:
Perform syntax compatibility analysis using an automated tool that flags incompatibilities and suggests fixes.
Run test jobs and compare results with production to verify data consistency, resource usage, and timeliness.
Proceed to a gray deployment in which the Hive and Spark versions run in parallel for a week, collecting execution‑time, resource‑consumption, and error metrics.
Based on gray results, promote successful jobs to production.
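The data‑consistency check in step 2 can be sketched as an order‑insensitive comparison of the two jobs' outputs. This is an illustrative approach under stated assumptions, not the article's actual verification tool; in practice the rows would be read from the Hive and Spark result tables.

```python
# Illustrative consistency check: compare Hive output vs. Spark SQL test output.
import hashlib

def table_fingerprint(rows):
    """Order-insensitive fingerprint: row count plus an XOR of per-row hashes."""
    digest = 0
    for row in rows:
        h = hashlib.sha256("|".join(map(str, row)).encode()).hexdigest()
        digest ^= int(h[:16], 16)  # fold each row hash into one value
    return len(rows), digest

def consistent(hive_rows, spark_rows):
    return table_fingerprint(hive_rows) == table_fingerprint(spark_rows)

hive_out  = [(1, "A", 10.5), (2, "B", 7.0)]
spark_out = [(2, "B", 7.0), (1, "A", 10.5)]  # same rows, different order
print(consistent(hive_out, spark_out))  # True
```

XOR folding makes the comparison insensitive to row order, which matters because the two engines generally do not produce rows in the same order.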
Typical problems and solutions:
ThriftServer stability – increase driver memory, lower broadcast‑join thresholds, split services by workload, return results in batches, cap large result sets, and tune concurrency.
Parallelism tuning – configure spark.sql.shuffle.partitions and spark.default.parallelism appropriately for the workload.
Communication and retry parameters – reduce retry counts, increase timeouts, and adjust heartbeat intervals to avoid long‑running failed tasks.
Resource estimation – size executors based on node manager CPU, calculate executor count from task profiling, and set driver memory according to expected result set size.
Broadcast join thresholds – adjust spark.sql.autoBroadcastJoinThreshold to balance memory usage against join performance.
Compatibility issues – handle insert‑overwrite of a table into itself by disabling Hive metastore table conversion or by using data‑lake formats such as Hudi or Iceberg, and set spark.sql.storeAssignmentPolicy=LEGACY for permissive type casting.
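The tuning knobs in the list above map onto standard Spark configuration keys. The fragment below is a sketch of a spark-defaults.conf; the values are illustrative placeholders, not Ping An's production settings, and should be tuned per workload.

```properties
# spark-defaults.conf sketch; values are illustrative, tune per workload
spark.sql.shuffle.partitions          400
spark.default.parallelism             400
# 10 MB broadcast-join threshold; set to -1 to disable broadcast joins
spark.sql.autoBroadcastJoinThreshold  10485760
# permissive Hive-style type casting on INSERT
spark.sql.storeAssignmentPolicy       LEGACY
# longer timeouts and fewer retries so failing tasks surface quickly
spark.network.timeout                 600s
spark.executor.heartbeatInterval      60s
spark.task.maxFailures                2
# driver memory sized to the expected result set; executors to node CPU
spark.driver.memory                   8g
spark.executor.memory                 8g
spark.executor.cores                  4
```

Keeping these in spark-defaults.conf gives every migrated job a sane baseline that individual jobs can still override with `--conf` at submit time.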
Migration results: Queue resource usage dropped by 30‑60%; Spark SQL mode achieved a 65‑90% performance improvement, while ThriftServer mode saw 70‑90% gains. Overall, the migration significantly reduced CPU core requirements and accelerated job execution.
Thank you for listening.
DataFunSummit