Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform
This article shares Xiaomi's real‑world challenges and solutions in building a new Spark 3.1‑based data platform, covering its Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and a future roadmap for vectorized execution.
Apache Spark is a widely used offline big‑data engine. Xiaomi built its next‑generation one‑stop data development platform on Spark 3.1, and this article walks through the problems encountered, and the solutions adopted, during job migration, performance tuning, and stability improvement.
1. Multiple Catalog implementation – Introduced a unified catalog based on Metacat to manage heterogeneous data sources (Hive, MySQL, Kudu, Doris, Iceberg) without Hive‑only limitations, enabling federated queries across multiple sources in a single SQL statement.
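In Spark 3, extra catalogs of this kind are registered through the DataSourceV2 catalog plugin API. The fragment below is an illustrative sketch using the stock Iceberg and JDBC catalog classes; Xiaomi's Metacat‑backed catalog is internal, but it would plug in the same way with its own implementation class, and the catalog names (`iceberg_cat`, `mysql_cat`) and connection details here are made up for the example.

```properties
# Each "spark.sql.catalog.<name>" key registers a catalog plugin class.
spark.sql.catalog.iceberg_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_cat.type=hive

# Spark 3.1 ships a JDBC catalog that exposes a MySQL database as a catalog.
spark.sql.catalog.mysql_cat=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog
spark.sql.catalog.mysql_cat.url=jdbc:mysql://mysql-host:3306/db
spark.sql.catalog.mysql_cat.driver=com.mysql.cj.jdbc.Driver
```

With both catalogs registered, a single federated statement can address them by three‑part name, e.g. `SELECT ... FROM iceberg_cat.db.fact f JOIN mysql_cat.db.dim d ON f.key = d.key`.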
2. Hive SQL to Spark SQL migration – Conducted a four‑step process (syntax check with Kyuubi PlanOnly, data consistency verification via double‑run, automated engine‑version upgrade, and post‑upgrade monitoring). Addressed schema mismatches, temporary view/function handling, and small‑file problems by adding repartition steps and configuring isolation.
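The double‑run consistency check above can be sketched as a result‑set comparison. This minimal Python sketch (function names hypothetical, not Xiaomi's actual tooling) compares row counts plus an order‑insensitive fingerprint of the outputs produced by the two engines:

```python
import hashlib

def table_fingerprint(rows):
    """Order-insensitive fingerprint of a result set: hash each row,
    then XOR the digests so row order does not affect the result."""
    acc = 0
    for row in rows:
        digest = hashlib.md5(repr(tuple(row)).encode()).hexdigest()
        acc ^= int(digest, 16)
    return acc

def consistent(hive_rows, spark_rows):
    """Double-run check: row counts and fingerprints must both match."""
    return (len(hive_rows) == len(spark_rows)
            and table_fingerprint(hive_rows) == table_fingerprint(spark_rows))
```

In practice the rows would come from the double‑run outputs written to isolated staging tables; XOR-combining per‑row hashes keeps the check cheap and insensitive to the nondeterministic row ordering that differs between engines.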
3. Batch automated upgrade – Upgraded jobs in batches, prioritized by task type (DDL before DML), business priority, runtime, and cluster size to limit risk; this raised Spark 3 SQL coverage from 51% to 90% with a 32% average efficiency gain.
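One plausible reading of that prioritization, sketched in Python (the `Job` fields and the ordering direction are assumptions, not confirmed by the talk): lower‑risk jobs go first, meaning DDL before DML, less critical before more critical, short‑running before long‑running.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    task_type: str    # "DDL" or "DML"
    priority: int     # business priority; higher = more critical
    runtime_min: int  # typical runtime in minutes

def upgrade_order(jobs):
    """Sort jobs so the lowest-risk upgrades happen first:
    DDL before DML, then by business priority, then by runtime."""
    type_rank = {"DDL": 0, "DML": 1}
    return sorted(jobs, key=lambda j: (type_rank[j.task_type],
                                       j.priority,
                                       j.runtime_min))
```

Batching then walks this ordered list, so a regression surfaces on a cheap, low‑impact job before any critical long‑running pipeline is touched.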
4. Performance optimization – Applied data‑skipping (partition, file, RowGroup, and column pruning) on Iceberg/Parquet, optimized predicate ordering, introduced page‑level min‑max indexes, and refined join‑strategy selection (BroadcastHashJoin, ShuffledHashJoin, SortMergeJoin) based on AQE metrics, yielding up to 93% speedup for filtered queries; overall, roughly 20% of SQL jobs saw a 38% performance improvement.
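The core of min/max data skipping, at RowGroup or page granularity, is a range‑overlap test against the statistics stored in the Parquet footer or column index. A minimal Python sketch of that logic (data structures are simplified stand‑ins for the real Parquet metadata):

```python
def can_skip(stats, col, lo, hi):
    """A RowGroup (or page) can be skipped when the predicate range
    [lo, hi] does not overlap the column's recorded [min, max] range."""
    col_min, col_max = stats[col]
    return hi < col_min or lo > col_max

def row_groups_to_read(row_groups, col, lo, hi):
    """Return the indices of RowGroups that might contain matching rows;
    everything else is pruned without being read from storage."""
    return [i for i, stats in enumerate(row_groups)
            if not can_skip(stats, col, lo, hi)]
```

The same test applied at page level (via the column index) is what makes the page‑level min‑max indexes mentioned above effective: for a highly selective filter, most pages fail the overlap check and are never decoded.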
5. Stability optimization – Resolved driver OOMs caused by loading large permission tables, replacing the per‑driver local Ranger cache with a centralized Ranger service and improving driver throughput from 68% to 95%.
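The architectural shift can be illustrated with a hypothetical Python sketch (class and callback names invented for the example): the old design materializes the full permission table in every driver, while the new one keeps drivers stateless and delegates each check to the shared service.

```python
class LocalRangerCache:
    """Old design: every driver pulls the full permission table,
    so driver memory grows with the total number of policies."""
    def __init__(self, fetch_all_policies):
        self._policies = set(fetch_all_policies())  # OOM risk at scale

    def allowed(self, user, table):
        return (user, table) in self._policies

class CentralRangerClient:
    """New design: drivers hold no policy state; each check is a
    lightweight call to the centralized Ranger service."""
    def __init__(self, remote_check):
        self._remote_check = remote_check  # e.g. an RPC/HTTP call

    def allowed(self, user, table):
        return self._remote_check(user, table)
```

The trade‑off is a network round trip per check instead of a local lookup, which the centralized service amortizes across all drivers; in exchange, driver heap usage no longer scales with the size of the permission table.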
6. Future roadmap – Planning to adopt a vectorized engine (Spark 3.3 + Gluten 0.5 + Velox) with expected TPC‑DS performance improvements of 43%‑69% and further user‑experience enhancements such as automatic job‑configuration tuning.
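Enabling Gluten on an existing Spark deployment is largely a configuration change. The fragment below is a sketch of the commonly documented settings for Gluten's 0.5‑era releases (the off‑heap size is an arbitrary example value, and exact keys should be checked against the Gluten version actually deployed):

```properties
# Load Gluten as a Spark driver/executor plugin.
spark.plugins=io.glutenproject.GlutenPlugin

# Velox operates on off-heap columnar memory, so off-heap must be enabled.
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=20g

# Columnar shuffle keeps data in columnar format across the shuffle boundary.
spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
```

Operators unsupported by the vectorized backend fall back to vanilla Spark row‑based execution, so the plugin can be rolled out incrementally ahead of the planned Spark 3.3 + Gluten + Velox stack.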
7. Q&A highlights – Discussed Spark federated query syntax, Iceberg vs. Hudi adoption, Hive‑to‑Spark compatibility issues, repartition and bucket functions in Iceberg, and configuration recommendations for Gluten.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.