Practical Experience and Solutions for Migrating and Optimizing Spark 3.1 in Xiaomi’s One‑Stop Data Development Platform
This article shares Xiaomi's real‑world challenges and solutions in building a new Spark 3.1‑based data platform, covering its Multiple Catalog implementation, Hive‑to‑Spark SQL migration, automated batch upgrades, performance and stability optimizations, and a future roadmap for vectorized execution.
Apache Spark is a widely used offline big‑data engine. Xiaomi built its next‑generation one‑stop data development platform on Spark 3.1, and this article walks through the problems encountered, and the solutions adopted, during job migration, performance tuning, and stability improvement.
1. Multiple Catalog implementation – Introduced a unified catalog based on Metacat to manage heterogeneous data sources (Hive, MySQL, Kudu, Doris, Iceberg) without Hive‑only limitations, enabling federated queries across multiple sources in a single SQL statement.
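In Spark 3, extra catalogs of this kind are registered through the DataSourceV2 catalog plugin API. The fragment below is an illustrative sketch using the stock Iceberg and JDBC catalog classes; Xiaomi's Metacat‑backed catalog is internal, but it would plug in the same way with its own implementation class, and the catalog names (`iceberg_cat`, `mysql_cat`) and connection details here are made up for the example.

```properties
# Each "spark.sql.catalog.<name>" key registers a catalog plugin class.
spark.sql.catalog.iceberg_cat=org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.iceberg_cat.type=hive

# Spark 3.1 ships a JDBC catalog that exposes a MySQL database as a catalog.
spark.sql.catalog.mysql_cat=org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog
spark.sql.catalog.mysql_cat.url=jdbc:mysql://mysql-host:3306/db
spark.sql.catalog.mysql_cat.driver=com.mysql.cj.jdbc.Driver
```

With both catalogs registered, a single federated statement can address them by three‑part name, e.g. `SELECT ... FROM iceberg_cat.db.fact f JOIN mysql_cat.db.dim d ON f.key = d.key`.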
2. Hive SQL to Spark SQL migration – Conducted a four‑step process (syntax check with Kyuubi PlanOnly, data consistency verification via double‑run, automated engine‑version upgrade, and post‑upgrade monitoring). Addressed schema mismatches, temporary view/function handling, and small‑file problems by adding repartition steps and configuring isolation.
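The double‑run consistency check above can be sketched as a result‑set comparison. This minimal Python sketch (function names hypothetical, not Xiaomi's actual tooling) compares row counts plus an order‑insensitive fingerprint of the outputs produced by the two engines:

```python
import hashlib

def table_fingerprint(rows):
    """Order-insensitive fingerprint of a result set: hash each row,
    then XOR the digests so row order does not affect the result."""
    acc = 0
    for row in rows:
        digest = hashlib.md5(repr(tuple(row)).encode()).hexdigest()
        acc ^= int(digest, 16)
    return acc

def consistent(hive_rows, spark_rows):
    """Double-run check: row counts and fingerprints must both match."""
    return (len(hive_rows) == len(spark_rows)
            and table_fingerprint(hive_rows) == table_fingerprint(spark_rows))
```

In practice the rows would come from the double‑run outputs written to isolated staging tables; XOR-combining per‑row hashes keeps the check cheap and insensitive to the nondeterministic row ordering that differs between engines.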
3. Batch automated upgrade – Upgraded jobs in batches, prioritized by task type (DDL before DML), business priority, runtime, and cluster size to limit risk; this raised Spark 3 SQL coverage from 51% to 90% with a 32% average efficiency gain.
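One plausible reading of that prioritization, sketched in Python (the `Job` fields and the ordering direction are assumptions, not confirmed by the talk): lower‑risk jobs go first, meaning DDL before DML, less critical before more critical, short‑running before long‑running.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    task_type: str    # "DDL" or "DML"
    priority: int     # business priority; higher = more critical
    runtime_min: int  # typical runtime in minutes

def upgrade_order(jobs):
    """Sort jobs so the lowest-risk upgrades happen first:
    DDL before DML, then by business priority, then by runtime."""
    type_rank = {"DDL": 0, "DML": 1}
    return sorted(jobs, key=lambda j: (type_rank[j.task_type],
                                       j.priority,
                                       j.runtime_min))
```

Batching then walks this ordered list, so a regression surfaces on a cheap, low‑impact job before any critical long‑running pipeline is touched.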
4. Performance optimization – Applied data‑skipping (partition, file, RowGroup, and column pruning) on Iceberg/Parquet, optimized predicate ordering, introduced page‑level min‑max indexes, and refined join‑strategy selection (BroadcastHashJoin, ShuffledHashJoin, SortMergeJoin) based on AQE metrics, yielding up to 93% speedup for filtered queries; overall, roughly 20% of SQL jobs saw a 38% performance improvement.
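The core of min/max data skipping, at RowGroup or page granularity, is a range‑overlap test against the statistics stored in the Parquet footer or column index. A minimal Python sketch of that logic (data structures are simplified stand‑ins for the real Parquet metadata):

```python
def can_skip(stats, col, lo, hi):
    """A RowGroup (or page) can be skipped when the predicate range
    [lo, hi] does not overlap the column's recorded [min, max] range."""
    col_min, col_max = stats[col]
    return hi < col_min or lo > col_max

def row_groups_to_read(row_groups, col, lo, hi):
    """Return the indices of RowGroups that might contain matching rows;
    everything else is pruned without being read from storage."""
    return [i for i, stats in enumerate(row_groups)
            if not can_skip(stats, col, lo, hi)]
```

The same test applied at page level (via the column index) is what makes the page‑level min‑max indexes mentioned above effective: for a highly selective filter, most pages fail the overlap check and are never decoded.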
5. Stability optimization – Resolved driver OOMs caused by loading large permission tables, replacing the per‑driver local Ranger cache with a centralized Ranger service and improving driver throughput from 68% to 95%.
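The architectural shift can be illustrated with a hypothetical Python sketch (class and callback names invented for the example): the old design materializes the full permission table in every driver, while the new one keeps drivers stateless and delegates each check to the shared service.

```python
class LocalRangerCache:
    """Old design: every driver pulls the full permission table,
    so driver memory grows with the total number of policies."""
    def __init__(self, fetch_all_policies):
        self._policies = set(fetch_all_policies())  # OOM risk at scale

    def allowed(self, user, table):
        return (user, table) in self._policies

class CentralRangerClient:
    """New design: drivers hold no policy state; each check is a
    lightweight call to the centralized Ranger service."""
    def __init__(self, remote_check):
        self._remote_check = remote_check  # e.g. an RPC/HTTP call

    def allowed(self, user, table):
        return self._remote_check(user, table)
```

The trade‑off is a network round trip per check instead of a local lookup, which the centralized service amortizes across all drivers; in exchange, driver heap usage no longer scales with the size of the permission table.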
6. Future roadmap – Planning to adopt a vectorized engine (Spark 3.3 + Gluten 0.5 + Velox) with expected TPC‑DS performance improvements of 43%‑69% and further user‑experience enhancements such as automatic job‑configuration tuning.
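Enabling Gluten on an existing Spark deployment is largely a configuration change. The fragment below is a sketch of the commonly documented settings for Gluten's 0.5‑era releases (the off‑heap size is an arbitrary example value, and exact keys should be checked against the Gluten version actually deployed):

```properties
# Load Gluten as a Spark driver/executor plugin.
spark.plugins=io.glutenproject.GlutenPlugin

# Velox operates on off-heap columnar memory, so off-heap must be enabled.
spark.memory.offHeap.enabled=true
spark.memory.offHeap.size=20g

# Columnar shuffle keeps data in columnar format across the shuffle boundary.
spark.shuffle.manager=org.apache.spark.shuffle.sort.ColumnarShuffleManager
```

Operators unsupported by the vectorized backend fall back to vanilla Spark row‑based execution, so the plugin can be rolled out incrementally ahead of the planned Spark 3.3 + Gluten + Velox stack.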
7. Q&A highlights – Discussed Spark federated query syntax, Iceberg vs. Hudi adoption, Hive‑to‑Spark compatibility issues, repartition and bucket functions in Iceberg, and configuration recommendations for Gluten.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.