
How Xiaomi Cuts Costs and Boosts Performance with Cloud‑Native Data Lake Architecture

Xiaomi's engineers explain how they tackled three data-lake challenges (small files, metadata latency, and multi-cloud costs) by combining compact storage, Gravitino-based metadata governance, the Iceberg and Paimon table formats, and a JuiceFS storage abstraction. The result is lower storage expenses, faster queries, and a roadmap toward intelligent, real-time, multimodal lakehouses.

DataFunSummit

At the 2024 DACon Digital Intelligence Conference in Beijing, Xiaomi storage R&D engineer Wang Chaohui and data lake R&D engineer Qi Houliang shared their practice of building a universal data lake architecture that reduces costs and improves efficiency.

DataFun: How does your experience with Apache IoTDB and TsFile inform your approach to handling small files and metadata in Xiaomi's data lake?

Qi Houliang: Time-series data is inherently streaming, and IoTDB provides millisecond-level latency, which is crucial for real-time alerts. We apply strategies such as in-memory buffering, batch flushing, and asynchronous background merging to solve the small-file problem systematically. TsFile's self-contained metadata design has a parallel in the layered metadata of Iceberg and Paimon, which enables query acceleration without relying on a centralized metadata service.
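The three strategies named above (in-memory buffering, batch flushing, and merging of small files) can be sketched in a few lines of Python. This is a minimal illustration, not Xiaomi's or IoTDB's implementation: the class `BufferedWriter`, the function `compact`, and the row thresholds are all hypothetical, "files" are modeled as plain lists, and the merge runs synchronously here where a real system would run it in the background.

```python
from dataclasses import dataclass, field


@dataclass
class BufferedWriter:
    """Buffers rows in memory and emits one 'file' per full batch."""
    flush_threshold: int = 1000                  # hypothetical rows-per-file target
    buffer: list = field(default_factory=list)
    files: list = field(default_factory=list)    # each file is modeled as a list of rows

    def write(self, row):
        self.buffer.append(row)
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Batch flush: write the whole buffer as one file instead of one file per row.
        if self.buffer:
            self.files.append(self.buffer)
            self.buffer = []


def compact(files, target_rows):
    """Merge adjacent small files until each output holds at least target_rows rows.

    The last output may be smaller. A real system would run this as an
    asynchronous background job; it is synchronous here for clarity.
    """
    merged, current = [], []
    for f in files:
        current.extend(f)
        if len(current) >= target_rows:
            merged.append(current)
            current = []
    if current:
        merged.append(current)
    return merged


writer = BufferedWriter(flush_threshold=3)
for i in range(10):
    writer.write(i)
writer.flush()                       # flush the trailing partial batch
small_files = writer.files           # 4 files: three of 3 rows, one of 1 row
compacted = compact(small_files, target_rows=6)
print(len(small_files), "files before,", len(compacted), "files after")
```

Buffering bounds how often a file is created at all, while compaction repairs the small files that still appear when a buffer is flushed early.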

DataFun: What compatibility issues did you encounter when migrating Hive workloads to Iceberg, and how does Gravitino help reduce migration cost?

Qi Houliang: For Hive tables whose downstream consumers mainly use SQL and have no HDFS directory or special file-format requirements, we recommend upgrading to Iceberg, and we have defined an SOP for that migration. For tables that need HDFS directory access or formats such as Parquet, JSON, CSV, Text, TFRecord, or SequenceFile, we migrate to Fileset, which preserves the original file layout and requires minimal changes. Both Iceberg and Fileset run on JuiceFS, which further lowers storage cost and gives teams an incentive to migrate.

Gravitino serves as a unified metadata governance layer. It manages the metadata of Iceberg, Fileset, and non-tabular assets behind a single access point and permission view, which greatly simplifies data management and reduces migration complexity.
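The routing rule in this answer (Iceberg for SQL-only tables, Fileset for tables with directory or raw-file dependencies) can be written as a small decision function. The sketch below is only an illustration of that rule: `HiveTable`, its fields, and `migration_target` are hypothetical names, not part of Gravitino, Iceberg, or Xiaomi's SOP.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HiveTable:
    name: str
    needs_hdfs_directory_access: bool          # downstream reads the table directory directly
    required_file_format: Optional[str] = None  # e.g. "tfrecord" if consumers read raw files


def migration_target(table: HiveTable) -> str:
    """Return 'fileset' when the physical file layout must be preserved, else 'iceberg'."""
    if table.needs_hdfs_directory_access or table.required_file_format is not None:
        return "fileset"
    return "iceberg"


tables = [
    HiveTable("dwd_orders", needs_hdfs_directory_access=False),
    HiveTable("training_samples", needs_hdfs_directory_access=False,
              required_file_format="tfrecord"),
    HiveTable("raw_logs", needs_hdfs_directory_access=True),
]
plan = {t.name: migration_target(t) for t in tables}
print(plan)
```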

DataFun: How do Iceberg and Paimon divide responsibilities in Xiaomi’s workloads, and what criteria guide the choice of format?

Qi Houliang: Iceberg handles large-scale, batch-oriented analytical workloads (the DWD and DWS warehouse layers) and benefits from a mature ecosystem. Paimon focuses on low-latency, high-frequency update scenarios such as real-time user profiles and CDC, thanks to its LSM-tree structure. The key decision factors are update frequency and latency requirements: hourly or daily batch updates favor Iceberg, while minute-level row-level updates, or changelog consumption with primary-key needs, favor Paimon.
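As a rough illustration, the selection criteria above can be expressed as a function of update interval and update style. The `Workload` fields, the `choose_format` name, and the 60-minute cutoff between "minute-level" and "hourly" are assumptions made for this sketch; the interview does not give an exact threshold.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    update_interval_minutes: float
    row_level_updates: bool = False       # upserts or deletes on individual rows
    changelog_consumption: bool = False   # downstream consumes a keyed changelog


def choose_format(w: Workload) -> str:
    """Pick a table format from update frequency and update style."""
    minute_level = w.update_interval_minutes < 60   # assumed cutoff
    needs_keyed_updates = w.row_level_updates or w.changelog_consumption
    if minute_level and needs_keyed_updates:
        return "paimon"
    return "iceberg"


daily_batch = Workload(update_interval_minutes=24 * 60)
user_profile = Workload(update_interval_minutes=1, row_level_updates=True)
hourly_cdc = Workload(update_interval_minutes=60, changelog_consumption=True)
for w in (daily_batch, user_profile, hourly_cdc):
    print(w, "->", choose_format(w))
```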

DataFun: How does JuiceFS enable multi‑cloud support and what challenges does it address?

Qi Houliang: JuiceFS abstracts storage, offering a unified HDFS‑compatible interface that hides object‑store API differences, allowing the same code to run across clouds. It also integrates with Xiaomi’s IAM for fine‑grained access control, ensuring security and compliance in multi‑cloud environments.

DataFun: What future directions do you see for data lake formats?

Qi Houliang: Three trends: intelligent optimization (automatic partition pruning, indexing, hot‑cold tiering, and auto‑tuning), real‑time visibility (moving from minute‑level to second or millisecond latency), and multimodal support (handling structured, semi‑structured, and media data for AI training).

DataFun: How do you measure and control hidden API‑call costs when migrating from HDFS to JuiceFS + object storage?

Wang Chaohui: We build visual dashboards that monitor daily storage spend and usage, calculate per‑unit cost, and investigate any upward trend. To curb API‑call cost we employ multi‑level caching (local disk and distributed cache) to reduce object‑store accesses.
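Both ideas in this answer can be illustrated briefly: a per-unit cost series with a simple upward-trend check, and a read-through cache that counts how many requests actually reach the object store. Everything here is a hypothetical sketch: the function and class names, the three-day window, and the sample figures are invented for illustration, and in-memory dictionaries stand in for the local disk cache, the distributed cache, and the object store.

```python
def unit_costs(daily_spend, daily_usage_tb):
    """Per-unit cost (spend per TB stored) for each day."""
    return [spend / usage for spend, usage in zip(daily_spend, daily_usage_tb)]


def rising(series, window=3):
    """True if the last `window` values are strictly increasing."""
    tail = series[-window:]
    return len(tail) == window and all(a < b for a, b in zip(tail, tail[1:]))


class TieredReader:
    """Read path: local cache, then distributed cache, then object store."""

    def __init__(self, object_store):
        self.local = {}
        self.distributed = {}
        self.object_store = object_store
        self.object_store_calls = 0      # each call here is a billable API request

    def read(self, key):
        if key in self.local:
            return self.local[key]
        if key in self.distributed:
            value = self.distributed[key]
        else:
            self.object_store_calls += 1
            value = self.object_store[key]
            self.distributed[key] = value
        self.local[key] = value
        return value


costs = unit_costs([100.0, 110.0, 126.0], [10.0, 10.0, 10.5])
needs_investigation = rising(costs)

reader = TieredReader({"part-0": b"data"})
for _ in range(5):
    reader.read("part-0")
print(costs, needs_investigation, reader.object_store_calls)
```

Five reads of the same object reach the object store only once in this sketch, which is the effect multi-level caching has on API-call cost.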

DataFun: How is the cache strategy designed, and is there dynamic hotspot detection?

Wang Chaohui: Our cache uses an LRU-like eviction policy: when usage reaches a threshold, older entries are evicted asynchronously. Dynamic hotspot detection is not yet implemented; for now, hot tables are identified manually and directed to a shared cache cluster.
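A threshold-triggered LRU cache of the kind described can be sketched as follows. This is not JuiceFS's cache implementation: the class name, the byte-based watermarks, and in particular the low watermark (how far eviction goes once triggered) are assumptions. The `evict` method is the work a background thread would perform; it is called directly here to keep the example deterministic.

```python
from collections import OrderedDict


class ThresholdLRUCache:
    def __init__(self, high_watermark_bytes, low_watermark_bytes):
        self.high = high_watermark_bytes   # eviction is triggered at this usage
        self.low = low_watermark_bytes     # eviction stops once usage is at or below this
        self.entries = OrderedDict()       # key -> bytes, least recently used first
        self.used = 0

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)      # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.used -= len(self.entries.pop(key))
        self.entries[key] = value
        self.used += len(value)

    def needs_eviction(self):
        return self.used >= self.high

    def evict(self):
        """Drop least recently used entries until usage reaches the low watermark."""
        evicted = []
        while self.used > self.low and self.entries:
            key, value = self.entries.popitem(last=False)
            self.used -= len(value)
            evicted.append(key)
        return evicted


cache = ThresholdLRUCache(high_watermark_bytes=90, low_watermark_bytes=50)
for name in ("a", "b", "c"):
    cache.put(name, bytes(30))
cache.get("a")                             # "a" becomes the most recently used entry
evicted = cache.evict() if cache.needs_eviction() else []
print(evicted, cache.used)
```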

DataFun: How do you optimize cross‑cloud bandwidth consumption?

Wang Chaohui: We mount JuiceFS volumes to different buckets to avoid bandwidth throttling, and use distributed cache pre‑loading for hot data, reducing peak traffic and bandwidth usage.

DataFun: What were the biggest technical challenges in moving to a cloud‑native lakehouse, and how were they solved?

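Wang Chaohui: Two challenges stood out. The first was guaranteeing safe rollback during the Hive-to-Iceberg migration, which we handled with dual writes. The second was earning user confidence in JuiceFS stability, which we addressed by migrating in stages: first workloads that can be re-run if something goes wrong, then offline jobs, and finally real-time tasks.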
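The dual-write idea can be illustrated with a toy writer that sends every batch to both the legacy and the new table, plus a consistency check that gates the cutover. This is a simplified sketch under stated assumptions: `DualWriter` and `safe_to_cut_over` are hypothetical names, Python lists stand in for the Hive and Iceberg tables, and failure handling for a partially applied write is left out.

```python
class DualWriter:
    """Writes each batch to the legacy sink and to the new sink."""

    def __init__(self, legacy_sink, new_sink):
        self.legacy_sink = legacy_sink     # stands in for the Hive table
        self.new_sink = new_sink           # stands in for the Iceberg table

    def write(self, batch):
        # The legacy path keeps receiving every batch, so rolling back
        # only means pointing readers at the legacy table again.
        self.legacy_sink.append(batch)
        self.new_sink.append(batch)


def safe_to_cut_over(legacy_sink, new_sink):
    """Allow cutover only when both sinks hold identical data."""
    return legacy_sink == new_sink


hive_table, iceberg_table = [], []
dual = DualWriter(hive_table, iceberg_table)
dual.write([("user-1", 10)])
dual.write([("user-2", 20)])
ready = safe_to_cut_over(hive_table, iceberg_table)
print(ready)
```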

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
