
How Xiaomi Cuts Costs and Boosts Efficiency with a Cloud‑Native Lakehouse Architecture

Xiaomi’s data‑lake team explains how they tackled small‑file issues, unified metadata with Gravitino, migrated Hive to Iceberg and Fileset, leveraged JuiceFS for multi‑cloud storage, and combined Iceberg and Paimon to achieve cost‑effective, high‑performance batch and real‑time analytics.

DataFunTalk

Xiaomi addresses the data‑lake small‑file problem with both built‑in compaction and service‑hosted compaction, and uses layered metadata to accelerate queries and improve scalability. Gravitino provides unified metadata governance, enabling a smooth migration from Hive to Iceberg/Fileset, while JuiceFS offers a storage abstraction layer that controls costs and keeps data secure across multiple clouds.
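To make the compaction idea concrete, here is a minimal sketch of how a compaction job might plan its work: it picks files well below a target size and groups them into rewrite tasks of roughly that size. The names DataFile and plan_compaction, the 128 MiB target, and the 0.75 ratio are illustrative assumptions, not Xiaomi's or Iceberg's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFile:
    path: str
    size_bytes: int


def plan_compaction(files, target_bytes=128 * 1024 * 1024, small_ratio=0.75):
    """Group small files into rewrite tasks of at most target_bytes each.

    Hypothetical planner: files at or above small_ratio * target_bytes are
    considered large enough and are left untouched.
    """
    threshold = int(target_bytes * small_ratio)
    small = sorted(
        (f for f in files if f.size_bytes < threshold),
        key=lambda f: f.size_bytes,
        reverse=True,
    )

    groups, current, current_size = [], [], 0
    for f in small:
        # Close the current group when adding this file would overshoot.
        if current and current_size + f.size_bytes > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(f)
        current_size += f.size_bytes
    if current:
        groups.append(current)

    # Rewriting a single file on its own would not reduce the file count.
    return [g for g in groups if len(g) > 1]
```

Whether such a plan runs inside the writing job or in a separately hosted service is an operational choice; the grouping logic would be the same in both cases.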

In the architecture, Iceberg is dedicated to batch processing workloads, whereas Paimon handles real‑time update scenarios. To address cloud migration challenges, a tiered cache strategy significantly reduces object‑storage API costs, and a phased migration plan safeguards system stability.
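The effect of a tiered cache on object‑storage API calls can be illustrated with a small simulation. This is a sketch under stated assumptions: LRUCache, CountingObjectStore, and TieredReader are hypothetical classes, the "shared" tier stands in for a distributed cache, and real systems would also handle invalidation and concurrency.

```python
from collections import OrderedDict


class LRUCache:
    """Tiny least-recently-used cache used for both tiers in this sketch."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)
        return self._items[key]

    def put(self, key, value):
        self._items[key] = value
        self._items.move_to_end(key)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)


class CountingObjectStore:
    """Stand-in for an object store that counts billable GET requests."""

    def __init__(self, blobs):
        self._blobs = dict(blobs)
        self.get_calls = 0

    def get(self, key):
        self.get_calls += 1
        return self._blobs[key]


class TieredReader:
    """Read path: local cache, then shared cache, then the object store."""

    def __init__(self, store, local_capacity, shared_capacity):
        self.store = store
        self.local = LRUCache(local_capacity)
        self.shared = LRUCache(shared_capacity)

    def read(self, key):
        data = self.local.get(key)
        if data is not None:
            return data
        data = self.shared.get(key)
        if data is None:
            data = self.store.get(key)  # the only call that costs money
            self.shared.put(key, data)
        self.local.put(key, data)
        return data
```

In this model, repeated reads of a hot object reach the object store once; later reads are served from one of the cache tiers even after the smaller local cache has evicted the entry.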

During the interview, DataFun asks how the interviewee’s experience with Apache IoTDB and TsFile informs Xiaomi’s lakehouse design, especially regarding small files and metadata management. The answer highlights IoTDB’s millisecond‑level write latency and the three techniques it combines to keep small files in check: buffering writes in memory, flushing them in batches, and merging files asynchronously. TsFile’s self‑contained metadata design inspired similar query‑acceleration techniques in the lake.
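The buffer‑then‑flush part of that strategy can be sketched in a few lines. This is an illustration only, not IoTDB's code: BufferedWriter, flush_rows, and the sink callable are hypothetical, and the asynchronous merge step is omitted.

```python
class BufferedWriter:
    """Accumulate rows in memory and emit one file per full batch."""

    def __init__(self, flush_rows, sink):
        self.flush_rows = flush_rows
        self.sink = sink  # callable that persists a list of rows as one file
        self._buffer = []

    def write(self, row):
        self._buffer.append(row)
        if len(self._buffer) >= self.flush_rows:
            self.flush()

    def flush(self):
        # Skip empty flushes so no zero-row files are created.
        if self._buffer:
            self.sink(self._buffer)
            self._buffer = []


# Usage: 2,500 single-row writes become 3 files instead of 2,500.
files = []
writer = BufferedWriter(flush_rows=1000, sink=files.append)
for i in range(2500):
    writer.write({"id": i})
writer.flush()
```

The trade‑off is visibility: rows held in the buffer are not readable from storage until the next flush, which is one reason a background merge is still useful for the smaller files produced by forced or time‑based flushes.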

When migrating Hive tables, the team recommends upgrading to Iceberg for SQL‑centric workloads that do not need HDFS directory access, and provides a documented SOP for that migration. For tables that require HDFS directory access or hold diverse file formats (Parquet, JSON, CSV, etc.), Fileset is suggested, since it allows directory‑style reads with minimal changes on the business side. Both Iceberg and Fileset can use JuiceFS as the underlying storage, which further reduces costs.
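The routing rule described above could be encoded roughly as follows. This is a sketch, not the team's SOP: HiveTable and migration_target are hypothetical names, and treating "diverse file formats" as "anything outside Parquet, ORC, and Avro" (the data file formats Iceberg supports) is an assumption made for illustration.

```python
from dataclasses import dataclass

# Data file formats supported by Iceberg tables.
ICEBERG_DATA_FORMATS = frozenset({"parquet", "orc", "avro"})


@dataclass(frozen=True)
class HiveTable:
    name: str
    file_formats: frozenset
    needs_directory_access: bool


def migration_target(table):
    """Return "iceberg" or "fileset" for a Hive table (hypothetical rule)."""
    if table.needs_directory_access:
        return "fileset"
    # Assumption: any format Iceberg cannot store natively counts as "diverse".
    if not table.file_formats <= ICEBERG_DATA_FORMATS:
        return "fileset"
    return "iceberg"


# Usage with made-up table names.
orders = HiveTable("dwd_orders", frozenset({"parquet"}), False)
raw_logs = HiveTable("ods_raw_logs", frozenset({"json", "csv"}), False)
exports = HiveTable("partner_exports", frozenset({"parquet"}), True)
targets = {t.name: migration_target(t) for t in (orders, raw_logs, exports)}
```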

Gravitino plays a critical role as a unified metadata governance layer, consolidating metadata for Iceberg, Fileset, and non‑tabular assets, offering a single access point and permission view that dramatically lowers management complexity and migration overhead.

Iceberg and Paimon complement each other: Iceberg stores large‑scale batch analytics data (DWD/DWS) and has a mature ecosystem, while Paimon, with its LSM‑Tree architecture, targets low‑latency, high‑frequency update scenarios such as real‑time user profiles and CDC streams. The choice between them hinges on update frequency and latency requirements: hour‑ or day‑level batch updates favor Iceberg, whereas minute‑level row‑level updates or changelog consumption favor Paimon.
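As a rough illustration of that selection rule, the function below maps a workload's update pattern to a format. The function name, its parameters, and the one‑hour cut‑off between "minute‑level" and "hour‑level" are assumptions for the sketch; the article gives only the qualitative criteria.

```python
ONE_HOUR_SECONDS = 3600  # assumed boundary between minute-level and hour-level


def choose_table_format(update_interval_seconds,
                        row_level_updates=False,
                        consumes_changelog=False):
    """Pick "iceberg" or "paimon" from update frequency and access pattern."""
    if consumes_changelog:
        return "paimon"
    if row_level_updates and update_interval_seconds < ONE_HOUR_SECONDS:
        return "paimon"
    return "iceberg"


# Usage: a daily batch load versus a profile table updated every minute.
daily_batch = choose_table_format(24 * 3600)
user_profile = choose_table_format(60, row_level_updates=True)
```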

Multi‑cloud challenges are addressed by JuiceFS, which abstracts away object‑storage API differences with a unified HDFS interface, enabling “write‑once, run‑anywhere” code. Fine‑grained IAM integration ensures secure, compliant data access across clouds.
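The value of a single file‑system interface can be shown with a small model. Everything here is hypothetical: CloudAStore and CloudBStore imitate two vendor SDKs with different method names, and the FileSystem protocol plays the role that the unified interface plays in the article. None of this is JuiceFS's actual API.

```python
from typing import Protocol


class FileSystem(Protocol):
    def write(self, path: str, data: bytes) -> None: ...
    def read(self, path: str) -> bytes: ...


class CloudAStore:
    """Hypothetical vendor SDK exposing put_object/get_object."""

    def __init__(self):
        self.objects = {}

    def put_object(self, key, body):
        self.objects[key] = body

    def get_object(self, key):
        return self.objects[key]


class CloudBStore:
    """Hypothetical vendor SDK exposing upload_blob/download_blob."""

    def __init__(self):
        self.blobs = {}

    def upload_blob(self, name, content):
        self.blobs[name] = content

    def download_blob(self, name):
        return self.blobs[name]


class CloudAFileSystem:
    def __init__(self, store):
        self.store = store

    def write(self, path, data):
        self.store.put_object(path.lstrip("/"), data)

    def read(self, path):
        return self.store.get_object(path.lstrip("/"))


class CloudBFileSystem:
    def __init__(self, store):
        self.store = store

    def write(self, path, data):
        self.store.upload_blob(path.lstrip("/"), data)

    def read(self, path):
        return self.store.download_blob(path.lstrip("/"))


def export_report(fs: FileSystem, path: str, rows: list[str]) -> int:
    """Job code written once against the interface, unaware of the cloud."""
    fs.write(path, "\n".join(rows).encode("utf-8"))
    return len(fs.read(path).splitlines())
```

The job function never mentions a vendor; switching clouds means constructing a different adapter, which is the property the "write‑once, run‑anywhere" claim refers to.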

Future directions for lake formats include intelligent optimization (automatic partition pruning, indexing, hot‑cold tiering, and self‑tuning), real‑time visibility (moving from minute‑level to second‑ or millisecond‑level latency), and multimodal support (handling images, audio, video alongside structured data) to close the loop for AI training and inference.

Cost control is achieved through visual dashboards that monitor storage spend and usage, multi‑level caching (local disk and distributed caches) to cut object‑storage API calls, and a dual‑write migration strategy that allows immediate rollback to Hive if issues arise. The migration proceeds in stages: first stable batch workloads, then offline jobs, and finally real‑time tasks, each with risk mitigation measures.
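The dual‑write idea with rollback can be sketched as follows. This is an illustrative model rather than the team's pipeline: DualWriter, its callables, and the policy of pausing writes to the new path after a failure are all assumptions.

```python
class DualWriter:
    """Write every batch to the legacy table and, while healthy, the new one."""

    def __init__(self, write_legacy, write_new):
        self.write_legacy = write_legacy
        self.write_new = write_new
        self.serving = "new"  # which copy readers are pointed at
        self.failures = []

    def write(self, batch):
        # The legacy path is always written, so it stays a valid fallback.
        self.write_legacy(batch)
        if self.serving != "new":
            return  # new path is paused after a rollback (assumed policy)
        try:
            self.write_new(batch)
        except Exception as exc:
            self.failures.append(exc)
            self.rollback()

    def rollback(self):
        """Point readers back at the legacy table."""
        self.serving = "legacy"


# Usage: the new path fails on the third batch and readers fall back.
legacy_rows, new_rows = [], []


def flaky_new_write(batch):
    if batch == "b3":
        raise RuntimeError("simulated failure in the new write path")
    new_rows.append(batch)


writer = DualWriter(legacy_rows.append, flaky_new_write)
for batch in ["b1", "b2", "b3", "b4"]:
    writer.write(batch)
```

Because the legacy copy receives every batch, a rollback needs no data repair; only the new table has to be backfilled before a later retry.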

Collaboration between the data‑lake team (led by Qí Hòuliàng) and the storage team (led by Wáng Cháohuī) was essential. They coordinated dual‑write pipelines, incremental rollouts, and fallback mechanisms, ensuring a controlled, low‑risk transition to the cloud‑native lakehouse.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
