
Iceberg Table Format Practice in Huawei Terminal Cloud

This article explains how Huawei's terminal cloud adopts the Apache Iceberg table format to efficiently manage large-scale datasets, detailing its architecture, feature engineering, merge operations, LSM-based storage, schema versioning, AB testing support, catalog enhancements, and future roadmap for full lifecycle data governance.


Iceberg is an open table format for data lakes that supports row‑level updates, transactions, and snapshots, enabling efficient query processing over massive datasets. It has been widely adopted at many internet companies.

The article is organized into three parts: an overview of the terminal business, the practical use of Iceberg in Huawei Terminal Cloud, and future plans and outlook.

1. Terminal Business Overview – Huawei Terminal Cloud provides an all‑scenario smart experience based on the "1+8+N" ecosystem, continuously innovating and upgrading the platform to deliver secure, intelligent, convenient, rich, and personalized digital life experiences.

2. Iceberg Usage in Huawei Terminal Cloud

Iceberg addresses several pain points in feature engineering, such as feature duplication, inconsistent naming, high resource consumption for long‑cycle features, and latency/transaction issues in wide‑table updates. By leveraging Iceberg, Huawei builds an offline feature processing paradigm where each primary key has a dedicated feature table, eliminating duplicate construction and enabling efficient aggregation of long‑cycle data.
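The incremental idea behind the paradigm can be sketched in a few lines of plain Python. This is an illustration only, not Huawei's actual pipeline: the table structure, field names, and `clicks_30d` feature are hypothetical, standing in for a per‑primary‑key Iceberg feature table that absorbs daily increments instead of recomputing a long window from raw logs.

```python
def update_long_cycle_features(feature_table, daily_increment):
    """Merge one day's increment into a per-key feature table.

    Illustrative only: a real pipeline would write the result back to
    an Iceberg table keyed by the same primary key, avoiding a full
    recomputation of the long-cycle window from raw logs.
    """
    for key, inc in daily_increment.items():
        row = feature_table.setdefault(key, {"clicks_30d": 0})
        row["clicks_30d"] += inc["clicks"]
    return feature_table


table = {"u1": {"clicks_30d": 10}}
update_long_cycle_features(table, {"u1": {"clicks": 3}, "u2": {"clicks": 5}})
# table == {"u1": {"clicks_30d": 13}, "u2": {"clicks_30d": 5}}
```

Because each primary key owns one feature table, downstream teams reuse the same aggregated row rather than rebuilding overlapping features from scratch.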

Traditional Hive lacks robust row‑level update capabilities; Iceberg’s Merge Into operation provides a simpler, more performant way to handle updates, reducing dependency on upstream tasks and improving fault tolerance.
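The semantics of `MERGE INTO` (in Spark SQL: `MERGE INTO target t USING source s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *`) can be captured by a small upsert sketch. This is a conceptual model in plain Python, not Iceberg's implementation; the row and key names are illustrative.

```python
def merge_into(target, source, key="id"):
    """Simplified MERGE INTO semantics over lists of row dicts:
    matched rows are updated, unmatched source rows are inserted."""
    by_key = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            by_key[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return list(by_key.values())


rows = merge_into(
    target=[{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    source=[{"id": 2, "v": "B"}, {"id": 3, "v": "c"}],
)
# rows == [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
```

Because the whole upsert is one atomic statement, a failed run can simply be retried without coordinating with upstream tasks.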

However, Merge Into has limitations in real‑time scenarios (e.g., minute‑level updates to gross merchandise value, GMV) and can generate many small files under frequent commits.

To mitigate this, a Log‑Structured Merge‑Tree (LSM) based file update method is introduced. Data is first written to an in‑memory MemTable; once full, the MemTable is frozen as immutable and flushed to disk as a Sorted String Table (SSTable). Periodic compaction merges SSTables, reducing file count and read overhead.
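The write path above can be sketched as a toy LSM store. This is a minimal teaching model, not Huawei's file‑update implementation: the flush threshold, in‑memory "SSTables", and newest‑run‑wins lookup stand in for real on‑disk sorted runs.

```python
class MiniLSM:
    """Toy LSM tree: MemTable -> flush to sorted runs -> compaction."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []              # list of sorted runs, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # freeze the MemTable and persist it as an immutable sorted run
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):      # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # merge all runs into one, keeping the newest value per key
        merged = {}
        for run in self.sstables:
            merged.update(dict(run))
        self.sstables = [sorted(merged.items())]
```

Compaction is what keeps the small‑file count bounded: frequent minute‑level commits produce many tiny runs, and the periodic merge collapses them into one sorted file per level.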

Schema changes are handled via automatic JSON/MAP mapping and Flink stream listeners. When a schema change occurs, Flink pauses the stream, flushes existing data, and rewrites new JSON/MAP data according to the updated schema. Example initial schema: `{id: int, name: string}`.
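The rewrite step amounts to projecting each loosely‑typed JSON/MAP record onto the current column list. The following is a hedged sketch of that projection, not the actual Flink listener logic; the `age` column is a hypothetical addition to the example schema.

```python
def remap_record(record, schema):
    """Project a JSON/MAP record onto the current schema:
    missing columns are filled with None, removed columns are dropped."""
    return {field: record.get(field) for field in schema}


old = {"id": 1, "name": "a"}          # written under {id, name}
new_schema = ["id", "name", "age"]    # column "age" added later
remap_record(old, new_schema)
# -> {"id": 1, "name": "a", "age": None}
```

Because old records simply surface `None` for new columns, readers on the updated schema stay compatible with data written before the change.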

AB testing creates branches on Iceberg tables; to avoid duplicated writes, a virtual Schema+Branch mapping (Schema Version) is designed, allowing branches to share underlying files while maintaining separate schema subsets.
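The Schema+Branch idea can be modeled as branches that record only a column subset while sharing the same underlying rows. This toy model is an assumption‑laden illustration of the design, not the Iceberg branch implementation; class and branch names are invented.

```python
class BranchedTable:
    """Branches share the underlying data files; each branch only
    records which schema version (column subset) it exposes."""

    def __init__(self, columns):
        self.files = []                          # shared row store
        self.branches = {"main": list(columns)}

    def create_branch(self, name, columns):
        self.branches[name] = list(columns)      # no data is copied

    def write(self, row):
        self.files.append(row)                   # one write serves all branches

    def read(self, branch):
        cols = self.branches[branch]
        return [{c: r.get(c) for c in cols} for r in self.files]


t = BranchedTable(["id", "name"])
t.write({"id": 1, "name": "a"})
t.create_branch("exp", ["id", "name", "score"])  # AB-test branch adds a column
t.write({"id": 2, "name": "b", "score": 0.9})
```

A single write lands once in the shared store, and each branch's read path applies its own schema view, which is exactly what avoids the duplicated writes the virtual mapping was designed to eliminate.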

Catalog enhancements replace Hive Metastore with a REST Catalog, decoupling metadata access and enabling cache‑based sharing of metadata across tasks.
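The cache‑sharing benefit can be sketched as a catalog wrapper that memoizes metadata lookups. This is a simplified stand‑in for the REST Catalog behavior described above; the backend callable and method names are illustrative, not a real Iceberg client API.

```python
class CachingCatalog:
    """Wrap a catalog backend with a metadata cache so concurrent tasks
    share table metadata instead of re-fetching it on every access."""

    def __init__(self, backend):
        self.backend = backend      # e.g. a function doing one REST round trip
        self.cache = {}
        self.hits = 0

    def load_table(self, name):
        if name in self.cache:
            self.hits += 1
            return self.cache[name]
        meta = self.backend(name)
        self.cache[name] = meta
        return meta

    def invalidate(self, name):
        self.cache.pop(name, None)  # drop the stale entry after a commit
```

Decoupling metadata access behind this interface is what lets the catalog implementation change (Hive Metastore to REST) without touching compute tasks.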

Future directions include evolving Iceberg from a table format to a full‑lifecycle data‑governance platform with service, operation, and security rule engines, supporting features such as automated partition management, Z‑Order optimization, intelligent small‑file merging, metadata caching, SLA reporting, and low‑efficiency task detection.

Overall, the architecture demonstrates Iceberg’s transition toward a configurable, rule‑driven data governance solution that integrates streaming, batch, and hybrid read modes while addressing performance, consistency, and manageability challenges.

Big Data · Streaming · Data Lake · Iceberg · Huawei Cloud · Schema Management
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
