
Iceberg Table Format Practice in Huawei Terminal Cloud

This article explains how Huawei's terminal cloud adopts the Apache Iceberg table format to efficiently manage large-scale datasets, detailing its architecture, feature engineering, merge operations, LSM-based storage, schema versioning, AB testing support, catalog enhancements, and future roadmap for full lifecycle data governance.


Iceberg is an open table format for data lakes that supports row‑level updates, transactions, and snapshots, enabling efficient query processing over massive datasets. It has been widely adopted at many internet companies.

The article is organized into three parts: an overview of the terminal business, the practical use of Iceberg in Huawei Terminal Cloud, and future plans and outlook.

1. Terminal Business Overview – Huawei Terminal Cloud provides an all‑scenario smart experience based on the "1+8+N" ecosystem, continuously innovating and upgrading the platform to deliver secure, intelligent, convenient, rich, and personalized digital life experiences.

2. Iceberg Usage in Huawei Terminal Cloud

Iceberg addresses several pain points in feature engineering, such as feature duplication, inconsistent naming, high resource consumption for long‑cycle features, and latency/transaction issues in wide‑table updates. By leveraging Iceberg, Huawei builds an offline feature processing paradigm where each primary key has a dedicated feature table, eliminating duplicate construction and enabling efficient aggregation of long‑cycle data.
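The incremental idea behind the paradigm can be sketched in a few lines of plain Python. This is an illustration only, not Huawei's actual pipeline: the table structure, field names, and `clicks_30d` feature are hypothetical, standing in for a per‑primary‑key Iceberg feature table that absorbs daily increments instead of recomputing a long window from raw logs.

```python
def update_long_cycle_features(feature_table, daily_increment):
    """Merge one day's increment into a per-key feature table.

    Illustrative only: a real pipeline would write the result back to
    an Iceberg table keyed by the same primary key, avoiding a full
    recomputation of the long-cycle window from raw logs.
    """
    for key, inc in daily_increment.items():
        row = feature_table.setdefault(key, {"clicks_30d": 0})
        row["clicks_30d"] += inc["clicks"]
    return feature_table


table = {"u1": {"clicks_30d": 10}}
update_long_cycle_features(table, {"u1": {"clicks": 3}, "u2": {"clicks": 5}})
# table == {"u1": {"clicks_30d": 13}, "u2": {"clicks_30d": 5}}
```

Because each primary key owns one feature table, downstream teams reuse the same aggregated row rather than rebuilding overlapping features from scratch.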

Traditional Hive lacks robust row‑level update capabilities; Iceberg’s Merge Into operation provides a simpler, more performant way to handle updates, reducing dependency on upstream tasks and improving fault tolerance.
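The semantics of `MERGE INTO` (in Spark SQL: `MERGE INTO target t USING source s ON t.id = s.id WHEN MATCHED THEN UPDATE SET * WHEN NOT MATCHED THEN INSERT *`) can be captured by a small upsert sketch. This is a conceptual model in plain Python, not Iceberg's implementation; the row and key names are illustrative.

```python
def merge_into(target, source, key="id"):
    """Simplified MERGE INTO semantics over lists of row dicts:
    matched rows are updated, unmatched source rows are inserted."""
    by_key = {row[key]: dict(row) for row in target}
    for row in source:
        if row[key] in by_key:
            by_key[row[key]].update(row)   # WHEN MATCHED THEN UPDATE
        else:
            by_key[row[key]] = dict(row)   # WHEN NOT MATCHED THEN INSERT
    return list(by_key.values())


rows = merge_into(
    target=[{"id": 1, "v": "a"}, {"id": 2, "v": "b"}],
    source=[{"id": 2, "v": "B"}, {"id": 3, "v": "c"}],
)
# rows == [{"id": 1, "v": "a"}, {"id": 2, "v": "B"}, {"id": 3, "v": "c"}]
```

Because the whole upsert is one atomic statement, a failed run can simply be retried without coordinating with upstream tasks.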

However, Merge Into has limitations in real‑time scenarios (e.g., minute‑level updates to gross merchandise value, GMV) and can generate many small files under frequent commits.

To mitigate this, a Log‑Structured Merge‑Tree (LSM) based file update method is introduced. Data is first written to an in‑memory MemTable; once full, the MemTable is frozen as immutable and flushed to disk as a Sorted String Table (SSTable). Periodic compaction merges SSTables, reducing file count and read overhead.
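The write path above can be sketched as a toy LSM store. This is a minimal teaching model, not Huawei's file‑update implementation: the flush threshold, in‑memory "SSTables", and newest‑run‑wins lookup stand in for real on‑disk sorted runs.

```python
class MiniLSM:
    """Toy LSM tree: MemTable -> flush to sorted runs -> compaction."""

    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.sstables = []              # list of sorted runs, oldest first
        self.limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.limit:
            self.flush()

    def flush(self):
        # freeze the MemTable and persist it as an immutable sorted run
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for run in reversed(self.sstables):      # newest run wins
            for k, v in run:
                if k == key:
                    return v
        return None

    def compact(self):
        # merge all runs into one, keeping the newest value per key
        merged = {}
        for run in self.sstables:
            merged.update(dict(run))
        self.sstables = [sorted(merged.items())]
```

Compaction is what keeps the small‑file count bounded: frequent minute‑level commits produce many tiny runs, and the periodic merge collapses them into one sorted file per level.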

Schema changes are handled via automatic JSON/MAP mapping and Flink stream listeners. When a schema change occurs, Flink pauses the stream, flushes existing data, and rewrites new JSON/MAP data according to the updated schema. Example initial schema: `{id: int, name: string}`.
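The rewrite step amounts to projecting each loosely‑typed JSON/MAP record onto the current column list. The following is a hedged sketch of that projection, not the actual Flink listener logic; the `age` column is a hypothetical addition to the example schema.

```python
def remap_record(record, schema):
    """Project a JSON/MAP record onto the current schema:
    missing columns are filled with None, removed columns are dropped."""
    return {field: record.get(field) for field in schema}


old = {"id": 1, "name": "a"}          # written under {id, name}
new_schema = ["id", "name", "age"]    # column "age" added later
remap_record(old, new_schema)
# -> {"id": 1, "name": "a", "age": None}
```

Because old records simply surface `None` for new columns, readers on the updated schema stay compatible with data written before the change.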

AB testing creates branches on Iceberg tables; to avoid duplicated writes, a virtual Schema+Branch mapping (Schema Version) is designed, allowing branches to share underlying files while maintaining separate schema subsets.
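The Schema+Branch idea can be modeled as branches that record only a column subset while sharing the same underlying rows. This toy model is an assumption‑laden illustration of the design, not the Iceberg branch implementation; class and branch names are invented.

```python
class BranchedTable:
    """Branches share the underlying data files; each branch only
    records which schema version (column subset) it exposes."""

    def __init__(self, columns):
        self.files = []                          # shared row store
        self.branches = {"main": list(columns)}

    def create_branch(self, name, columns):
        self.branches[name] = list(columns)      # no data is copied

    def write(self, row):
        self.files.append(row)                   # one write serves all branches

    def read(self, branch):
        cols = self.branches[branch]
        return [{c: r.get(c) for c in cols} for r in self.files]


t = BranchedTable(["id", "name"])
t.write({"id": 1, "name": "a"})
t.create_branch("exp", ["id", "name", "score"])  # AB-test branch adds a column
t.write({"id": 2, "name": "b", "score": 0.9})
```

A single write lands once in the shared store, and each branch's read path applies its own schema view, which is exactly what avoids the duplicated writes the virtual mapping was designed to eliminate.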

Catalog enhancements replace Hive Metastore with a REST Catalog, decoupling metadata access and enabling cache‑based sharing of metadata across tasks.
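The cache‑sharing benefit can be sketched as a catalog wrapper that memoizes metadata lookups. This is a simplified stand‑in for the REST Catalog behavior described above; the backend callable and method names are illustrative, not a real Iceberg client API.

```python
class CachingCatalog:
    """Wrap a catalog backend with a metadata cache so concurrent tasks
    share table metadata instead of re-fetching it on every access."""

    def __init__(self, backend):
        self.backend = backend      # e.g. a function doing one REST round trip
        self.cache = {}
        self.hits = 0

    def load_table(self, name):
        if name in self.cache:
            self.hits += 1
            return self.cache[name]
        meta = self.backend(name)
        self.cache[name] = meta
        return meta

    def invalidate(self, name):
        self.cache.pop(name, None)  # drop the stale entry after a commit
```

Decoupling metadata access behind this interface is what lets the catalog implementation change (Hive Metastore to REST) without touching compute tasks.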

Future directions include evolving Iceberg from a table format to a full‑lifecycle data‑governance platform with service, operation, and security rule engines, supporting features such as automated partition management, Z‑Order optimization, intelligent small‑file merging, metadata caching, SLA reporting, and low‑efficiency task detection.

Overall, the architecture demonstrates Iceberg’s transition toward a configurable, rule‑driven data governance solution that integrates streaming, batch, and hybrid read modes while addressing performance, consistency, and manageability challenges.

Big Data · Streaming · Data Lake · Iceberg · Huawei Cloud · Schema Management
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
