Exploring Iceberg in Huawei Terminal Cloud: Architecture, Features, and Future Plans
This article presents a comprehensive overview of Iceberg's adoption in Huawei Terminal Cloud, covering the overall architecture, key features such as Git-style data management, real-time processing, and acceleration layers, and future development directions, along with a Q&A session addressing performance and implementation details.
Iceberg is a widely used table format framework in lakehouse architectures, already applied in many enterprises. This talk shares Huawei Terminal Cloud's exploration and practice of Iceberg, organized into three parts: overall overview, feature applications, and future plans.
01 Overall Overview
The evolution of data analysis is described: initially focused on structured data in data warehouses (e.g., Oracle) with BI reporting; then big‑data frameworks introduced semi‑structured data and expanded to data science, machine learning, and real‑time processing, still split between warehouses and clusters; finally, lakehouse architectures emerged, unifying storage for comprehensive analysis.
Traditional Hive catalog solutions define tables as directories, storing metadata and partitions in the Hive metastore. Advantages include broad engine compatibility, partition-level atomicity, independence from storage format, and metadata descriptions shared across the ecosystem. Disadvantages include inefficient handling of small schema changes, unsafe multi-partition modifications, slow directory listing for large tables, the need to understand the physical layout, stale statistics, and poor performance on object storage.
To address these issues, a shift from directory‑level to file‑level tracking is proposed, aiming to ensure table correctness and consistency, faster planning and execution, transparent physical structure, support for schema evolution, and scalable performance on large data volumes.
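The shift from directory-level to file-level tracking can be sketched in miniature: each commit produces an immutable snapshot that lists its data files directly, so query planning prunes on file metadata instead of listing directories. The classes and names below are illustrative sketches, not Iceberg's actual internals:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataFile:
    path: str
    partition: str
    row_count: int

@dataclass
class Snapshot:
    snapshot_id: int
    files: tuple  # immutable list of DataFile entries

class Table:
    """Illustrative file-level table: every commit yields a new immutable
    snapshot that lists its data files, so planning never lists directories."""
    def __init__(self):
        self.snapshots = []

    def commit(self, new_files):
        prev = self.snapshots[-1].files if self.snapshots else ()
        snap = Snapshot(len(self.snapshots) + 1, prev + tuple(new_files))
        self.snapshots.append(snap)
        return snap

    def plan_scan(self, partition):
        # Prune with per-file metadata instead of walking a directory tree.
        return [f for f in self.snapshots[-1].files if f.partition == partition]

t = Table()
t.commit([DataFile("s3://bucket/a.parquet", "2024-01-01", 100)])
t.commit([DataFile("s3://bucket/b.parquet", "2024-01-02", 50)])
```

Because each snapshot is immutable and complete, table correctness and consistency follow directly: readers see exactly the file list of one snapshot, and schema evolution or concurrent commits cannot leave a reader with a half-modified view.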
02 Feature Applications
1. Git-style Data Management
Iceberg enables a linear snapshot history with branches and tags, allowing data reads and writes similar to Git code management. Stable, compacted snapshots can be generated periodically, and historical snapshots can be tagged for safe back-filling. Experiments can be performed on separate branches without affecting the main data and later merged back, with incremental storage keeping redundancy low.
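A minimal sketch of this Git-style ref model, assuming branches are mutable pointers to snapshot ids while tags are fixed labels. The class and method names are hypothetical, not an Iceberg API:

```python
class SnapshotRefs:
    """Illustrative Git-style refs over snapshot ids: branches move as new
    snapshots are committed, tags pin a snapshot permanently."""
    def __init__(self):
        self.next_id = 1
        self.parents = {}            # snapshot_id -> parent snapshot_id
        self.branches = {"main": None}
        self.tags = {}

    def commit(self, branch="main"):
        snap = self.next_id
        self.next_id += 1
        self.parents[snap] = self.branches[branch]
        self.branches[branch] = snap  # branch ref advances
        return snap

    def create_branch(self, name, from_branch="main"):
        # A branch is just a pointer copy: no data files are duplicated.
        self.branches[name] = self.branches[from_branch]

    def tag(self, name, snapshot_id):
        # Tags label a fixed snapshot, e.g. a stable compacted state
        # that back-fill jobs can safely read.
        self.tags[name] = snapshot_id

    def fast_forward(self, target, source):
        # Merge an experiment branch back by moving the target ref.
        self.branches[target] = self.branches[source]

refs = SnapshotRefs()
refs.commit()                      # snapshot 1 on main
refs.tag("stable-v1", 1)           # label for safe back-filling
refs.create_branch("experiment")   # fork without copying data
refs.commit("experiment")          # snapshot 2, visible only on the branch
refs.fast_forward("main", "experiment")
```

The sketch shows why branching keeps storage redundancy low: a branch is a pointer into the shared snapshot history, not a copy of the data, and unchanged files are shared across all refs.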
2. Real-time Processing
By adding a LogStore module that pairs message queues with file storage, Iceberg tables achieve second-level latency. Data is written in real time and can be read in three modes: full snapshot, incremental (CDC) from the LogStore, and hybrid (snapshot plus recent LogStore data). Open challenges include fine-grained snapshot conflict resolution and disaster recovery across the LogStore and file storage.
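The hybrid read mode can be illustrated with a small sketch: the bulk of the data comes from the latest file snapshot, and only records newer than the snapshot's checkpoint are served from the LogStore, so rows already flushed to files are not applied twice. All names and the checkpoint mechanism here are hypothetical simplifications:

```python
# Snapshot state flushed to files, as (key, value, timestamp) rows.
snapshot_rows = [("k1", "v1", 100), ("k2", "v2", 180)]
snapshot_checkpoint = 200   # everything up to here is already in files

# LogStore tail in a message queue, possibly overlapping the snapshot.
log_store = [("k2", "v2-old", 190), ("k3", "v3", 210), ("k1", "v1-new", 250)]

def hybrid_read(snapshot_rows, checkpoint, log_store):
    """Hybrid scan sketch: start from the snapshot, then apply only log
    entries past the checkpoint so flushed records are not double-counted."""
    state = {k: v for k, v, _ in snapshot_rows}
    for k, v, ts in log_store:
        if ts > checkpoint:
            state[k] = v
    return state
```

The same structure hints at the disaster-recovery challenge the talk mentions: the checkpoint is the contract between LogStore and file storage, and if either side loses data around it, the merged view becomes inconsistent.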
3. Acceleration Layer
Using technologies such as Alluxio, a cold/hot tiered acceleration layer keeps recent data in fast storage and older data in cheaper storage, improving query speed and reducing cluster load.
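A cold/hot routing policy can be as simple as an age threshold on file modification time; the window size and tier names below are hypothetical placeholders for whatever the cache layer (e.g. Alluxio over SSD in front of object storage) actually uses:

```python
import time

HOT_WINDOW_SECONDS = 7 * 24 * 3600  # hypothetical: last week counts as "hot"

def choose_tier(file_mtime, now=None):
    """Route recently written files to the fast cache tier and older
    files to cheap object storage, based on file age."""
    now = time.time() if now is None else now
    if now - file_mtime <= HOT_WINDOW_SECONDS:
        return "hot-cache"
    return "cold-object-store"
```

In practice the policy can also weigh access frequency rather than age alone, but the effect is the same: hot reads are served from fast storage while cold data stays cheap.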
4. Flink Unify Sink
Based on Flink FLIP-143 and FLIP-191, the Iceberg Flink sink is refactored to generate index information after task submission and to merge small files in real time, accelerating subsequent queries.
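The small-file merging step can be sketched as a greedy grouping pass over committed files: files below a target size are batched so that each rewrite produces roughly one target-sized file. The target value and function name are illustrative, not part of the Flink or Iceberg APIs:

```python
TARGET_FILE_BYTES = 128  # hypothetical compaction target, in MB for brevity

def plan_compaction(file_sizes, target=TARGET_FILE_BYTES):
    """Greedy sketch of small-file merge planning: group committed files
    below the target size into batches of roughly target size each.
    Returns a list of groups; each group is rewritten into one file."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes, reverse=True):
        if size >= target:
            continue  # already large enough, leave untouched
        if current and current_size + size > target:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if len(current) > 1:  # a lone leftover file needs no merge
        groups.append(current)
    return groups
```

Running such a planner as part of the sink pipeline, rather than as a separate batch job, is what lets small files be merged in near real time instead of accumulating until an offline compaction pass.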
5. Other Enhancements
Ongoing work includes secondary-index support for primary-key lookups, Flink-specific features (e.g., snapshot management and data rewriting), and improved column-level updates via specially constructed files carrying pseudo-primary-key metadata.
03 Future Plans
Faster: Introduce more indexing and optimization techniques to boost read/write performance, especially for real-time scenarios.
Richer: Expand usage in advertising, recommendation, and feature engineering, addressing needs such as partial column updates and upserts.
Easier to Use: Productize the solution so users can manage snapshots and file governance without directly invoking Iceberg APIs.
A Q&A session follows, covering query performance under massive deletions (Merge-On-Read with asynchronous compaction is recommended), LogStore data visibility in full scans, commercial deployment of Git-style management and the Flink Unify Sink, cold/hot data labeling, column-level updates, and mechanisms for partial column addition and deletion.
Thank you for attending.
DataFunSummit