Performance Optimization of Iceberg Real‑time Data Warehouse and Arctic Enhancements
This article presents a comprehensive overview of Iceberg MOR principles, Arctic-based performance optimizations, a benchmark evaluation using CH-benCHmark, and future roadmap items, highlighting how file-type strategies, self-optimizing mechanisms, and task balancing improve real-time data lake query efficiency.
Introduction
This article shares performance optimization techniques for Iceberg real‑time data warehouses, covering four main aspects: Iceberg MOR principle, Arctic‑based optimizations, benchmark evaluation, and future plans.
01 Iceberg MOR Principle Introduction
1. MOR Overview – Merge On Read (MOR) is an out-of-place update technique: changes are recorded in separate files and merged with the base data at read time, trading higher read cost for low write cost, which makes it well suited to real-time ingestion scenarios.
2. Three Iceberg File Types
Iceberg uses data-files for normal inserts, equality-delete-files that delete rows by matching values in specified key columns, and position-delete-files that delete rows by their ordinal position within a specific data-file.
3. Equality‑delete Mechanism
During reads, equality‑delete data is loaded into memory, a hash table is built on the specified columns, and matching rows are filtered out.
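The read-time filtering step can be sketched as follows (a minimal illustration, not Iceberg's actual reader; the function name, dict-based rows, and the `id` equality column are assumptions for the example):

```python
def apply_equality_deletes(data_rows, delete_rows, eq_cols=("id",)):
    """Drop data rows whose values in eq_cols match any equality-delete row."""
    # Load the equality-delete rows into an in-memory hash set keyed
    # on the specified equality columns.
    deleted_keys = {tuple(r[c] for c in eq_cols) for r in delete_rows}
    # Probe the hash set while scanning the data file; matching rows are dropped.
    return [r for r in data_rows
            if tuple(r[c] for c in eq_cols) not in deleted_keys]

data = [{"id": 1, "v": "a"}, {"id": 2, "v": "b"}, {"id": 3, "v": "c"}]
deletes = [{"id": 2}]
print(apply_equality_deletes(data, deletes))  # rows with id 1 and 3 remain
```

Note that the whole delete set must fit in memory, which is the main cost of this mechanism.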
4. Position‑delete Mechanism
Two approaches are used:
Bitmap construction – position‑delete‑files are read into memory, a bitmap of row numbers to discard is built, and matching rows are omitted during data‑file reads.
Sort‑merge – both data‑files and position‑delete‑files are sorted by row number, allowing a merge‑join style elimination.
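Both approaches can be sketched in a simplified form (illustrative only; a real reader uses roaring bitmaps and file-scoped positions, and the function names here are assumptions):

```python
def apply_position_deletes_bitmap(data_rows, delete_positions):
    """Bitmap approach: collect the row positions to discard, then skip
    them while scanning the data-file (a set stands in for a roaring bitmap)."""
    dropped = set(delete_positions)
    return [row for pos, row in enumerate(data_rows) if pos not in dropped]

def apply_position_deletes_merge(data_rows, sorted_delete_positions):
    """Sort-merge approach: both sides are ordered by row position, so a
    single forward pass eliminates deleted rows without building a bitmap."""
    out, i = [], 0
    for pos, row in enumerate(data_rows):
        # Advance the delete cursor past positions smaller than the current row.
        while i < len(sorted_delete_positions) and sorted_delete_positions[i] < pos:
            i += 1
        if i < len(sorted_delete_positions) and sorted_delete_positions[i] == pos:
            continue  # this row position is deleted
        out.append(row)
    return out
```

The sort-merge variant avoids materializing the bitmap, at the cost of requiring both inputs to be sorted by position.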
5. Iceberg File Organization and Task Structure
Each data‑file can be split into one or more tasks, which serve as the smallest read units.
02 Arctic Based on Iceberg Performance Optimization
1. Arctic Overview
Arctic is NetEase’s open‑architecture lake‑warehouse system built on Iceberg, offering stream‑and‑update‑oriented optimizations and a self‑optimizing mechanism.
2. Why Optimize
Challenges include small files from frequent checkpoints, excessive delete files increasing storage and read cost, inefficient data organization, and lingering stale files.
3. Self‑optimizing Features
Provides automatic execution, resource isolation via optimizer groups and quotas, and flexible deployment of Flink-based optimizers on YARN or Kubernetes, coordinated by AMS (the Arctic Management Service).
4. Small File Merging
Merges many small files into larger ones, reducing HDFS NameNode pressure and shrinking Iceberg metadata.
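A simplified planning step for such a merge might look like the following (a hypothetical sketch using greedy packing by size; a real optimizer would also weigh partition boundaries and sequence numbers):

```python
def plan_merge_groups(files, target_bytes=128 * 1024 * 1024):
    """files: list of (path, size_bytes) tuples. Greedily pack small files
    into groups whose combined size approaches the target output file size."""
    groups, current, current_size = [], [], 0
    for path, size in sorted(files, key=lambda f: f[1]):  # smallest first
        current.append(path)
        current_size += size
        if current_size >= target_bytes:
            groups.append(current)       # this group is full; start a new one
            current, current_size = [], 0
    if current:
        groups.append(current)           # leftover files form a final group
    return groups
```

Each resulting group would be rewritten as a single larger file, replacing its inputs in the table snapshot.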
5. Delete File Elimination
Combines delete files with data files to reduce file count and read overhead.
6. Equality‑delete to Position‑delete Conversion
Transforms high‑memory‑cost equality‑deletes into low‑memory position‑deletes, improving query performance when delete ratios are low.
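The conversion can be sketched as a scan that resolves each equality-delete key to concrete (file, position) pairs (an illustration under simplified assumptions: rows as dicts, a single `id` key column, and hypothetical function names):

```python
def convert_equality_to_position_deletes(data_files, eq_delete_keys, key_col="id"):
    """Scan each data file once and record the positions of rows hit by
    the equality deletes. data_files: {file_path: [row, ...]}.
    Returns {file_path: [deleted_position, ...]}."""
    keys = set(eq_delete_keys)
    pos_deletes = {}
    for path, rows in data_files.items():
        hits = [pos for pos, row in enumerate(rows) if row[key_col] in keys]
        if hits:
            pos_deletes[path] = hits
    return pos_deletes

files = {"f1.parquet": [{"id": 1}, {"id": 2}, {"id": 3}],
         "f2.parquet": [{"id": 4}, {"id": 2}]}
print(convert_equality_to_position_deletes(files, [2]))
```

After conversion, readers only need cheap position lookups instead of building a hash table over all equality-delete rows.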
7. Self‑optimizing Performance Impact
Benchmark charts show that without self-optimizing, Iceberg MOR query performance degrades sharply after 60–90 minutes of continuous writing, while with self-optimizing enabled it remains stable.
8. Delete File Reuse and Task Balancing
Repeated reads of the same delete files across many tasks are mitigated in Arctic's mixed Iceberg format by three strategies: grouping files by hash, reusing delete files across the tasks within a group, and balancing task assignment with a greedy partitioning algorithm.
Performance tests demonstrate significant gains in real-time workloads.
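The greedy partitioning idea can be sketched as a longest-processing-time-first assignment (an illustration, not Arctic's exact planner; function and variable names are assumptions):

```python
import heapq

def balance_tasks(task_costs, n_slots):
    """Sort tasks by cost descending, then always hand the next task to the
    currently least-loaded slot, keeping slot loads as even as possible."""
    # Heap of (current_load, slot_index, assigned_tasks); the unique index
    # breaks ties so the task lists are never compared.
    slots = [(0, i, []) for i in range(n_slots)]
    heapq.heapify(slots)
    for cost in sorted(task_costs, reverse=True):
        load, i, tasks = heapq.heappop(slots)   # least-loaded slot
        tasks.append(cost)
        heapq.heappush(slots, (load + cost, i, tasks))
    return [tasks for _, _, tasks in sorted(slots, key=lambda s: s[1])]

print(balance_tasks([10, 7, 5, 4, 2], 2))  # two slots with load 14 each
```

Assigning large tasks first is what keeps the final loads close: small tasks fill in the remaining imbalance at the end.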
9. Impactful Parameters
File format (Parquet vs. Avro) and compression codec heavily influence query performance and resource consumption.
03 Optimization Effect Evaluation
1. TPC‑C & TPC‑H
Traditional benchmarks such as TPC-C (purely transactional) and TPC-H (purely analytical) are not ideal for evaluating row-level updates in data lakes.
2. CH-benCHmark
Combines the TPC-C transaction workload with adapted TPC-H queries to form a complex mixed workload, simulating CDC data ingestion followed by analytical queries.
04 Future Plans
Asynchronous global data sorting (including Z‑order).
Asynchronous secondary index construction.
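Z-order sorting clusters rows so that range filters on any participating column touch roughly contiguous data; its core is bit interleaving. A minimal sketch for two non-negative integer columns (the function name and fixed bit width are assumptions for the example):

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of two column values into one Z-order (Morton) key:
    x occupies the even bit positions, y the odd ones."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bit i of x -> even position
        key |= ((y >> i) & 1) << (2 * i + 1)  # bit i of y -> odd position
    return key

rows = [(3, 5), (0, 0), (7, 1), (2, 2)]
rows.sort(key=lambda r: z_order_key(*r))
print(rows)  # rows now laid out in Z-order: (0, 0), (2, 2), (7, 1), (3, 5)
```

Sorting files by such a key keeps rows that are close in either column close on disk, which is why it is a candidate for asynchronous, global reorganization.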
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.