How Tencent Cloud Uses Iceberg, Gravitino and Multimodal Lakes for Unified Data Processing
This article series explores Tencent Cloud's Iceberg‑based batch‑stream integration, Apache Gravitino's unified metadata and lineage solution, Xiaohongshu's data‑architecture evolution for the big‑data‑and‑AI era, and a practical Data+AI multimodal data‑lake implementation, highlighting the challenges, architectural designs, and performance gains of each.
1. Batch‑Stream Integration on Apache Iceberg
Tencent Cloud's implementation extends Apache Iceberg (TC‑Iceberg) with a dual‑store architecture: a base store holding stable snapshots and a change store accumulating real‑time inserts, updates, and deletes. A merge‑on‑read strategy reads the base snapshot and applies pending changes on the fly, while an auto‑compaction daemon periodically rewrites merged files to limit read‑write amplification. To improve merge parallelism, the system automatically hashes primary‑key values into hash buckets, localising the merge range and reducing network shuffle. The design also includes materialised‑view support and a roadmap toward sub‑second query latency.
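To make the dual‑store read path concrete, here is a minimal sketch of merge‑on‑read with primary‑key hash bucketing in plain Python. The bucket count, record shape, and function names are illustrative assumptions, not TC‑Iceberg's actual API, which operates on Iceberg data and delete files rather than Python dicts.

```python
# Illustrative sketch: merge-on-read over a base snapshot plus a change store,
# with primary-key hash bucketing so base and change records for the same key
# land in the same bucket and can be merged locally without a shuffle.

NUM_BUCKETS = 16  # assumed bucket count, purely for illustration

def bucket_of(primary_key) -> int:
    """Hash the primary key into a bucket; records sharing a key share a bucket."""
    return hash(primary_key) % NUM_BUCKETS

def merge_on_read(base_rows, change_rows):
    """Apply pending inserts/updates/deletes from the change store on top of
    the stable base snapshot at read time."""
    merged = {row["pk"]: row for row in base_rows}            # base snapshot
    for change in change_rows:                                 # accumulated changes
        if change["op"] == "delete":
            merged.pop(change["pk"], None)
        else:                                                  # insert or update
            merged[change["pk"]] = {k: v for k, v in change.items() if k != "op"}
    return list(merged.values())

# Example: one bucket's worth of data
base = [{"pk": 1, "v": "a"}, {"pk": 2, "v": "b"}]
changes = [{"op": "update", "pk": 2, "v": "b2"},
           {"op": "delete", "pk": 1},
           {"op": "insert", "pk": 3, "v": "c"}]
print(merge_on_read(base, changes))  # [{'pk': 2, 'v': 'b2'}, {'pk': 3, 'v': 'c'}]
```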
2. Unified Metadata and Lineage with Apache Gravitino
Apache Gravitino provides a unified metadata catalog that can store table schemas, partitions, and access control across heterogeneous data sources. By embedding the OpenLineage collection framework, Gravitino captures lineage events from multiple compute engines (e.g., Spark, Flink, Trino). The Facet extension enriches each lineage event with custom attributes, enabling cross‑engine lineage mapping and field‑level provenance. The overall governance architecture consists of:
Gravitino metadata service (central catalog)
OpenLineage collectors deployed alongside each engine
Facet‑based event schema for extensibility
Source code and example configurations are available in the project repository (e.g., https://github.com/apache/gravitino).
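To illustrate the facet‑based event schema, the snippet below assembles an OpenLineage‑style run event carrying a hypothetical custom facet for field‑level provenance. The facet name (columnMapping), job, dataset names, and namespaces are illustrative assumptions rather than a documented Gravitino schema.

```python
# Illustrative OpenLineage-style run event with a hypothetical custom facet.
# All names and namespaces below are made up for illustration; consult the
# Gravitino and OpenLineage documentation for the real event schemas.
import json
from datetime import datetime, timezone

event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": "3f1c2a9e-0000-0000-0000-000000000000"},
    "job": {"namespace": "spark://prod", "name": "daily_orders_etl"},
    "inputs": [{"namespace": "hive://warehouse", "name": "ods.orders"}],
    "outputs": [{
        "namespace": "iceberg://lake",
        "name": "dws.orders_agg",
        # Custom facet: field-level provenance attached to the output dataset
        "facets": {
            "columnMapping": {
                "order_amount_sum": ["ods.orders.amount"],
                "order_cnt": ["ods.orders.order_id"],
            }
        },
    }],
    "producer": "https://github.com/apache/gravitino",
}
print(json.dumps(event, indent=2))
```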
3. Incremental Computation in a Lakehouse for Xiaohongshu
To replace the traditional Lambda architecture, Xiaohongshu adopts a universal incremental computation model built on Iceberg storage and an incremental processing engine. Key techniques include:
Z‑Order sorting of Iceberg data files to co‑locate related rows on disk
Smart indexing that maintains auxiliary indexes for fast point‑lookup and range scans
These optimisations reduce the amount of data scanned per query by roughly tenfold, achieving a P90 query latency of 5 seconds in benchmark tests. The architecture supports both batch back‑fills and low‑latency stream updates, enabling use cases such as community feed ranking and e‑commerce recommendation.
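As a sketch of why Z‑Order sorting co‑locates related rows, the snippet below interleaves the bits of two sort keys into a single Morton (Z) value; sorting files by that value keeps rows that are close in both keys physically adjacent. The 32‑bit key width and function name are illustrative assumptions; in practice Iceberg applies Z‑ordering through its table rewrite/compaction configuration.

```python
# Minimal sketch of Z-order (Morton) encoding for two 32-bit sort keys.
# This only illustrates why the encoding clusters rows that are close in
# both dimensions; it is not Iceberg's actual rewrite implementation.

def z_value(x: int, y: int, bits: int = 32) -> int:
    """Interleave the bits of x and y: x fills even bit positions, y fills
    odd ones, so nearby (x, y) pairs map to nearby Z-values."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

rows = [(5, 9), (5, 10), (200, 3), (6, 9)]
# Sorting by Z-value groups (5, 9), (6, 9) and (5, 10) together,
# while the distant (200, 3) sorts far away from them.
print(sorted(rows, key=lambda r: z_value(*r)))
```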
4. Multimodal Data Lake for the AI Era
Volcano Engine’s multimodal data‑lake solution combines three core components:
LAS AI: a library of ready‑to‑use operators for preprocessing text, images, video, and vector data.
LAS Ray: a scheduler that dispatches heterogeneous compute resources (CPU, GPU, FPGA) to the appropriate operators.
LAS Lance format: a columnar storage layout that natively supports primary‑key indexes for point queries and vector indexes for approximate nearest‑neighbor search, enabling fast random access during model training.
The Lance format integrates with the ByteHouse engine to provide hybrid SQL‑plus‑vector query capabilities. Demonstrated scenarios include large‑scale model pre‑training, fine‑tuning, enterprise AI search, and video data mining, where the indexed storage reduces I/O latency and accelerates training pipelines.
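Below is a minimal sketch of the access pattern this enables, a point query by primary key plus an approximate nearest‑neighbour vector search, assuming the open‑source pylance and pyarrow Python packages. The dataset path, column names, row counts, and index parameters are illustrative assumptions and are not tied to the LAS service or the ByteHouse integration.

```python
# Illustrative sketch using the open-source Lance Python bindings (pylance).
# Paths, column names, and index parameters are assumptions for illustration.
import lance
import numpy as np
import pyarrow as pa

dim, rows = 128, 10_000
table = pa.table({
    "id": pa.array(range(rows)),                      # primary-key column
    "vector": pa.array(
        [np.random.rand(dim).astype(np.float32).tolist() for _ in range(rows)],
        type=pa.list_(pa.float32(), dim),             # fixed-size vector column
    ),
})

# Write the columnar dataset, then build an ANN index on the vector column.
lance.write_dataset(table, "/tmp/demo.lance", mode="overwrite")
ds = lance.dataset("/tmp/demo.lance")
ds.create_index("vector", index_type="IVF_PQ", num_partitions=64, num_sub_vectors=16)
ds = lance.dataset("/tmp/demo.lance")                 # reopen to pick up the index

# Point query by primary key (fast random access during training)
row = ds.to_table(filter="id = 42")

# Approximate nearest-neighbour query over the vector column
query = np.random.rand(dim).astype(np.float32)
hits = ds.to_table(nearest={"column": "vector", "q": query, "k": 10})
print(row.num_rows, hits.num_rows)
```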
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
