
Exploring Tencent Cloud’s Iceberg Batch‑Stream Integration and AI‑Driven Data Governance

This article presents six technical case studies—including Tencent Cloud’s Iceberg‑based batch‑stream integration, AI‑driven data governance with Apache Gravitino, Xiaohongshu’s lakehouse evolution, and a multimodal data‑lake solution—detailing challenges, architectural designs, implementation steps, performance results, and future directions.

DataFunTalk

1. Batch‑Stream Unified Processing with Apache Iceberg

Tencent Cloud extends the Apache Iceberg table format (TC‑Iceberg) to support low‑latency updates and deletes in a lake‑warehouse architecture. The extension introduces a dual‑store model:

Base store holds immutable snapshot data.

Change store records incremental inserts, updates, and deletes.

Read operations perform merge‑on‑read by joining base and change stores, while a background auto‑compaction job periodically merges change files into the base to limit read‑write amplification.

An automatic bucketing mechanism hashes the primary key to a configurable number of buckets, localising merge ranges and enabling parallel distributed merges with minimal data shuffling.
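The bucketing step can be sketched as follows (a simplified illustration, not TC‑Iceberg’s actual code; the function name and the MD5 hash choice are assumptions):

```python
import hashlib

def bucket_for(primary_key, bucket_count=256):
    """Stable hash of the primary key -> bucket id, so that base and
    change files for the same key always land in the same bucket and
    can be merged in parallel with no cross-bucket shuffling."""
    digest = hashlib.md5(str(primary_key).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % bucket_count
```

Because the mapping is deterministic, any engine reading or compacting a bucket only ever needs that bucket’s base and change files.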

Key parameters:

bucket_count = 256   # example value
compaction_interval = 6h   # trigger auto‑compaction every 6 hours
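To make the read path concrete, here is a minimal in‑memory sketch of merge‑on‑read and compaction (hypothetical helper functions; the real implementation operates on snapshot and change files, not Python dicts):

```python
def merge_on_read(base_rows, change_rows):
    """Apply change-store records over the base-store snapshot.

    base_rows:   {primary_key: row}
    change_rows: ordered list of ("upsert"|"delete", primary_key, row_or_None)
    """
    merged = dict(base_rows)
    for op, key, row in change_rows:
        if op == "upsert":
            merged[key] = row          # insert or update
        elif op == "delete":
            merged.pop(key, None)      # delete if present
    return merged

def compact(base_rows, change_rows):
    """Auto-compaction: fold the change store into a new base snapshot,
    leaving an empty change store to bound read amplification."""
    return merge_on_read(base_rows, change_rows), []
```

A reader that runs between compactions pays the merge cost at query time; compaction shifts that cost to a background job.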

Benchmarks show sub‑second query latency for point‑lookup workloads and stable throughput for high‑frequency streaming ingest.

2. AI‑Driven Governance‑as‑a‑Service for Gaming Data

The solution automates metadata extraction, policy enforcement, and compliance checks across heterogeneous game data sources. Core components include:

Real‑time schema detection using lightweight ML classifiers.

Policy templates expressed in JSON‑Logic, applied via a rule engine.

Continuous audit trails stored in an immutable log for downstream risk analysis.

Integration points expose RESTful APIs and Kafka connectors for seamless ingestion into existing data pipelines.
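As an illustration of the policy‑template idea, a toy JSON‑Logic evaluator might look like this (a simplified sketch covering only a few operators; the production rule engine is assumed to support the full JSON‑Logic specification):

```python
def evaluate(rule, data):
    """Evaluate a small subset of JSON-Logic against a data record."""
    if not isinstance(rule, dict):
        return rule                      # literal value
    op, args = next(iter(rule.items()))
    if op == "var":
        return data.get(args)            # {"var": "field_name"}
    args = [evaluate(a, data) for a in args]
    if op == "==":
        return args[0] == args[1]
    if op == ">":
        return args[0] > args[1]
    if op == "and":
        return all(args)
    raise ValueError(f"unsupported JSON-Logic operator: {op}")
```

A governance policy such as “EU data must be retained longer than 30 days” is then just a JSON document, versionable and auditable like any other artifact.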

3. Unified Metadata and Lineage with Apache Gravitino

Gravitino provides a centralized catalog that abstracts tables, views, and files across multi‑cloud storage systems. To achieve end‑to‑end lineage:

Gravitino’s metadata model is extended with OpenLineage Facet objects that carry custom attributes (e.g., column‑level provenance).

Each processing engine (Spark, Flink, Trino, etc.) registers a LineageCollector plugin that emits OpenLineage events enriched with Gravitino facets.

Collected events are stored in a dedicated lineage_events table; queries can reconstruct field‑level data flow across engines.

Example plugin registration (Spark):

spark.conf.set("spark.openlineage.facets.enabled", "true")
spark.conf.set("spark.openlineage.gravitino.catalog", "gravitino_catalog")

The architecture supports incremental lineage capture, reducing overhead to less than 5% of job execution time.
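For illustration, an emitted event carrying a custom Gravitino facet might be shaped roughly like this (field and facet names are hypothetical, not the exact OpenLineage/Gravitino schema):

```python
def make_lineage_event(job_name, inputs, outputs, column_lineage):
    """Build a simplified OpenLineage-style run event whose output
    datasets carry a custom column-provenance facet."""
    return {
        "eventType": "COMPLETE",
        "job": {"namespace": "spark", "name": job_name},
        "inputs": [{"name": t} for t in inputs],
        "outputs": [
            {
                "name": t,
                # Custom facet: output column -> contributing input columns
                "facets": {"gravitinoColumnLineage": column_lineage},
            }
            for t in outputs
        ],
    }
```

Rows in the `lineage_events` table with this shape can be joined on dataset names to walk field‑level provenance across engines.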

4. Lakehouse Evolution at Xiaohongshu

Xiaohongshu migrated from a Lambda architecture to a unified incremental computation model built on Iceberg tables and a custom incremental engine. Technical highlights:

Z‑Order clustering on high‑cardinality columns to co‑locate related rows.

Smart indexing that builds Bloom filters per data block, cutting scan volume by ~10×.

Incremental engine consumes change logs, writes to the change_store, and triggers merge‑on‑read queries.
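The Z‑Order idea behind the first highlight can be illustrated with classic bit interleaving (a textbook Morton‑code sketch, not Xiaohongshu’s engine code):

```python
def z_order_key(x, y, bits=16):
    """Interleave the bits of two column values so rows that are close
    in both dimensions get nearby sort keys (Morton / Z-order curve).
    Sorting files by this key co-locates related rows, which is what
    lets block-level indexes (e.g. Bloom filters) prune scans."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # x occupies even bit positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # y occupies odd bit positions
    return key
```

Extending the interleave to more columns generalises the same locality property to multi‑column predicates.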

Performance results (P90 latency):

Legacy Lambda: ~30 s

Lakehouse with Z‑Order & smart index: ~5 s

Use cases include real‑time recommendation feeds and e‑commerce transaction analytics.

5. Enterprise‑Grade AI‑Native Multimodal Data Platform

Volcano Engine’s multimodal data lake combines three core components:

LAS AI: a library of ready‑to‑use operators for text, image, video, and vector preprocessing.

LAS Ray: a scheduler that dispatches heterogeneous compute tasks (CPU, GPU, TPU) across a Kubernetes cluster.

LAS Lance format: columnar storage with native primary‑key and vector indexes, enabling O(1) random access for embedding vectors.

ByteHouse is integrated as a hybrid query engine, allowing SQL over both structured columns and vector similarity search.

Key configuration example for vector index creation:

CREATE TABLE images (
  id BIGINT PRIMARY KEY,
  embedding VECTOR(768)
) USING LANCE;

ALTER TABLE images ADD VECTOR INDEX ON embedding;
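To show what the vector index accelerates, here is a brute‑force nearest‑neighbour baseline (an illustrative sketch; Lance’s native index replaces this full scan with sub‑linear lookup):

```python
import math

def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_k(query, rows, k=2):
    """Brute-force top-k by similarity; O(n) over all embeddings,
    which is exactly the scan a vector index avoids."""
    scored = sorted(rows.items(),
                    key=lambda kv: cosine(query, kv[1]),
                    reverse=True)
    return [row_id for row_id, _ in scored[:k]]
```

In the hybrid setup described above, the SQL engine applies structured filters first, then the vector index ranks the surviving rows.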

Typical workloads:

Pre‑training large multimodal models (data ingestion > 10 TB/day).

Fine‑tuning with domain‑specific image‑text pairs.

Enterprise AI search combining keyword and semantic similarity.

Video frame extraction and feature indexing for downstream analytics.

6. Large Language Model Interaction with Databases

Beyond simple query generation, modern LLMs can act as data managers:

Generate DDL/DML statements based on natural‑language intent.

Validate schema compatibility and suggest migrations.

Perform automated data quality checks and trigger remediation workflows.

Embedding‑based retrieval is used to map user queries to relevant tables, reducing hallucination risk.
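A toy version of such retrieval, scoring tables by token overlap as a stand‑in for embedding similarity (table names, docs, and scoring are illustrative only):

```python
def retrieve_tables(query, table_docs, k=1):
    """Map a natural-language query to the most relevant tables.
    Here: naive token overlap; a real system would embed the query
    and table descriptions and rank by vector similarity."""
    q_tokens = set(query.lower().split())
    scored = sorted(
        table_docs.items(),
        key=lambda kv: len(q_tokens & set(kv[1].lower().split())),
        reverse=True,
    )
    return [name for name, _ in scored[:k]]
```

Grounding the LLM on only the retrieved tables’ schemas, rather than the whole catalog, is what shrinks the space for hallucinated column or table names.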

Tags: big data, AI, metadata, multimodal, data lake, Iceberg
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
