
7 Cutting-Edge Data Engineering Practices Shaping AI-Driven Data Lakes

This article collection showcases seven advanced data engineering solutions—from Tencent Cloud's Iceberg batch‑stream integration and Apache Gravitino metadata lineage to Xiaohongshu's Lakehouse evolution and multimodal AI data lake implementations—highlighting architectural innovations, performance optimizations, and real‑world deployment insights for modern big‑data platforms.

DataFunSummit

The following seven technical articles present state‑of‑the‑art practices in data engineering, data governance, and AI‑driven multimodal data platforms, offering concrete designs, performance results, and future directions for building scalable big‑data solutions.

1. Tencent Cloud Batch‑Stream Integration with Iceberg

Based on years of experience with the Apache Iceberg project, Tencent Cloud proposes a TC-Iceberg extension that introduces a dual-store architecture: a base store and a change store. This design enables efficient real-time updates and deletes while balancing read-write amplification through merge-on-read and automatic compaction. An automatic bucketing mechanism hashes primary keys to localize merge ranges, dramatically improving the efficiency of distributed merge tasks. The article also details business-level case studies, the intelligent storage service architecture, and future plans for sub-second latency and materialized view support.
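The bucketing idea can be sketched in a few lines. This is a conceptual illustration, not TC-Iceberg code: the function names, the bucket count, and the record shape are all assumptions made for the example.

```python
import hashlib

NUM_BUCKETS = 16  # assumed bucket count; in practice this is table-configurable


def bucket_of(primary_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a primary key to a stable bucket via hashing."""
    digest = hashlib.md5(primary_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def plan_merge_tasks(change_records):
    """Group change-store records by bucket so that each distributed merge
    task only has to read and rewrite the base-store files of its own
    bucket, instead of scanning the whole table."""
    tasks = {}
    for rec in change_records:
        tasks.setdefault(bucket_of(rec["pk"]), []).append(rec)
    return tasks


changes = [{"pk": f"user-{i}", "op": "update"} for i in range(100)]
tasks = plan_merge_tasks(changes)  # each task's merge range is one bucket
```

Because the same key always hashes to the same bucket, updates and deletes for a given row land next to the base data they must merge with, which is what keeps merge-on-read amplification bounded.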

2. Governance as a Service: AI‑Driven Game Data Governance

This piece explores how Tencent Games leverages AI to transform data governance, addressing data silos and heterogeneous sources by treating governance as a service layer that automates quality checks, lineage tracking, and policy enforcement across game analytics pipelines.

3. Apache Gravitino Unified Metadata and Lineage

In the context of multi‑cloud and AI acceleration, the article identifies challenges such as data islands, diverse data sources, and missing metadata. It proposes a solution built on Apache Gravitino that unifies metadata management and lineage tracing. By integrating the OpenLineage collection framework with Gravitino’s unified metadata model, the approach achieves cross‑engine lineage mapping and field‑level tracing. The Facet extension mechanism is explained for propagating metadata within lineage events, and implementation details for multi‑engine lineage collection are provided alongside a roadmap for community development.
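To make the Facet mechanism concrete, here is a minimal sketch of an OpenLineage-style run event carrying a custom facet on its output dataset. The facet name, field mapping, and URLs are illustrative assumptions; the real Gravitino integration defines its own facet schemas.

```python
import uuid
from datetime import datetime, timezone


def make_lineage_event(job_name: str, in_table: str, out_table: str) -> dict:
    """Build a minimal OpenLineage-style run event. Custom metadata rides
    along inside the event as a facet: a named JSON object with
    `_producer` and `_schemaURL` keys identifying who emitted it."""
    return {
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": "spark", "name": job_name},
        "inputs": [{"namespace": "warehouse", "name": in_table}],
        "outputs": [{
            "namespace": "warehouse",
            "name": out_table,
            "facets": {
                # hypothetical custom facet: field-level lineage mapping
                "columnMapping": {
                    "_producer": "https://example.com/lineage-agent",
                    "_schemaURL": "https://example.com/schemas/ColumnMappingFacet.json",
                    "fields": {"out.total": ["in.price", "in.qty"]},
                },
            },
        }],
    }


event = make_lineage_event("daily_sales_agg", "db.sales", "db.sales_agg")
```

A collector can then resolve the dataset names in such events against a unified metadata model to stitch per-engine events into cross-engine, field-level lineage.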

4. Evolution of Xiaohongshu’s Data Architecture in the Big AI Data Era

The article describes how Xiaohongshu replaced its Lambda architecture with a unified incremental computation model to tackle high-complexity data pipelines, resource costs, and latency. Leveraging a Lakehouse design based on Iceberg for storage and an incremental compute engine, the solution merges batch and stream processing. Key techniques such as Z-Order sorting and intelligent indexing reduced query scan volume tenfold, and performance benchmarks show P90 query latency improved to 5 seconds. Various business scenarios, including community and e-commerce, are examined.
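Z-Order sorting works by interleaving the bits of several column values into one sort key, so that rows close in any of those columns end up close on disk. A minimal sketch for two integer columns (a simplification of what an engine applies per data file):

```python
def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the bits of two column values into a single Z-order
    (Morton) key. Sorting rows by this key clusters rows that are close
    in BOTH columns, so a predicate on either column touches far fewer
    files and the query's scan range shrinks."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # even bit positions: x
        key |= ((y >> i) & 1) << (2 * i + 1)  # odd bit positions: y
    return key


rows = [(3, 7), (3, 6), (200, 5), (2, 7)]
rows.sort(key=lambda r: z_order_key(*r))  # layout order used when writing files
```

Min/max statistics per file then let the engine skip whole files whose Z-range cannot match the filter, which is where the scan-volume reduction comes from.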

5. Rebuilding Data Foundations: Data+AI Multimodal Data Lake

Addressing the surge in unstructured data and multimodal AI demands, the article introduces Volcano Engine’s multimodal data lake solution. Core components include LAS AI (ready‑to‑use data operators), LAS Ray (heterogeneous compute scheduling), and LAS Lance (storage format optimized for multimodal data). LAS Lance supports native primary‑key and vector indexes for fast random access, accelerating model training. The architecture integrates ByteHouse for hybrid queries and demonstrates applications in model pre‑training, fine‑tuning, AI search, and video data mining.
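The access pattern that makes such a storage format attractive for training can be shown with a toy in-memory stand-in. This is not the Lance API; class and method names are invented purely to illustrate the two operations the format accelerates natively: primary-key random access and vector similarity search.

```python
import math


class TinyMultimodalIndex:
    """Conceptual stand-in for a multimodal table with a primary-key index
    and a vector column (invented names, not the real Lance API)."""

    def __init__(self):
        self.rows = []      # (pk, vector, payload) tuples
        self.pk_index = {}  # primary key -> row offset for O(1) random access

    def append(self, pk, vector, payload):
        self.pk_index[pk] = len(self.rows)
        self.rows.append((pk, vector, payload))

    def get(self, pk):
        """Random access by primary key, e.g. fetching one training sample."""
        return self.rows[self.pk_index[pk]]

    def nearest(self, query, k=2):
        """Brute-force similarity search; a real vector index (IVF, HNSW)
        avoids scanning every row."""
        return sorted(self.rows, key=lambda r: math.dist(r[1], query))[:k]


idx = TinyMultimodalIndex()
idx.append("img-1", [0.1, 0.9], {"caption": "cat"})
idx.append("img-2", [0.8, 0.2], {"caption": "car"})
sample = idx.get("img-2")          # point lookup for a training batch
hits = idx.nearest([0.9, 0.1], 1)  # similarity search for AI search / dedup
```

During model training, this combination lets a data loader fetch arbitrary samples by key without full scans, while the vector index serves retrieval workloads over the same data.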

6. Data+AI Multimodal Data Lake in Practice

This segment provides a hands‑on walkthrough of deploying the multimodal data lake, covering cluster setup, data ingestion pipelines, and performance tuning tips, with code snippets and repository links for reference.

7. Large Model and Database Interaction: From Data Consumers to Data Managers

The final article examines the shift in how large language models interact with databases, advocating for a management‑centric approach that embeds lineage, versioning, and governance directly into database services to support AI workloads.

Tags: multimodal AI, data lake, metadata management, Apache Iceberg, Apache Gravitino, lakehouse, batch-stream integration
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
