Tencent Real-Time Lakehouse Intelligent Optimization Practice
This article summarizes the DataFunSummit talk "Tencent Real-Time Lakehouse Intelligent Optimization Practice," delivered by senior engineer Chen Liang. It details Tencent's real-time lakehouse architecture across four sections—lakehouse architecture, intelligent optimization services, scenario-driven capabilities, and a summary with future outlook—covering components such as Spark, Flink, Iceberg, the Auto-Optimize Service, indexing, clustering, AutoEngine, and PyIceberg.
Lakehouse Architecture: The architecture consists of three layers. The compute layer uses Spark for batch ETL, Flink for near-real-time streaming, and StarRocks/Presto for ad-hoc OLAP queries. The management layer centers on Iceberg, exposing simple APIs and an Auto-Optimize Service that improves query performance and reduces storage costs. The storage layer is built on HDFS and Tencent Cloud Object Storage (COS), with Alluxio providing a unified cache layer.
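To make the layering concrete, the following is a minimal configuration sketch of how a Spark compute layer might be wired to an Iceberg catalog over HDFS. The catalog name, warehouse path, and table name are illustrative assumptions, not Tencent's actual configuration:

```python
# Sketch: registering an Iceberg catalog in Spark. The catalog name "lake",
# the warehouse URI, and the table are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("lakehouse-demo")
    # Load Iceberg's Spark extensions (MERGE INTO, CALL procedures, etc.)
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # Register a Hadoop-backed Iceberg catalog named "lake"
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "hdfs:///warehouse/iceberg")
    .getOrCreate()
)

spark.sql(
    "CREATE TABLE IF NOT EXISTS lake.db.events (id BIGINT, ts TIMESTAMP) USING iceberg"
)
```

Batch (Spark), streaming (Flink), and OLAP (StarRocks/Presto) engines can all point at the same catalog, which is what makes the management layer the single source of truth.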
Intelligent Optimization Service: This service comprises six modules—Compaction Service (small-file merging), Expiration Service, Cleaning Service, Clustering Service (data redistribution), Index Service (secondary indexing), and Auto Engine Service (engine-aware partition heating). Each module targets specific performance or cost-efficiency challenges.
Compaction Service: Small-file merging is performed in read and write phases. The Parquet storage model (RowGroup, Column Chunk, Page, Footer) is leveraged to apply RowGroup-level or Page-level copy strategies, achieving more than a five-fold reduction in merge time and resource usage. Additional optimizations include Delete-File merging using a Left Anti Join with Bloom-index acceleration, as well as incremental rewrite based on modification timestamps.
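The delete-file merge can be pictured as a left anti join: data rows survive compaction only if no position delete references them. A toy illustration in plain Python (the file names and rows are made up; the real implementation runs as a Spark join):

```python
# Toy model of Iceberg position deletes: each delete names a (file, row-position)
# pair. Compaction applies a left anti join: keep only the data rows whose
# (file, pos) key has no match in the delete set.
data_rows = [
    ("data-001.parquet", 0, "alice"),
    ("data-001.parquet", 1, "bob"),
    ("data-002.parquet", 0, "carol"),
]
position_deletes = {("data-001.parquet", 1)}  # row "bob" has been deleted

def left_anti_join(rows, deletes):
    """Return rows whose (file, pos) key does not appear in the delete set."""
    return [r for r in rows if (r[0], r[1]) not in deletes]

compacted = left_anti_join(data_rows, position_deletes)
print([r[2] for r in compacted])  # -> ['alice', 'carol']
```

A Bloom filter over the delete keys serves the same role as the set lookup here: it cheaply rules out data rows that cannot possibly be deleted before the expensive join runs.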
Index Service: Iceberg's built-in min-max index is extended with a secondary index to improve data skipping. A metrics framework records scan and filter events, enabling an intelligent recommendation engine that suggests appropriate indexes based on query frequency, cardinality, and other heuristics. The end-to-end workflow covers SQL extraction, coarse filtering, index construction, dual-run evaluation, and user-facing recommendations.
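The recommendation step can be sketched as a simple rule over the collected metrics. The thresholds and index names below are illustrative assumptions, not Tencent's actual heuristics:

```python
# Sketch of an index-recommendation heuristic (illustrative thresholds):
# recommend a secondary (bloom-filter-style) index for frequently filtered,
# high-cardinality columns, where min-max pruning is least effective.
def recommend_index(filter_count: int, total_queries: int,
                    distinct_ratio: float) -> str:
    """distinct_ratio = approximate distinct values / row count (cardinality)."""
    filter_freq = filter_count / total_queries
    if filter_freq < 0.1:
        return "none"          # rarely filtered: an index is not worth its cost
    if distinct_ratio > 0.5:
        return "bloom-filter"  # near-unique values: point lookups benefit most
    return "min-max"           # low cardinality: built-in stats already suffice

# A column filtered in 80% of queries with near-unique values:
print(recommend_index(filter_count=80, total_queries=100, distinct_ratio=0.9))
# -> bloom-filter
```

The dual-run evaluation mentioned above would then validate such a suggestion by executing the workload with and without the candidate index before surfacing it to users.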
Clustering Service: To overcome the coarse granularity of min-max indexes, data is repartitioned using Z-order. Columns are digitized, range IDs are computed, and a Z-Value is generated for interleaved sorting, resulting in more than a four-fold improvement in query speed.
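The digitize-then-interleave step can be shown in a few lines. This is a minimal sketch of bit interleaving over already-digitized range IDs (real implementations also handle the digitization of arbitrary column types):

```python
# Minimal Z-order sketch: given each column's integer range ID, interleave
# their bits to form the Z-value used as the sort key. Rows that are close
# in every column end up close in the sorted order, tightening min-max ranges.
def z_value(range_ids, bits=8):
    """Interleave bits: bit i of column j lands at position i * ncols + j."""
    z = 0
    for i in range(bits):
        for j, rid in enumerate(range_ids):
            z |= ((rid >> i) & 1) << (i * len(range_ids) + j)
    return z

points = [(3, 5), (0, 0), (7, 1), (2, 2)]
ordered = sorted(points, key=z_value)
print(ordered)  # -> [(0, 0), (2, 2), (7, 1), (3, 5)]
```

Sorting by a single linearized key like this is what lets one physical layout serve range filters on several columns at once, instead of favoring only the leading sort column.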
AutoEngine Service: By listening to OLAP engine events, the service identifies hot partitions, pre-heats them, and routes them to StarRocks. The upper-layer engine discovers the accelerated partitions through metadata, so queries are transparently served by whichever storage and compute engine fits best.
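The heating decision reduces to tracking access counts per partition and routing above a threshold. A toy sketch with made-up partition names and an arbitrary threshold:

```python
# Sketch of event-driven partition heating: count scan events per partition
# and route hot partitions to the OLAP engine (StarRocks); cold partitions
# continue to be served from the Iceberg lake. Threshold is illustrative.
from collections import Counter

HOT_THRESHOLD = 3
scan_events = ["dt=2024-06-01", "dt=2024-06-01", "dt=2024-06-01",
               "dt=2024-06-01", "dt=2024-05-30"]

heat = Counter(scan_events)  # partition -> number of recent scans

def route(partition: str) -> str:
    """Serve hot partitions from StarRocks, cold ones from the lake."""
    return "starrocks" if heat[partition] >= HOT_THRESHOLD else "iceberg"

print(route("dt=2024-06-01"), route("dt=2024-05-30"))  # -> starrocks iceberg
```

In practice the event stream would come from the engines' query logs, and heated partitions would be registered in metadata so the planner can pick them up automatically.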
Scenario-Driven Capabilities: Multi-stream join is realized by tagging Iceberg branches and asynchronously compacting them, enabling seamless merging of data from multiple MQ sources. A primary-key table approach introduces bucketed writes and column-family concepts for row-level updates. In-place migration tools convert existing Hive/Thive tables to Iceberg by generating new metadata without moving data files, supporting strict, append, and overwrite modes, while a new name-mapping mechanism enhances partition pruning. PyIceberg offers a JVM-free Python API for creating Pandas, TensorFlow, and PyTorch DataFrames directly on Iceberg tables, facilitating AI model training.
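The name-mapping mechanism used during in-place migration can be illustrated briefly. Hive Parquet files carry only column names, while Iceberg addresses columns by stable field IDs; a JSON mapping (stored as the Iceberg table property `schema.name-mapping.default`) binds the two. A simplified sketch with hypothetical column names:

```python
# Simplified name mapping for in-place Hive-to-Iceberg migration:
# old data files have no Iceberg field IDs, so a JSON mapping resolves
# each column name to a stable field ID at read time.
import json

hive_columns = ["user_id", "event_time", "city"]  # hypothetical schema

name_mapping = [{"field-id": i + 1, "names": [col]}
                for i, col in enumerate(hive_columns)]

def resolve(column_name, mapping):
    """Look up the Iceberg field ID for a column in a legacy Hive file."""
    for entry in mapping:
        if column_name in entry["names"]:
            return entry["field-id"]
    return None  # unmapped columns are treated as absent

print(json.dumps(name_mapping))
print(resolve("city", name_mapping))  # -> 3
```

Because the mapping lives in table metadata, the original data files never need rewriting, which is what makes the migration in-place across strict, append, and overwrite modes.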
Summary and Outlook: Future work will focus on extending the Auto-Optimize Service (cold-hot separation, materialized view acceleration, intelligent sensing, compaction refinement, transform UDFs and partition pruning), enhancing primary-key tables with deletion vectors, and exploring AI-centric lakehouse formats and distributed DataFrame integrations. The presenter thanks the audience for their attention.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.