Tencent Real-Time Lakehouse Architecture and Intelligent Optimization Practices
This article presents Tencent's real‑time lakehouse architecture, detailing its three‑layer design of compute, management, and storage, and explains the six modules of the Intelligent Optimization Service (Compaction, Index, Clustering, AutoEngine, scenario‑based capabilities, and PyIceberg), along with migration strategies and future optimization directions.
Tencent's big data platform adopts a lakehouse architecture that consists of three layers: data lake compute (Spark for batch ETL, Flink for near‑real‑time streaming, StarRocks and Presto for ad‑hoc OLAP queries), data lake management centered on Apache Iceberg with an Auto Optimize Service, and storage backed by HDFS, COS, and an Alluxio cache layer.
The Intelligent Optimization Service is composed of six modules:
Compaction Service: merges small files using row‑group‑level or page‑level copy strategies, cutting merge time and resource consumption more than fivefold, and adds delete‑file merging and incremental‑rewrite optimizations.
Index Service: extends Iceberg's min‑max index with secondary indexes, collects scan and filter metrics, and provides an end‑to‑end index recommendation workflow that reconstructs SQL, performs coarse filtering, builds indexes incrementally, and evaluates their effectiveness via dual‑run tasks.
Clustering Service: re‑partitions data with Z‑order to improve data skipping, achieving query performance gains of more than fourfold.
AutoEngine Service: captures hot‑partition events from OLAP engines and routes the corresponding data to StarRocks, enabling storage‑ and compute‑aware engine selection.
Scenario‑based Capabilities: supports multi‑stream joins via tag‑based branching and asynchronous compaction, primary‑key tables for row‑level updates with bucket rescaling and column‑family concepts, and in‑place migration of legacy Hive/Thive tables to Iceberg using metadata‑only operations.
PyIceberg: offers a JVM‑free Python client for Iceberg metadata, enabling seamless integration with Pandas, TensorFlow, PyTorch, and DuckDB for data analysis and AI model training.
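To make the Compaction Service's small-file merging concrete, the core planning step can be sketched as a bin-packing pass: pick the small files in a partition and group them into rewrite tasks of roughly a target output size. This is a minimal illustration, not Tencent's implementation; the 128 MB target and the file sizes are assumed values for the example.

```python
# Minimal sketch of small-file compaction planning: greedily pack data
# files into rewrite bins no larger than a target output size.
# The 128 MB target and the file sizes below are illustrative only.

TARGET_BYTES = 128 * 1024 * 1024  # assumed target output file size

def plan_compaction(file_sizes, target=TARGET_BYTES):
    """Greedy first-fit-decreasing: pack files into bins of at most `target`."""
    bins = []
    for size in sorted(file_sizes, reverse=True):
        for b in bins:
            if sum(b) + size <= target:
                b.append(size)
                break
        else:
            bins.append([size])
    # Only bins holding more than one file actually need a rewrite.
    return [b for b in bins if len(b) > 1]

small_files_mb = [8, 16, 40, 64, 100, 120]  # sizes in MB
plan = plan_compaction([s * 1024 * 1024 for s in small_files_mb])
```

Each resulting bin becomes one merge task, so the number of output files (and the rewrite cost) drops without ever producing a file above the target size.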
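The min‑max index that the Index Service builds on can also be sketched briefly: each data file carries per‑column min/max statistics, and a range predicate prunes every file whose interval cannot overlap the query's interval. The file names and statistics below are illustrative assumptions.

```python
# Minimal sketch of min-max data skipping: keep only the files whose
# per-column [min, max] interval can intersect the query predicate.
# File names and statistics are illustrative, not real table metadata.

from dataclasses import dataclass

@dataclass
class FileStats:
    path: str
    col_min: int
    col_max: int

def prune(files, lo, hi):
    """Return paths of files whose [col_min, col_max] intersects [lo, hi]."""
    return [f.path for f in files if f.col_max >= lo and f.col_min <= hi]

files = [
    FileStats("f1.parquet", 0, 99),
    FileStats("f2.parquet", 100, 199),
    FileStats("f3.parquet", 200, 299),
]
# A predicate like `WHERE col BETWEEN 150 AND 250` only needs f2 and f3:
survivors = prune(files, 150, 250)
```

Secondary indexes extend exactly this idea: the more selectively each file's statistics bound its contents, the more files a scan can skip before reading any data.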
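The Z‑order layout used by the Clustering Service rests on a simple primitive: interleaving the bits of several column values into one Morton key, so that sorting by the key keeps multi‑dimensionally close rows in the same files. A minimal two‑column sketch, with assumed example points:

```python
# Minimal sketch of Z-order (Morton) encoding for two integer columns:
# interleave the bits of x and y into a single sort key. Sorting rows by
# this key clusters nearby (x, y) points together, which is what enables
# multi-column data skipping after a clustering rewrite.

def z_order_key(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x (even positions) and y (odd)."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key

points = [(3, 5), (2, 2), (7, 1), (0, 6)]   # illustrative (x, y) rows
clustered = sorted(points, key=lambda p: z_order_key(*p))
```

Unlike a plain sort on one column, the interleaved key gives every participating column some data‑skipping power, which is where the reported multi‑fold query gains come from.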
The presentation concludes with a roadmap focusing on further enhancements to the Auto Optimize Service (cold‑hot separation, materialized view acceleration, intelligent sensing, compaction refinement, and advanced partition pruning), primary‑key table optimizations (deletion vectors, predicate push‑down), and AI‑driven lakehouse formats and distributed DataFrame implementations.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert dedicated to sharing big data technology.
