How Tencent’s Cloud‑Native Lakehouse Tackles PB‑Scale Performance Challenges

This article analyzes Tencent Cloud’s DLC lakehouse solution, explaining the unified data lake‑warehouse architecture, the performance hurdles of object‑storage‑based analytics, and the multi‑dimensional caching, virtual‑cluster elasticity, and advanced filter techniques that enable analysis of petabyte‑scale data within seconds while reducing costs.

Tencent Cloud Developer

Jan 3, 2023

In recent years, Tencent has deployed data‑lake capabilities across products such as WeChat Channels and Mini‑Programs, reaching petabyte (PB) to exabyte (EB) scales. Building on this, Tencent Cloud launched a cloud‑native lakehouse (DLC) to address the growing demand for unified storage and analytics.

Background and Definition of a Lakehouse

A lakehouse combines the flexibility of a data lake with the structured performance of a data warehouse, unifying storage for semi‑structured, unstructured, and structured data. Object storage in the public cloud offers massive capacity, low cost, high SLA, and reliability, making it an attractive alternative to self‑managed HDFS at roughly one‑tenth of the storage cost (a cost ratio of about 1:10).

Key Challenges of Cloud‑Native Lakehouse Architecture

Performance loss of over 30% due to compute‑storage separation when using object storage.

Elastic scaling conflicts with analytical workloads: container startup can take 1–2 minutes, and client pre‑warming adds further latency.

Data skew and the need for agile, real‑time analysis of massive detail data.

What DLC Offers

DLC is a fully managed, multi‑tenant service that provides out‑of‑the‑box metadata, security, Spark DDL Server, Spark History, and other components, many of which are free or have generous free quotas. It acts as a glue layer for Tencent’s cloud data‑lake ecosystem, delivering low‑cost, low‑maintenance analytics.

Architecture Principles

The solution is container‑based and runs on Kubernetes (K8s). It leverages cloud‑native managed services such as COS, TDSQL, and Cloud Kafka, minimizing the need for additional services and keeping the stack simple (KISS) and cloud‑native.

Performance‑Boosting Techniques

1) Multi‑Dimensional Cache

File Cache: Alluxio Local Cache eliminates the need for a separate Alluxio cluster, providing a cache without extra compute resources. It can improve COS read performance by 3–10×, though it requires careful handling of cache consistency, security, and elasticity.
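The idea of a cache that lives inside the compute nodes can be illustrated with a minimal read‑through sketch. This is not Alluxio's API or DLC's implementation: the class name `LocalReadCache`, the page‑based keying, and the `fetch_page` callback (standing in for a remote COS read) are all assumptions made for illustration.

```python
from collections import OrderedDict
from typing import Callable, Tuple


class LocalReadCache:
    """Read-through LRU cache keyed by (object path, page index)."""

    def __init__(self, fetch_page: Callable[[str, int], bytes], capacity_pages: int):
        self._fetch_page = fetch_page        # hypothetical remote read, e.g. an object-storage range read
        self._capacity = capacity_pages
        self._pages: "OrderedDict[Tuple[str, int], bytes]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def read_page(self, path: str, page: int) -> bytes:
        key = (path, page)
        if key in self._pages:
            self._pages.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self._pages[key]
        self.misses += 1
        data = self._fetch_page(path, page)  # slow path: go to object storage
        self._pages[key] = data
        if len(self._pages) > self._capacity:
            self._pages.popitem(last=False)  # evict the least recently used page
        return data

    def invalidate(self, path: str) -> None:
        """Drop every cached page of an object, e.g. after it is overwritten."""
        for key in [k for k in self._pages if k[0] == path]:
            del self._pages[key]
```

The `invalidate` method hints at the consistency concern mentioned above: a local cache has to notice when the underlying object changes, otherwise it serves stale pages.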

Fragment Result Cache: Caches intermediate (fragment‑level) query results as queries run, so no pre‑computation is needed. It can offer up to a 10× speedup (one example is RaptorX from the Presto community) while reducing pressure on storage.
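A rough sketch of how such a cache might be keyed is shown below. It assumes that a cached entry is identified by the text of a plan fragment plus the identity of the file split it ran on; the names `fragment_key` and `FragmentResultCache` are hypothetical and do not reflect Presto's or DLC's actual classes.

```python
import hashlib
from typing import Callable, Dict, Tuple

# (path, start offset, length, last-modified timestamp); an assumed split identity
Split = Tuple[str, int, int, int]


def fragment_key(plan_fragment: str, split: Split) -> str:
    """Hash the canonical plan text together with the split identity."""
    raw = plan_fragment + "|" + "|".join(str(part) for part in split)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class FragmentResultCache:
    def __init__(self) -> None:
        self._results: Dict[str, object] = {}

    def get_or_compute(self, plan_fragment: str, split: Split,
                       compute: Callable[[], object]) -> object:
        key = fragment_key(plan_fragment, split)
        if key not in self._results:
            self._results[key] = compute()  # run the fragment on this split only once
        return self._results[key]
```

Because the last‑modified timestamp is part of the key in this sketch, a rewritten file produces a different key and the stale entry is simply never hit again.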

2) Virtual Cluster Elastic Model

Sub‑clusters serve as the smallest scaling unit, providing stable topology, pre‑warmed resources, and query isolation. This design shortens container startup time and enables rapid scaling while maintaining query performance.
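One way to picture the model is a scheduler that only ever adds or removes whole sub‑clusters and pins each query to a single sub‑cluster. The sketch below is an illustration under assumptions: the sizing rule (a fixed number of concurrent queries per sub‑cluster) and all class and parameter names are invented here, not taken from DLC.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class SubCluster:
    name: str
    running_queries: List[str] = field(default_factory=list)


class VirtualCluster:
    def __init__(self, queries_per_sub_cluster: int,
                 min_sub_clusters: int, max_sub_clusters: int):
        self.queries_per_sub_cluster = queries_per_sub_cluster
        self.min_sub_clusters = min_sub_clusters
        self.max_sub_clusters = max_sub_clusters
        # The minimum set is kept running, i.e. pre-warmed.
        self.sub_clusters = [SubCluster(f"sub-{i}") for i in range(min_sub_clusters)]

    def target_size(self, concurrent_queries: int) -> int:
        wanted = math.ceil(concurrent_queries / self.queries_per_sub_cluster)
        return max(self.min_sub_clusters, min(self.max_sub_clusters, wanted))

    def resize(self, concurrent_queries: int) -> None:
        target = self.target_size(concurrent_queries)
        while len(self.sub_clusters) < target:
            self.sub_clusters.append(SubCluster(f"sub-{len(self.sub_clusters)}"))
        # Scale in only idle sub-clusters so running queries are not disturbed.
        while len(self.sub_clusters) > target and not self.sub_clusters[-1].running_queries:
            self.sub_clusters.pop()

    def submit(self, query_id: str) -> str:
        # A query runs entirely inside one sub-cluster, which gives isolation.
        chosen = min(self.sub_clusters, key=lambda sc: len(sc.running_queries))
        chosen.running_queries.append(query_id)
        return chosen.name
```

Scaling in whole sub‑clusters keeps the topology inside each unit fixed, which is why a query placed on one of them sees stable performance while the overall cluster grows or shrinks.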

3) Multi‑Dimensional Filter

Column pruning on Parquet/ORC files ensures only required columns are scanned.

Partition pruning, including dynamic partition pruning, reduces unnecessary I/O.

Sparse indexes (Bloom filter, Z‑ordering) combined with engine‑level predicate push‑down can achieve >10× performance gains and lower storage costs.
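To make the file‑skipping idea concrete, the following sketch combines partition pruning with a per‑file Bloom filter. It is a toy model: the `BloomFilter` implementation, the `DataFile` record, and the column names are assumptions for illustration, and real engines read such statistics from table or file metadata rather than from Python objects.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, List


class BloomFilter:
    """Tiny Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value: str) -> Iterable[int]:
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value: str) -> None:
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value: str) -> bool:
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


@dataclass
class DataFile:
    path: str
    partition_date: str      # hypothetical partition column
    user_ids: BloomFilter    # hypothetical per-file index on a user_id column


def files_to_scan(files: List[DataFile], date: str, user_id: str) -> List[str]:
    """Keep files in the requested partition whose Bloom filter may contain the key."""
    return [
        f.path
        for f in files
        if f.partition_date == date            # partition pruning
        and f.user_ids.might_contain(user_id)  # Bloom-filter file skipping
    ]
```

A file that survives both checks still has to be read, and only the columns the query references need to be decoded. The Bloom filter can let through a file that does not contain the key, but it never drops one that does.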

Optimized Spark Shuffle Manager

DLC adapts the open‑source remote shuffle service (RSS) Firestorm to write shuffle data to local disks first and spill to COS when needed, which keeps the stack simple and maintains high performance with minimal service overhead.
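The local‑first, spill‑to‑object‑storage policy can be sketched as a two‑tier store. This is only an illustration of the policy described above: `TieredShuffleStore` is a hypothetical name, and plain dictionaries stand in for the local disk and the COS bucket.

```python
from typing import Dict


class TieredShuffleStore:
    def __init__(self, local_capacity_bytes: int):
        self.local_capacity = local_capacity_bytes
        self.local_used = 0
        self.local: Dict[str, bytes] = {}   # stands in for local disk
        self.remote: Dict[str, bytes] = {}  # stands in for an object-storage bucket

    def write(self, block_id: str, data: bytes) -> str:
        """Store a shuffle block locally if it fits, otherwise spill it."""
        if self.local_used + len(data) <= self.local_capacity:
            self.local[block_id] = data
            self.local_used += len(data)
            return "local"
        self.remote[block_id] = data
        return "remote"

    def read(self, block_id: str) -> bytes:
        """Readers check the fast tier first and fall back to the spilled copy."""
        if block_id in self.local:
            return self.local[block_id]
        return self.remote[block_id]
```

In this sketch the common case stays on fast local storage, while large shuffles degrade to slower remote reads instead of failing when the disk fills up.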

Cost Advantages

Using COS instead of self‑managed HDFS saves over 80% of storage costs. Compared with fixed clusters, elastic resources (EKS/TKE) reduce expenses by more than 50%, especially for interactive analysis workloads.
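The arithmetic behind these figures can be sketched as follows. The functions and inputs are illustrative assumptions, not Tencent Cloud pricing: the model treats a node‑hour as costing the same on fixed and elastic resources and ignores request and network fees.

```python
def storage_saving(hdfs_unit_cost: float, cos_unit_cost: float) -> float:
    """Fraction of storage cost saved by moving from HDFS to object storage."""
    return 1 - cos_unit_cost / hdfs_unit_cost


def elastic_saving(peak_nodes: int, busy_hours_per_day: float,
                   idle_nodes: int = 0) -> float:
    """Fraction of node-hours saved versus a cluster sized for peak around the clock."""
    fixed_node_hours = peak_nodes * 24
    elastic_node_hours = (peak_nodes * busy_hours_per_day
                          + idle_nodes * (24 - busy_hours_per_day))
    return 1 - elastic_node_hours / fixed_node_hours
```

Under a 1:10 cost ratio, `storage_saving(10, 1)` comes to about 0.9, which is consistent with the "over 80%" figure. For a hypothetical interactive workload that is busy 8 hours a day and scales to zero otherwise, `elastic_saving(10, 8)` is about two thirds, which suggests why bursty workloads benefit most.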

Modeling Shift: Flat Lakehouse Architecture

Traditional layered warehouses require costly ETL and pre‑computation. The flat lakehouse eliminates these layers, allowing direct, high‑performance analysis on detail data and supporting real‑time incremental updates.

Real‑World Example: Gaming Client

Data streams from Kafka are ingested by DLC Spark and written to Iceberg detail tables with idempotent handling. DLC SuperSQL‑Spark then performs cleaning, small‑file merging, and sparse‑index construction. The resulting tables are queried with DLC SuperSQL‑Presto, which returns results within seconds and feeds BI tools without additional modeling.
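The idempotent handling in the ingestion step can be pictured as tracking the highest committed Kafka offset per partition, so that a replayed batch adds no duplicate rows. The sketch below is an in‑memory illustration with hypothetical names. A real pipeline would need to commit the rows and the offsets together in one atomic table commit, and how DLC does this is not described in the article.

```python
from typing import Dict, Iterable, List, Tuple

Record = Tuple[int, int, dict]  # (Kafka partition, offset, payload)


class IdempotentTableWriter:
    def __init__(self) -> None:
        self.rows: List[dict] = []                   # stands in for the detail table
        self.committed_offsets: Dict[int, int] = {}  # highest offset written per partition

    def write_batch(self, batch: Iterable[Record]) -> int:
        """Append only records newer than what is already committed."""
        written = 0
        for partition, offset, payload in sorted(batch, key=lambda r: (r[0], r[1])):
            if offset <= self.committed_offsets.get(partition, -1):
                continue  # already written by an earlier (possibly retried) batch
            self.rows.append(payload)
            self.committed_offsets[partition] = offset
            written += 1
        return written
```

Writing the same batch twice, for example after a job restart, returns 0 on the second call and leaves the table unchanged.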

Conclusion

DLC delivers a fully managed, cloud‑native lakehouse that unifies storage and compute. By combining multi‑dimensional caching, virtual‑cluster elasticity, and advanced filtering, it can analyze petabyte‑scale data within seconds while offering significant cost savings and operational simplicity.
