
Big Data Meets Cloud Native: Tencent's Cloud‑Native Big Data Architecture, Challenges, and Practices

This article explores how Tencent integrates big data with cloud‑native technologies, detailing the evolution, opportunities, and challenges involved, the FengLuan ("Peak Range") platform architecture, its engine and scheduling layers, mixed‑workload strategies, runtime optimizations, and future directions for large‑scale data platforms.

DataFunSummit

This talk covers the theme of big data meeting cloud native, focusing on Tencent's cloud‑native big data practice.

1. Opportunities and Challenges of Big Data Cloud‑Native

Big data has evolved through four stages: single‑node engines, distributed commercial engines, the Hadoop ecosystem, and now cloud‑native integration where Spark, Flink, and other engines adopt Kubernetes for resource scheduling. Current challenges include storage‑compute separation, shuffle mechanisms, and data integration.

Advantages of Cloud Native for Big Data

Cost savings through elastic cluster auto‑scaling and mixed offline/online resource usage.

Efficiency gains via declarative Kubernetes APIs, automated deployment, and shared monitoring/logging components.

Improved availability with automatic disaster recovery, resource isolation, and online upgrade capabilities.

Challenges of Cloud Native for Big Data

Elasticity and scheduling requirements differ from traditional online services.

Scale challenges, because Kubernetes' API‑server‑centric design limits cluster size and scheduling throughput.

Architectural challenges such as storage‑compute separation, mixed‑workload support, and runtime modifications.

2. Tencent Cloud‑Native Big Data Architecture

The FengLuan (Peak) platform consists of four layers:

Engine Adaptation Layer: cloud‑native modifications to the engines, Alluxio caching, remote‑shuffle support, observability, resource management, and automated policies.

Core Scheduling Layer: virtual‑cluster control, topology‑aware and gang scheduling, elastic resource pools, and support for both private and public clouds.

Mixed‑Workload Support Layer: scheduling‑aware resource isolation and containerized runtime enhancements such as hot migration and memory compression.

Infrastructure Layer: offline‑dedicated, shared, and other compute resources; HDFS‑based distributed storage; and COS object storage.

Engine Layer Details

Traditional Hadoop‑era engines (Spark, Flink, Presto) run on statically allocated YARN resources. In FengLuan, Spark and Flink run in Kubernetes‑native mode and Presto is deployed through an operator, enabling unified resource pools and better utilization.

The platform implements a Remote Shuffle Service (RSS) to overcome local‑disk limitations during shuffle: shuffle data can be stored in memory, on local disks, or in HDFS, and is pulled by reducers without relying on compute‑node disks.

Benchmarking shows RSS improves performance and reduces failure rates for large‑scale shuffle‑intensive jobs.
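As a toy illustration of the push‑based flow just described, the Python sketch below (all names hypothetical, with a plain in‑memory buffer standing in for the memory/disk/HDFS tiers) shows mappers pushing hash‑partitioned records to a shuffle server and reducers pulling whole partitions:

```python
from collections import defaultdict

class RemoteShuffleServer:
    """Toy stand-in for one remote shuffle service node: it buffers
    partitioned map output so reducers never read mapper-local disks.
    A real RSS would tier data across memory, local disk, and HDFS."""

    def __init__(self):
        self.partitions = defaultdict(list)  # partition id -> records

    def push(self, partition_id, records):
        self.partitions[partition_id].extend(records)

    def fetch(self, partition_id):
        # Reducers pull one consolidated stream per partition instead
        # of opening a connection to every mapper.
        return list(self.partitions[partition_id])


def run_shuffle(mapper_outputs, num_partitions, server):
    """Each mapper hash-partitions its (key, value) records and pushes
    every bucket to the shuffle server as it produces them."""
    for records in mapper_outputs:
        buckets = defaultdict(list)
        for key, value in records:
            buckets[hash(key) % num_partitions].append((key, value))
        for pid, recs in buckets.items():
            server.push(pid, recs)
```

The key property this models is that all records for a given key land on the same server‑side partition, so a reducer issues one fetch per partition rather than one per mapper.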

Scheduling Layer Details

FengLuan introduces a virtual‑cluster architecture with three tiers: tenant logical clusters (exposed to users), meta clusters (running controllers, proxies, syncers, and balancers), and physical clusters (the actual worker nodes). This design isolates workloads, balances resources across clusters, and reduces API‑server pressure.

Key components:

Syncer synchronizes Pods and related resources from virtual to physical clusters.

Proxy API Server aggregates node and Pod status from all physical clusters, enabling global scheduling decisions.
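A minimal sketch of what one reconcile pass of such a syncer might compute, assuming pods are represented as plain name‑to‑spec dicts (the real Syncer works against Kubernetes objects and watch streams):

```python
def sync_pods(virtual_pods, physical_pods):
    """One reconcile pass of a hypothetical Syncer: compute the changes
    needed to make the physical cluster's pod set match the tenant's
    virtual cluster. Both arguments map pod name -> pod spec."""
    # Pods declared in the virtual cluster but absent physically.
    to_create = {name: spec for name, spec in virtual_pods.items()
                 if name not in physical_pods}
    # Pods present in both, but whose spec has drifted.
    to_update = {name: spec for name, spec in virtual_pods.items()
                 if name in physical_pods and physical_pods[name] != spec}
    # Pods that exist physically but were deleted from the virtual cluster.
    to_delete = [name for name in physical_pods
                 if name not in virtual_pods]
    return to_create, to_update, to_delete
```

The virtual cluster remains the source of truth; the syncer only ever pushes its desired state downward, which is what lets tenants stay unaware of the physical topology.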

The scheduler supports strong isolation for multi‑tenant public‑cloud scenarios and implements quota‑bypass mechanisms, fair‑share algorithms, and automatic resource balancing.
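To make the fair‑share idea concrete, here is a classic max‑min fair allocation, a common basis for such algorithms; this is an illustrative sketch, not FengLuan's actual scheduler code:

```python
def fair_share(capacity, demands):
    """Max-min fair allocation of `capacity` units across tenants.
    `demands` maps tenant -> requested amount. Tenants asking for less
    than an equal split keep their full request, and the leftover is
    redistributed among the remaining tenants."""
    alloc = {}
    remaining = dict(demands)
    left = capacity
    while remaining and left > 0:
        share = left / len(remaining)
        # Fully satisfy every tenant whose demand fits under the equal share.
        satisfied = {t: d for t, d in remaining.items() if d <= share}
        if not satisfied:
            # Everyone wants more than the share: split evenly and stop.
            for t in remaining:
                alloc[t] = share
            return alloc
        for t, d in satisfied.items():
            alloc[t] = d
            left -= d
            del remaining[t]
    for t in remaining:
        alloc[t] = 0.0
    return alloc
```

For example, with 10 units and demands of 2, 5, and 8, the small tenant keeps its 2 and the other two split the remaining 8 evenly, 4 each.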

Mixed‑Workload (Mix‑Placement) Layer

Using the Caelus component, FengLuan colocates online and offline workloads across the full range of scenarios. A master controller manages global placement, while dedicated modules handle resource profiling, interference detection, and metric collection.

Runtime hooks at the OCI layer enable non‑intrusive management of offline tasks, and eBPF‑based monitoring correlates kernel metrics with business metrics to detect interference.
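A toy version of that correlation logic might look like the following, where the metric names and thresholds are invented for illustration and are not Caelus's actual values:

```python
def detect_interference(samples, cpi_jump=1.3, latency_jump=1.2):
    """Flag interference when a kernel-level signal (cycles per
    instruction, CPI) and a business-level signal (request latency)
    degrade together relative to the first sample's baseline.
    `samples` is a list of dicts with keys ts, cpi, latency_ms."""
    alerts = []
    baseline_cpi = samples[0]["cpi"]
    baseline_lat = samples[0]["latency_ms"]
    for s in samples[1:]:
        cpi_degraded = s["cpi"] > baseline_cpi * cpi_jump
        lat_degraded = s["latency_ms"] > baseline_lat * latency_jump
        # Requiring both signals to degrade filters out cases where the
        # kernel metric moves for reasons unrelated to user-visible impact.
        if cpi_degraded and lat_degraded:
            alerts.append(s["ts"])
    return alerts
```

The point of joining the two signal levels is precision: a CPI spike alone may be harmless, and a latency spike alone may come from upstream, but both together strongly suggest noisy‑neighbor interference.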

Runtime Layer

FengLuan provides hot migration to relocate offline tasks without IP or network disruption when online workloads require resources.

Memory compression (moderate and aggressive strategies) reduces memory consumption by 12‑19% with minimal CPU overhead, improving overall resource utilization.
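As a back‑of‑the‑envelope consequence of those savings figures: if compression frees a fraction f of each node's memory, the same fleet can host 1 / (1 − f) times the original memory footprint:

```python
def effective_capacity_gain(saving_fraction):
    """Multiplier on hostable memory footprint when compression frees
    `saving_fraction` of each node's memory (simplified: ignores the
    CPU cost of compression and uneven per-workload savings)."""
    return 1.0 / (1.0 - saving_fraction)
```

At the reported 12‑19% savings this works out to roughly a 14‑23% gain in effective capacity, before accounting for compression's CPU overhead.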

3. Summary and Outlook

Key technical contributions include:

Self‑developed Remote Shuffle Service for storage‑compute separation.

Virtual‑cluster architecture delivering strong isolation and massive‑scale scheduling.

Custom big‑data scheduler achieving more than 10× scheduling throughput.

First‑ever runtime hot migration for big‑data workloads.

Full‑scenario online‑offline mixed placement that reduces costs.

Open‑source releases: Tencent Caelus, Uniffle, JDK Kona, InLong, and the Cloud‑Native Big Data Platform Maturity Model (Part 6).

Thank you for reading.

Written by DataFunSummit, the official account of the DataFun community, which shares big data and AI industry summit news and speaker talks.