How Cloud‑Native Principles Transform Big Data Infrastructure

The article analyzes how cloud‑native concepts such as DevOps, micro‑services, continuous delivery, and containerization can be applied to big‑data foundations, outlining four guiding principles—industrialized delivery, cost quantification, load‑adaptive scaling, and data‑centric design—and describing concrete Hadoop‑based architectures and Tencent Cloud solutions that lower cost while boosting performance.

Tencent Cloud Developer

May 19, 2021

1. Core Ideas of Cloud‑Native

The term cloud‑native, coined by Matt Stine, is commonly described as having four elements: DevOps, micro‑services, continuous delivery, and containerization. Together they aim to industrialize software production, cutting cost and raising efficiency.

DevOps integrates development and operations through agile culture and tooling, which makes continuous delivery without downtime possible. Micro‑services break a monolithic application into loosely coupled services. Conway’s Law guides where to draw the service boundaries, and a poor decomposition produces chaos instead of agility. Containerization uses Docker and Kubernetes to run elastic, dynamically scheduled workloads, which improves resource utilization.

2. Defining Cloud‑Native for Big Data

Big data refers to datasets whose scale, velocity, variety, and low value density exceed what traditional databases can handle. Cloud‑native big data therefore means using cloud infrastructure to acquire, manage, store, and analyze massive datasets while reducing cost and improving efficiency.

3. Principles to Achieve Cloud‑Native Big Data

Industrialized Delivery: Provide on‑demand, minute‑level provisioning of data‑processing clusters with built‑in governance and operations.

Cost Quantification: Measure storage and compute consumption per workload to enable transparent billing.

Load‑Adaptive Scaling: Dynamically adjust resource allocation according to data volume and processing demand.

Data‑Centric Design: Treat data as the primary asset; the platform should serve data analysis needs, not the other way around.
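To make the cost quantification principle concrete, the following is a minimal sketch of per‑workload billing. The unit prices, the `WorkloadUsage` type, and the function names are assumptions made for illustration; they are not taken from the article or from any provider's price list.

```python
from dataclasses import dataclass

# Hypothetical unit prices, for illustration only.
PRICE_PER_VCORE_HOUR = 0.04
PRICE_PER_GB_RAM_HOUR = 0.005
PRICE_PER_GB_STORAGE_MONTH = 0.02


@dataclass
class WorkloadUsage:
    name: str
    vcore_seconds: float      # vcores held, integrated over time
    memory_gb_seconds: float  # GB of RAM held, integrated over time
    storage_gb_months: float  # average GB stored over the billing month


def workload_cost(usage: WorkloadUsage) -> float:
    """Turn measured consumption into a single cost figure for one workload."""
    compute = usage.vcore_seconds / 3600 * PRICE_PER_VCORE_HOUR
    memory = usage.memory_gb_seconds / 3600 * PRICE_PER_GB_RAM_HOUR
    storage = usage.storage_gb_months * PRICE_PER_GB_STORAGE_MONTH
    return round(compute + memory + storage, 4)


def bill(usages):
    """Return a per-workload bill keyed by workload name."""
    return {u.name: workload_cost(u) for u in usages}


if __name__ == "__main__":
    etl = WorkloadUsage("nightly_etl", vcore_seconds=360_000,
                        memory_gb_seconds=1_440_000, storage_gb_months=500)
    print(bill([etl]))
```

Once consumption is recorded per workload in this form, the bill is a plain sum, and that is what makes the charge transparent to the team that owns the workload.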

4. Implementing the Principles with Hadoop Ecosystem

The Hadoop stack remains the de facto standard for big‑data processing. To make it cloud‑native, each component must support the four principles:

Industrialized Delivery: Use cloud‑based EMR to spin up clusters on demand, submit jobs through management APIs, and release resources once the jobs complete.

Cost Quantification: Expose storage and compute metrics for each cluster, enabling per‑job accounting.

Load‑Adaptive Scaling: Leverage YARN’s resource monitoring (vcore, vmem) or time‑based policies to auto‑scale clusters.

Data‑Centric Design: Provide unified data APIs, authentication, and authorization so applications focus on data insights.
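The load‑based variant of auto‑scaling can be sketched as a pure function over cluster metrics. The dictionary keys below are modeled on the fields reported by the YARN ResourceManager cluster metrics endpoint. The target utilization, the node limits, and the function name `desired_node_count` are assumptions for illustration, not EMR's actual policy.

```python
import math


def desired_node_count(metrics, current_nodes, min_nodes=2, max_nodes=50,
                       target_utilization=0.7):
    """Suggest a cluster size that brings the busier resource near the target.

    `metrics` is a dict shaped like YARN cluster metrics, for example
    {"allocatedVirtualCores": 90, "totalVirtualCores": 100,
     "allocatedMB": 50, "totalMB": 100, "containersPending": 0}.
    """
    total_vcores = metrics["totalVirtualCores"]
    total_mb = metrics["totalMB"]
    if total_vcores <= 0 or total_mb <= 0:
        # No usable capacity reported; fall back to the configured minimum.
        return min_nodes

    vcore_util = metrics["allocatedVirtualCores"] / total_vcores
    mem_util = metrics["allocatedMB"] / total_mb
    utilization = max(vcore_util, mem_util)  # scale on the scarcer resource

    wanted = math.ceil(current_nodes * utilization / target_utilization)

    # Pending containers mean demand is already unmet, so grow by at least one.
    if metrics.get("containersPending", 0) > 0:
        wanted = max(wanted, current_nodes + 1)

    return max(min_nodes, min(max_nodes, wanted))
```

Scaling on the maximum of vcore and memory utilization is a deliberate simplification: whichever resource runs out first is the one that blocks new containers.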

Four deployment patterns are described:

Traditional Mode: Full‑stack IDC clusters with optional cloud‑based optimizations for storage and compute.

Compute‑Storage Separation: Store data in cloud object storage; launch large compute clusters only when needed.

Hybrid Cloud: Bridge on‑premise IDC clusters with cloud EMR via VPN or dedicated lines, extending compute capacity.

Hybrid Compute: Combine container clusters (TKE/STKE) with EMR, shifting idle container resources to big‑data workloads.

These patterns enable flexible use of cloud storage and compute, which can substantially lower hardware costs.
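The cost argument behind compute‑storage separation comes down to simple arithmetic. The sketch below compares an always‑on cluster with object storage plus a cluster that runs only while jobs are active. All prices, sizes, and function names are hypothetical and chosen only to show the shape of the calculation.

```python
def monthly_cost_coupled(nodes, node_price_per_hour, hours_in_month=730):
    """Always-on cluster: nodes hold the data, so they must run all month."""
    return nodes * node_price_per_hour * hours_in_month


def monthly_cost_separated(data_tb, storage_price_per_tb_month,
                           nodes, node_price_per_hour, busy_hours):
    """Data lives in object storage; compute is paid only for busy hours."""
    storage = data_tb * storage_price_per_tb_month
    compute = nodes * node_price_per_hour * busy_hours
    return storage + compute


def break_even_hours(data_tb, storage_price_per_tb_month,
                     nodes, node_price_per_hour, hours_in_month=730):
    """Busy hours per month at which both designs cost the same."""
    coupled = monthly_cost_coupled(nodes, node_price_per_hour, hours_in_month)
    storage = data_tb * storage_price_per_tb_month
    return (coupled - storage) / (nodes * node_price_per_hour)


if __name__ == "__main__":
    # Hypothetical figures: 20 nodes at 0.5 per hour, 100 TB at 20 per TB-month.
    print(monthly_cost_coupled(20, 0.5))
    print(monthly_cost_separated(100, 20, 20, 0.5, busy_hours=120))
    print(break_even_hours(100, 20, 20, 0.5))
```

Under these assumed numbers the separated design is cheaper as long as the cluster is busy for fewer hours than the break‑even point. A cluster that is busy almost all month gains little from the split.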

5. Tencent Cloud Big Data Solutions

Tencent Cloud offers a suite of services that embody the above principles:

Data Ingestion: Object storage tools, CDC streams, Kafka, Sqoop, and Spark for batch import.

Processing Engines: EMR (Spark), DLC (Spark/SQL), Oceanus (Flink) for streaming, and ClickHouse/GP for real‑time warehousing.

Data Service Layer: Unified metadata management (DLF), query‑level resource control, and pay‑per‑scan pricing.

Performance Optimizations: CacheService for storage‑compute separation, YARN‑aware scheduling, and hardware‑level tuning (e.g., using IT series instances for low‑latency workloads).

For load‑adaptive scaling, EMR supports both load‑based and time‑based auto‑scaling. It adjusts cluster size while preserving the SLA by avoiding task failures during scale‑in.
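As a rough illustration of these two ideas, the sketch below pairs a time‑based sizing rule with a scale‑in step that releases only idle nodes, so running tasks are never killed. The `Node` type, the schedule format, and both function names are assumptions for this example; the article does not describe how EMR implements either behavior.

```python
from dataclasses import dataclass


@dataclass
class Node:
    host: str
    running_containers: int


def scheduled_size(hour, schedule, default_size):
    """Time-based rule: return the size of the first window containing `hour`.

    `schedule` is a list of (start_hour, end_hour, node_count) tuples,
    with the end hour exclusive.
    """
    for start, end, size in schedule:
        if start <= hour < end:
            return size
    return default_size


def pick_scale_in_nodes(nodes, remove_count):
    """Choose up to `remove_count` hosts to release, idle nodes only.

    Busy nodes are skipped and left for a later pass, which is how this
    sketch avoids failing tasks during scale-in.
    """
    idle = [n for n in nodes if n.running_containers == 0]
    return [n.host for n in idle[:remove_count]]


if __name__ == "__main__":
    nightly = [(1, 6, 20)]  # hypothetical: 20 nodes between 01:00 and 06:00
    print(scheduled_size(2, nightly, default_size=5))
    cluster = [Node("a", 0), Node("b", 3), Node("c", 0), Node("d", 1)]
    print(pick_scale_in_nodes(cluster, remove_count=3))
```

The scale‑in step may release fewer nodes than requested. That trades slower shrinking for task safety, which matches the SLA priority stated above.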

6. Vision for the Next‑Generation Big Data Engine

The author envisions a future engine that unifies authentication, resource‑group‑based SQL execution, fine‑grained scheduler control, and hardware‑aware execution (e.g., RDMA, DPDK) to eliminate cross‑node performance penalties and simplify the data‑service stack.

