How Cloud‑Native Data Lakes Slash Costs and Boost Performance on Public Cloud
The article analyzes the challenges of moving traditional on‑premise big‑data platforms to the cloud, outlines the cost‑saving opportunities of cloud‑native data lakes, presents three core architectural principles, and reviews Tencent Cloud's data lake product suite and its key use cases.
Challenges of Migrating Traditional On‑Premise Big Data Platforms to the Cloud
Low utilization / poor timeliness : Over‑provisioned clusters lead to low CPU/RAM usage and delayed data production.
Inflexibility : Hard to respond to ad‑hoc or back‑fill jobs; cluster upgrades and data migration are cumbersome.
High cost : Mismatch between HDFS storage size and compute capacity wastes resources; cloud VM hourly rates are high; HDFS maintenance adds operational expense.
Poor performance : Uniform instance types cannot be tuned for workloads with heavy shuffle I/O or different IOPS requirements.
Reliability concerns : Multi‑AZ HDFS deployments suffer from limited cross‑AZ bandwidth and complex disaster‑recovery procedures.
Opportunities in the Public‑Cloud Shared Economy
Elastic computing : Spot instances and auto‑scaling can reduce compute cost to 10‑30% of on‑demand pricing.
Object storage : Cloud object stores (e.g., COS) cost roughly one‑fifth to one‑tenth as much as HDFS (a 1:5 to 1:10 ratio), and offer a strong SLA with four‑nine availability and eleven‑nine durability.
Diverse instance families : A richer set of VM types enables workload‑specific performance tuning.
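The ratios quoted above can be turned into a back-of-the-envelope estimate. The sketch below is only illustrative: the function name, its parameters, and every input figure are hypothetical, and it assumes that a fixed share of compute can run on spot capacity and that all HDFS data moves to object storage.

```python
def estimate_monthly_cost(
    compute_on_demand: float,
    storage_hdfs: float,
    spot_fraction: float,
    spot_price_ratio: float,
    object_storage_ratio: float,
) -> dict:
    """Rough monthly cost before and after moving to spot VMs and object storage.

    compute_on_demand:    current compute bill at on-demand prices
    storage_hdfs:         current cost of HDFS-backed storage
    spot_fraction:        share of compute that can run on spot VMs (0..1)
    spot_price_ratio:     spot price as a fraction of on-demand (the article cites 0.1-0.3)
    object_storage_ratio: object-storage cost as a fraction of HDFS (the article cites 0.1-0.2)
    """
    if not 0.0 <= spot_fraction <= 1.0:
        raise ValueError("spot_fraction must be between 0 and 1")

    # Compute that moves to spot is billed at the discounted rate.
    spot_compute = compute_on_demand * spot_fraction * spot_price_ratio
    # The remainder stays on on-demand instances at full price.
    baseline_compute = compute_on_demand * (1.0 - spot_fraction)
    # Storage is re-priced at the object-storage ratio.
    storage = storage_hdfs * object_storage_ratio

    before = compute_on_demand + storage_hdfs
    after = spot_compute + baseline_compute + storage
    return {
        "before": before,
        "after": after,
        "savings_fraction": (before - after) / before if before else 0.0,
    }
```

With hypothetical inputs of 10,000 for compute and 5,000 for storage, 80% of compute on spot at 20% of the on-demand price, and object storage at one-fifth the cost of HDFS, the estimate drops from 15,000 to about 4,600. Real savings depend on spot availability, request charges, and data-transfer fees, none of which this sketch models.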
Core Principles of a Cloud‑Native Data Lake Architecture
1. Object Storage (Separation of Compute and Storage)
Replacing HDFS with cloud object storage yields:
Significant cost reduction (roughly one‑fifth to one‑tenth the cost of HDFS).
Four‑nine availability and eleven‑nine durability, surpassing typical HDFS setups.
Native features such as versioning, lifecycle policies, cross‑region backup, event‑driven triggers, and pay‑per‑access.
Elimination of the compute‑storage capacity mismatch.
Shared data pool for multiple workloads, reducing synchronization complexity.
Direct use of object storage introduces technical challenges:
No atomic rename operation – commit phases of distributed jobs become slow.
Eventual consistency can cause task failures or stale reads.
List operations are slower, increasing scan latency for analytical queries.
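One common way around the missing rename is a manifest-based commit, the general idea behind several open table formats. Tasks write data files straight to their final keys, and the job becomes visible only when a single manifest object is written. The sketch below illustrates that idea and is not the approach of any particular product: `InMemoryObjectStore` is a hypothetical stand-in for a real object store, and the key layout is an assumption.

```python
import json
import uuid


class InMemoryObjectStore:
    """Hypothetical stand-in for an object store: whole-object PUT/GET, no rename."""

    def __init__(self):
        self._objects = {}

    def put(self, key: str, data: bytes) -> None:
        # A single-object PUT is all-or-nothing from the reader's point of view.
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

    def exists(self, key: str) -> bool:
        return key in self._objects


def write_task_output(store, table: str, rows: list) -> str:
    """Write one task's rows directly to a unique final key (no temp dir, no rename)."""
    key = f"{table}/data/part-{uuid.uuid4().hex}.json"
    store.put(key, json.dumps(rows).encode("utf-8"))
    return key


def commit_job(store, table: str, version: int, data_keys: list) -> str:
    """Publish a job by writing one manifest that lists its data files."""
    manifest_key = f"{table}/manifests/v{version:08d}.json"
    # NOTE: check-then-put is not atomic on a real store; a production system
    # would need a conditional write or an external lock here.
    if store.exists(manifest_key):
        raise RuntimeError(f"version {version} is already committed")
    manifest = {"files": sorted(data_keys)}
    store.put(manifest_key, json.dumps(manifest).encode("utf-8"))
    return manifest_key


def read_table(store, table: str, version: int) -> list:
    """Read only the files named in the manifest, so no LIST call is needed."""
    manifest = json.loads(store.get(f"{table}/manifests/v{version:08d}.json"))
    rows = []
    for key in manifest["files"]:
        rows.extend(json.loads(store.get(key)))
    return rows
```

Because readers follow the manifest rather than listing a prefix, output from failed or speculative task attempts stays invisible, and the slow LIST path is avoided on reads. Orphaned files still need a separate cleanup pass.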
2. Elastic Computing
Elastic resources, especially spot VMs, cut idle‑time waste and support bursty back‑fill workloads. Shuffle‑heavy jobs can hinder scaling, because a node that holds intermediate shuffle data cannot be released until downstream stages have read it; efficient shuffle placement and isolation are therefore required. Migrating from YARN's static scheduling to a Kubernetes‑based orchestrator is a common step, enabling fine‑grained pod scaling, node‑pool management, and automatic fault tolerance.
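The interaction between scaling and shuffle can be sketched as a small decision routine. Everything here is hypothetical (the `Worker` fields, both function names, and the slot model); it stands in for logic that a real orchestrator or autoscaler would implement, and it simply refuses to release any worker that is busy or still holds shuffle data.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Worker:
    name: str
    running_tasks: int
    holds_shuffle_data: bool


def desired_worker_count(
    pending_tasks: int,
    running_tasks: int,
    slots_per_worker: int,
    min_workers: int,
    max_workers: int,
) -> int:
    """Workers needed to run all queued and running tasks, clamped to pool limits."""
    needed = math.ceil((pending_tasks + running_tasks) / slots_per_worker)
    return max(min_workers, min(max_workers, needed))


def pick_scale_down_candidates(workers: list, target_count: int) -> list:
    """Choose workers that can be released without losing work or shuffle output."""
    surplus = len(workers) - target_count
    if surplus <= 0:
        return []
    # Only idle workers with no shuffle data are safe to remove.
    safe = [w for w in workers if w.running_tasks == 0 and not w.holds_shuffle_data]
    # Sort by name so the choice is deterministic.
    safe.sort(key=lambda w: w.name)
    return safe[:surplus]
```

If shuffle output is placed on a dedicated node pool or an external shuffle service, fewer workers carry the `holds_shuffle_data` flag and more of the fleet becomes eligible for release, which is the point of the placement and isolation requirement.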
3. Performance Boost via Caching and Modeling Innovations
Because object storage removes data locality, compensating mechanisms are needed:
Caching : In‑memory or SSD caches can achieve 70‑80% hit rates for read‑only workloads (as reported by Snowflake), dramatically reducing object‑store read latency.
Data layout optimizations : Sparse indexes, partitioning, bucketing, and columnar formats (Parquet, ORC) shrink the amount of data scanned.
AP‑to‑TP format convergence : Storage engines such as ClickHouse adopt analytical‑processing‑oriented layouts that also support transactional workloads, delivering high query throughput with modest storage overhead.
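The caching mechanism in the list above can be illustrated with a minimal read-through LRU cache placed in front of an object store. This is a sketch under stated assumptions: `ReadThroughCache` and its `fetch` callback are hypothetical names, the cache is in-memory and single-threaded, and cached objects are assumed to be immutable, which matches the read-only case mentioned above.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Byte-bounded LRU cache that falls back to a fetch function on a miss."""

    def __init__(self, fetch: Callable[[str], bytes], capacity_bytes: int):
        self._fetch = fetch              # e.g. a function that issues an object-store GET
        self._capacity = capacity_bytes
        self._entries = OrderedDict()    # key -> bytes, least recently used first
        self._used = 0
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> bytes:
        if key in self._entries:
            # Hit: mark the entry as most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Miss: read from the backing store, then try to cache the result.
        self.misses += 1
        data = self._fetch(key)
        self._admit(key, data)
        return data

    def _admit(self, key: str, data: bytes) -> None:
        if len(data) > self._capacity:
            return  # object larger than the whole cache; serve it uncached
        self._entries[key] = data
        self._used += len(data)
        # Evict least recently used entries until the cache fits again.
        while self._used > self._capacity:
            _, evicted = self._entries.popitem(last=False)
            self._used -= len(evicted)

    @property
    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Tracking `hit_rate` makes it possible to check whether a given cache size approaches the hit rates cited above for a particular workload. A production cache would also need invalidation for mutable objects and usually a spill tier on local SSD.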
Typical Cloud‑Native Data Lake Components
Data Lake Compute (DLC) : A serverless SQL engine that runs queries directly on object storage (COS) without provisioning dedicated compute clusters.
Data Lake Formation (DLF) : A metadata‑centric service that provides fast lake construction, unified catalog, multi‑source ingestion, workflow orchestration, and fine‑grained access control.
Product pages: https://cloud.tencent.com/product/dlc (DLC) and https://cloud.tencent.com/product/dlf (DLF)
Representative Use Cases
Data lake construction : Rapid ingestion and preparation of heterogeneous data sources.
Direct analysis : SQL queries over COS objects (CSV, JSON, Avro, Parquet, ORC) without loading into a separate engine; integrates with BI tools.
Federated analysis : Cross‑source SQL across object storage, cloud databases, and big‑data services, eliminating ETL bottlenecks.
Unified metadata governance : Central catalog and enterprise‑level permission management for all data assets.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
