
What Is a Cloud‑Native Data Platform? Architecture, Components, and Best Practices

This article traces the evolution and architecture of cloud‑native data platforms: their historical roots; core components such as ingestion, storage, processing, metadata, and consumption; and practical guidance on selecting tools, designing pipelines, and applying best practices for scalable, flexible data infrastructure.

GuanYuan Data Tech Team

Development History

Early data platforms were built on relational databases (RDBMS) for OLTP, later extended with OLAP for analysis. The data warehouse concept took shape in the late 1980s and early 1990s, with Bill Inmon's "Building the Data Warehouse" and Ralph Kimball's "The Data Warehouse Toolkit" establishing the foundational methodologies. Through the 1990s, commercial RDBMS such as IBM DB2, SQL Server, Teradata, and Oracle dominated the market.

With the rise of the internet in the early 2000s, "big data" challenges appeared. Google's classic GFS, MapReduce, and Bigtable papers introduced a distributed data-system stack distinct from traditional RDBMS. Hadoop popularized this stack (HDFS, YARN, HBase, Hive) and sparked a decade of ecosystem growth, including data lakes, NoSQL, and SQL‑on‑Hadoop solutions.

Cloud computing, led by AWS, exposed Hadoop's limitations—tight coupling of storage and compute, high operational cost, and poor support for streaming. The cloud‑native era pushes for managed services, elasticity, and pay‑per‑query models, prompting the emergence of new cloud‑native data platforms.

Data Platform Architecture

Classic Data Warehouse Architecture

Traditional data platforms rely on ETL tools to load source data into a warehouse, then expose services via SQL.

Traditional Data Platform

Data Lake Architecture

Data lakes store diverse semi‑structured and unstructured data (JSON, Avro, ProtoBuf, text, images, audio, video). They require more flexible processing beyond pure SQL.

Cloud‑Native Architecture

Cloud‑native platforms combine SaaS components, managed services, and pay‑as‑you‑go storage. They integrate data‑lake and warehouse concepts (lakehouse) and support both batch and streaming workloads.

Lambda Architecture
Kappa Architecture

a16z's unified data architecture diagram illustrates the data flow from source, ingestion, storage, query processing, transformation, to analysis, allowing selective component adoption.

a16z Unified Data Architecture
Cloud Data Platform Architecture

Data Acquisition

Platforms must support both batch and streaming ingestion.

Batch Acquisition

Typical sources include files, FTP, or APIs; data is fetched on a schedule, providing near‑real‑time updates if the schedule is frequent.
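A recurring detail in scheduled batch pulls is mapping each run to an idempotent landing partition, so reruns within the same window overwrite rather than duplicate. A minimal sketch (the path layout and 60‑minute window are illustrative assumptions, not any specific tool's convention):

```python
from datetime import datetime, timedelta

def partition_path(base: str, ts: datetime, interval_minutes: int = 60) -> str:
    """Return the landing path for the batch window containing ts.

    Truncates ts to the start of its window, so repeated runs inside the
    same window target the same partition (idempotent overwrites).
    """
    window = ts - timedelta(
        minutes=ts.minute % interval_minutes,
        seconds=ts.second,
        microseconds=ts.microsecond,
    )
    return f"{base}/dt={window:%Y-%m-%d}/hour={window:%H}/batch.json"
```

Any run scheduled within 13:00–13:59 lands in the `hour=13` partition, which is what makes a frequent schedule safely re-runnable.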

Streaming Acquisition

Streaming is essential for low‑latency use cases such as recommendation or fraud detection. CDC captures real‑time changes from databases, avoiding loss of intermediate state.
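What CDC buys over periodic snapshots is that every intermediate change is captured and replayable. A toy sketch of replaying a Debezium-style change stream onto a keyed snapshot (the event envelope here is simplified and hypothetical):

```python
def apply_cdc(snapshot: dict, events: list[dict]) -> dict:
    """Replay a list of change events onto a key -> row snapshot.

    Each event carries op ('c' create, 'u' update, 'd' delete), a key,
    and the after-image of the row (None for deletes) -- loosely
    mirroring the envelope CDC tools such as Debezium emit.
    """
    state = dict(snapshot)
    for ev in events:
        if ev["op"] == "d":
            state.pop(ev["key"], None)
        else:  # 'c' and 'u' both upsert the after-image
            state[ev["key"]] = ev["after"]
    return state
```

A snapshot taken only at the end would miss that order 2 ever existed; the change stream preserves every transition.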

Recommended path: first build stable batch ingestion, then add streaming acquisition, and finally streaming processing (e.g., Flink).

Various Data Ingestion

Requirements & Products

Plugin architecture for many source types.

Operational observability and error handling.

Performance and stability for large volumes.

Products include cloud services (AWS Glue, Google Cloud Data Fusion, Azure Data Factory), SaaS (Fivetran, Stitch, Airbyte), open‑source (Apache NiFi, Kafka, Pulsar, Debezium), and serverless custom solutions.

Product Selection Trade‑off

Data Storage

Storage is split into slow (object‑store‑based lake) and fast (real‑time stores).

Slow Storage

Object storage (Amazon S3, Google Cloud Storage, Azure Blob Storage) forms the data lake; the lakehouse layer adds metadata management, schema evolution, and ACID transactions (Delta Lake, Apache Iceberg).
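The schema-evolution guarantee that table formats layer on top of raw object storage can be shown in miniature: additive changes (new columns) are accepted, while type changes are rejected before a write commits. This is a simplified sketch of the idea, not either format's actual implementation:

```python
def evolve_schema(current: dict, incoming: dict) -> dict:
    """Additive schema evolution check: columns may be added, but an
    existing column's type must not change -- roughly the enforcement
    table formats like Delta Lake and Iceberg apply before committing.
    Schemas are modeled as column-name -> type-name dicts for brevity.
    """
    merged = dict(current)
    for col, dtype in incoming.items():
        if col in merged and merged[col] != dtype:
            raise TypeError(
                f"type change for column {col!r}: {merged[col]} -> {dtype}"
            )
        merged[col] = dtype  # new columns are admitted
    return merged
```

Without this layer, a writer can silently corrupt a lake partition; with it, the bad write fails before readers ever see it.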

Fast Storage

Fast stores support low‑latency queries (ClickHouse, KV stores) or streaming system storage (Kafka tiered storage).

Fast and Slow Storage

Requirements & Products

Reliability (no data loss).

Scalability.

Performance (high throughput for slow, low latency for fast).

Typical choices: cloud vendor services, or open‑source projects like lakeFS, JuiceFS, SeaweedFS.

Data Processing

Batch Processing

Apache Spark is the dominant engine; Hive, Presto, Dremio also serve SQL workloads.

Streaming Processing

Apache Flink, Spark Streaming, and Kafka Streams handle real‑time pipelines; results are often written to analytical databases (ClickHouse, Pinot) or search stores.

Unified Batch and Stream Processing

Flink treats batch as a special case of streaming; Apache Beam provides a unified DSL that can run on Spark or Flink.
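The unified model can be sketched without any framework: write the transform logic once over an iterator, then feed it either a bounded source (batch: a file) or an unbounded one (streaming: a socket or Kafka iterator). All names below are illustrative:

```python
from typing import Iterable, Iterator

def parse(lines: Iterable[str]) -> Iterator[tuple[str, int]]:
    """Decode 'user,amount' records."""
    for line in lines:
        user, amount = line.split(",")
        yield user, int(amount)

def running_totals(events: Iterable[tuple[str, int]]) -> Iterator[tuple[str, int]]:
    """Emit a per-user running total after each event."""
    totals: dict[str, int] = {}
    for user, amount in events:
        totals[user] = totals.get(user, 0) + amount
        yield user, totals[user]

def pipeline(source: Iterable[str]) -> Iterator[tuple[str, int]]:
    """The same pipeline body serves batch (a finite file) and
    streaming (an unbounded iterator): only the source differs."""
    return running_totals(parse(source))
```

This is the intuition behind Flink's "batch is a bounded stream" and Beam's single DSL: the transforms are source-agnostic, and only the runner decides when results are materialized.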

Flink Unified Batch‑Streaming
Beam Unified Model

Requirements & Products

Horizontal scalability.

Stability and failover.

Open APIs (SQL, SDKs).

Managed services include AWS EMR, Google Dataproc, Azure Databricks, Kinesis Data Analytics, Cloud Dataflow, Confluent, Upsolver, Materialize.

Metadata

Metadata includes platform configuration, lineage, schema, statistics, and operational logs, enabling monitoring, debugging, and governance.

Platform Metadata

Tracks data source configs, job status, schema evolution, and resource usage.

Business Metadata

Supports data catalog, tagging, discovery, and compliance (privacy, governance).

Open‑source solutions include Marquez, Apache Atlas, Amundsen, and DataHub; commercial catalogs include Atlan and Alation. Cloud catalogs: AWS Glue Data Catalog, Google Data Catalog, Azure Data Catalog.

Platform Metadata Types
Schema Registry
DataOps Cycle
Atlan Features
BigEye

Data Consumption

Consumption includes BI queries, data‑science notebooks, real‑time APIs, and metric stores.

Analytical Queries

Modern warehouses (BigQuery, Redshift, Snowflake) and lakehouse engines (Photon, Presto, Dremio) provide fast, interactive SQL.

Data Science

Python notebooks read raw files from slow storage for model training; feature stores and MLOps integrate batch and real‑time features.

Real‑Time Consumption

Results can be pushed to relational databases, KV stores, caches (Redis), or search engines (Elasticsearch). Flink also supports real‑time queries via custom APIs.

Security & Governance

Access control, auditing, encryption, and data‑masking are essential; solutions include cloud IAM, Auth0, Immuta, and network security services (VPC, VPN).

Service‑Layer Products

Metric stores and modeling layers such as LookML, Transform, and Metlo expose unified data services.

Metric Store

Orchestration & ETL

Workflow engines schedule pipelines, handle dependencies, retries, and monitoring.

Orchestration

Tools include Airflow, Cloud Composer, Dagster, Prefect, Flyte, Argo (KubeFlow Pipelines), and DolphinScheduler.
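Under all of these tools, deciding which tasks are runnable reduces to a topological sort of the dependency graph. A rough sketch using Kahn's algorithm (task names are illustrative):

```python
from collections import deque

def runnable_order(deps: dict[str, set[str]]) -> list[str]:
    """Return tasks in an order where every task follows its upstreams.

    deps maps task -> set of upstream tasks. A task becomes ready only
    once all of its upstreams have run; a cycle raises ValueError,
    which is why workflow engines require the graph to be a DAG.
    """
    indegree = {task: len(upstream) for task, upstream in deps.items()}
    ready = deque(sorted(t for t, d in indegree.items() if d == 0))
    order: list[str] = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for downstream, upstream in deps.items():
            if task in upstream:
                indegree[downstream] -= 1
                if indegree[downstream] == 0:
                    ready.append(downstream)
    if len(order) != len(deps):
        raise ValueError("cycle detected in task graph")
    return order
```

Real engines add retries, backfills, and per-task state on top, but the scheduling core is this dependency resolution.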

ETL

Low‑code/no‑code platforms (SmartETL, Fivetran, Stitch, Airbyte) simplify transformation; dbt brings software‑engineered data modeling.

Orchestration Diagram
SmartETL
dbt

Best Practices

Data Layering

Adopt layered designs to separate raw, cleaned, and production‑ready data: classic warehouse layers (staging, integration, marts) or their lakehouse medallion equivalents (bronze, silver, gold).
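A toy illustration of the layer contracts (field names and validation rules are hypothetical): bronze keeps records as landed, silver enforces validity and types, gold aggregates to the metrics the business consumes.

```python
def to_silver(bronze: list[dict]) -> list[dict]:
    """Bronze -> silver: drop malformed rows, normalize types.

    The validity rule here (a non-null order_id) stands in for whatever
    quality contract the silver layer enforces in practice.
    """
    out = []
    for row in bronze:
        if row.get("order_id") is None:
            continue  # invalid rows never reach silver
        out.append({
            "order_id": str(row["order_id"]),
            "amount": float(row.get("amount", 0)),
        })
    return out

def to_gold(silver: list[dict]) -> dict:
    """Silver -> gold: aggregate to production-ready metrics."""
    return {
        "revenue": sum(r["amount"] for r in silver),
        "orders": len(silver),
    }
```

The point of the layering is that each boundary has a contract: consumers of gold never see raw nulls or mixed types, and debugging can always walk back to the untouched bronze copy.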

Data Warehouse Layers
Lakehouse Architecture

Separate Streaming Ingestion from Streaming Analytics

Use streaming ingestion to feed warehouses for near‑real‑time dashboards; use dedicated streaming analytics for high‑frequency, low‑latency use cases (e.g., gaming).

Streaming Ingestion Architecture
Streaming Analytics Architecture

Control Cloud Costs

Monitor resource usage, design hot‑cold storage, and optimize data partitioning to avoid unnecessary expenses.
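Hot-cold design often starts as a simple age-based policy applied per partition. The thresholds and tier names below are illustrative assumptions to be tuned against actual access patterns and vendor pricing (e.g., S3 Standard vs. Infrequent Access vs. Glacier):

```python
from datetime import date

def storage_tier(partition_date: date, today: date,
                 hot_days: int = 30, warm_days: int = 365) -> str:
    """Classify a partition by age: recent data stays on fast, expensive
    storage; older data migrates to cheaper tiers. The 30/365-day cutoffs
    are placeholder values, not a recommendation.
    """
    age = (today - partition_date).days
    if age <= hot_days:
        return "hot"
    if age <= warm_days:
        return "warm"
    return "cold"
```

In practice the same decision is usually delegated to object-store lifecycle rules, but making the policy explicit keeps it reviewable alongside the partitioning scheme it depends on.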

Avoid Tight Coupling

Design loosely coupled interfaces, encapsulate vendor‑specific APIs, and keep components replaceable.

Data Platform Construction

Business Value

Data platforms drive efficiency, revenue growth, innovation, and compliance.

Construction Path

Follow an agile, scenario‑driven roadmap: self‑service BI → metric layers → automated pipelines → AI‑enhanced insights → action‑oriented APIs.

Data Platform Maturity