What Is a Cloud‑Native Data Platform? Architecture, Components, and Best Practices
This article traces the evolution and architecture of cloud‑native data platforms: their historical roots, the modern components (storage, ingestion, processing, metadata, and consumption), and practical guidance on selecting tools, designing pipelines, and applying best practices for scalable, flexible data infrastructure.
Development History
Early data platforms were built on relational databases (RDBMS) for OLTP and later added OLAP workloads for analysis. The data warehouse emerged in the late 1980s, and Bill Inmon's "Building the Data Warehouse" (1992) and Ralph Kimball's "The Data Warehouse Toolkit" (1996) established its foundational methodologies. Through the 1990s, commercial RDBMS such as IBM DB2, SQL Server, Teradata, and Oracle dominated the market.
With the rise of the internet in the early 2000s, "big data" challenges appeared. Google's MapReduce, GFS, and Bigtable papers described a distributed data stack distinct from traditional RDBMS. Hadoop popularized this stack (HDFS, YARN, HBase, Hive) and sparked a decade of ecosystem growth, including data lakes, NoSQL, and SQL‑on‑Hadoop solutions.
Cloud computing, led by AWS, exposed Hadoop's limitations—tight coupling of storage and compute, high operational cost, and poor support for streaming. The cloud‑native era pushes for managed services, elasticity, and pay‑per‑query models, prompting the emergence of new cloud‑native data platforms.
Data Platform Architecture
Classic Data Warehouse Architecture
Traditional data platforms rely on ETL tools to load source data into a warehouse, then expose services via SQL.
Data Lake Architecture
Data lakes store diverse semi‑structured and unstructured data (JSON, Avro, ProtoBuf, text, images, audio, video). They require more flexible processing beyond pure SQL.
Cloud‑Native Architecture
Cloud‑native platforms combine SaaS components, managed services, and pay‑as‑you‑go storage. They integrate data‑lake and warehouse concepts (lakehouse) and support both batch and streaming workloads.
a16z's unified data architecture diagram illustrates the flow from sources through ingestion, storage, query processing, and transformation to analysis, and shows how individual components can be adopted selectively.
Data Acquisition
Platforms must support both batch and streaming ingestion.
Batch Acquisition
Typical sources include files, FTP, or APIs; data is fetched on a schedule, providing near‑real‑time updates if the schedule is frequent.
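As a concrete illustration, the sketch below (Python, assuming a hypothetical REST endpoint, bucket name, and key layout) fetches one day of data and lands the raw payload in object storage; a scheduler would drive it periodically.

```python
# Minimal batch-acquisition sketch: pull one day's records from a
# hypothetical REST endpoint and land the raw payload in object storage.
# The endpoint URL, bucket name, and key layout are illustrative assumptions.
import datetime
import json

import boto3
import requests

def ingest_daily(run_date: datetime.date) -> str:
    resp = requests.get(
        "https://api.example.com/orders",          # hypothetical source API
        params={"date": run_date.isoformat()},
        timeout=60,
    )
    resp.raise_for_status()

    key = f"raw/orders/dt={run_date.isoformat()}/orders.json"
    boto3.client("s3").put_object(
        Bucket="my-data-lake",                     # hypothetical bucket
        Key=key,
        Body=json.dumps(resp.json()).encode("utf-8"),
    )
    return key

if __name__ == "__main__":
    # In practice a scheduler (cron, Airflow, etc.) would call this on a cadence.
    print(ingest_daily(datetime.date.today()))
```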
Streaming Acquisition
Streaming is essential for low‑latency use cases such as recommendation or fraud detection. CDC captures real‑time changes from databases, avoiding loss of intermediate state.
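A minimal sketch of the consuming side, assuming a Debezium topic on Kafka and the kafka-python client; the topic name and broker address are placeholders, while the "op"/"before"/"after" fields follow Debezium's standard change-event envelope.

```python
# Sketch: consume Debezium change-data-capture events from a Kafka topic.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "dbserver1.inventory.customers",          # typical Debezium topic naming
    bootstrap_servers="localhost:9092",       # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    payload = event.get("payload", event)     # envelope may be unwrapped
    op = payload.get("op")                    # c=create, u=update, d=delete, r=snapshot
    if op in ("c", "u", "r"):
        print("upsert", payload.get("after"))
    elif op == "d":
        print("delete", payload.get("before"))
```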
Recommended path: first build stable batch ingestion, then add streaming acquisition, and finally streaming processing (e.g., Flink).
Requirements & Products
Plugin architecture for many source types.
Operational observability and error handling.
Performance and stability for large volumes.
Products include cloud services (AWS Glue, Google Cloud Data Fusion, Azure Data Factory), SaaS (Fivetran, Stitch, Airbyte), open‑source (Apache NiFi, Kafka, Pulsar, Debezium), and serverless custom solutions.
Data Storage
Storage is split into slow (object‑store‑based lake) and fast (real‑time stores).
Slow Storage
Object storage (S3, GCS, Azure Blob Storage) forms the data lake; a lakehouse layer adds metadata, schema evolution, and transactions (Delta Lake, Iceberg).
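A minimal PySpark sketch of a lakehouse write with schema evolution enabled, assuming the delta-spark package is available and using a hypothetical table path and columns:

```python
# Sketch: append to a Delta Lake table on object storage with schema
# evolution enabled. Requires Spark with the delta-spark package;
# the path and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-write")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

df = spark.createDataFrame(
    [("o-1001", 42.5, "2024-05-01")],
    ["order_id", "amount", "order_date"],
)

(
    df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")              # let new columns evolve the schema
    .save("s3a://my-data-lake/silver/orders")   # hypothetical table path
)
```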
Fast Storage
Fast stores support low‑latency queries (ClickHouse, KV stores) or streaming system storage (Kafka tiered storage).
Requirements & Products
Reliability (no data loss).
Scalability.
Performance (high throughput for slow, low latency for fast).
Typical choices: cloud vendor services, or open‑source projects like lakeFS, JuiceFS, SeaweedFS.
Data Processing
Batch Processing
Apache Spark is the dominant engine; Hive, Presto, Dremio also serve SQL workloads.
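A minimal batch SQL job on Spark, the kind of workload Hive or Presto could equally serve; the table path and query are illustrative assumptions.

```python
# Sketch: a batch SQL aggregation over lake data with Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-sql").getOrCreate()

# Register a hypothetical Parquet table from slow storage as a temp view.
spark.read.parquet("s3a://my-data-lake/silver/orders").createOrReplaceTempView("orders")

daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

daily_revenue.write.mode("overwrite").parquet("s3a://my-data-lake/gold/daily_revenue")
```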
Streaming Processing
Apache Flink, Spark Streaming, and Kafka Streams handle real‑time pipelines; results are often written to analytical databases (ClickHouse, Pinot) or search stores.
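A sketch of such a pipeline using Spark Structured Streaming (Flink or Kafka Streams would fill the same role with their own APIs); the topic, brokers, and sink paths are assumptions, and the Kafka connector package must be on the classpath.

```python
# Sketch: read events from Kafka and maintain per-minute counts.
# Requires the spark-sql-kafka connector package.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-counts").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder brokers
    .option("subscribe", "clickstream")                     # hypothetical topic
    .load()
    .selectExpr("CAST(value AS STRING) AS body", "timestamp")
)

counts = (
    events.withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "1 minute"))
    .count()
)

query = (
    counts.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3a://my-data-lake/streaming/click_counts")
    .option("checkpointLocation", "s3a://my-data-lake/checkpoints/click_counts")
    .start()
)
query.awaitTermination()
```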
Unified Batch and Stream Processing
Flink treats batch as a special case of streaming; Apache Beam provides a unified DSL that can run on Spark or Flink.
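A minimal Beam sketch of the unified model: the same pipeline definition runs on the local DirectRunner here, and can be submitted to the Spark or Flink runner by changing pipeline options. The input and output paths are placeholders.

```python
# Sketch: a word-count-style Beam pipeline; the runner, not the pipeline
# code, determines whether it executes on Spark, Flink, or locally.
import apache_beam as beam

with beam.Pipeline() as pipeline:   # defaults to the local DirectRunner
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("/data/events-*.txt")   # placeholder input
        | "Parse" >> beam.Map(lambda line: line.strip())
        | "PairWithOne" >> beam.Map(lambda event: (event, 1))
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "Write" >> beam.io.WriteToText("/data/output/counts")  # placeholder output
    )
```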
Requirements & Products
Horizontal scalability.
Stability and failover.
Open APIs (SQL, SDKs).
Managed services include AWS EMR, Google Dataproc, Azure Databricks, Kinesis Data Analytics, Cloud Dataflow, Confluent, Upsolver, Materialize.
Metadata
Metadata includes platform configuration, lineage, schema, statistics, and operational logs, enabling monitoring, debugging, and governance.
Platform Metadata
Tracks data source configs, job status, schema evolution, and resource usage.
Business Metadata
Supports data catalog, tagging, discovery, and compliance (privacy, governance).
Open‑source solutions include Marquez, Apache Atlas, Amundsen, and DataHub; commercial catalogs include Atlan and Alation. Cloud catalogs: AWS Glue Data Catalog, Google Data Catalog, Azure Data Catalog.
Data Consumption
Consumption includes BI queries, data‑science notebooks, real‑time APIs, and metric stores.
Analytical Queries
Modern warehouses (BigQuery, Redshift, Snowflake) and lakehouse engines (Photon, Presto, Dremio) provide fast, interactive SQL.
Data Science
Python notebooks read raw files from slow storage for model training; feature stores and MLOps tooling integrate batch and real‑time features into that workflow.
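A notebook-style sketch of this consumption path, assuming a hypothetical feature table in Parquet on the lake, the s3fs package for S3 access, and illustrative column names:

```python
# Sketch: load a feature table from slow storage and fit a simple model.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Reading s3:// paths with pandas requires the s3fs package.
df = pd.read_parquet("s3://my-data-lake/gold/churn_features.parquet")

X = df[["tenure_days", "monthly_spend", "support_tickets"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```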
Real‑Time Consumption
Results can be pushed to relational databases, KV stores, caches (Redis), or search engines (Elasticsearch); Flink can also expose internal job state directly for real‑time lookups.
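A small sketch of serving a precomputed streaming result through Redis; the key layout, Redis location, and scores are illustrative assumptions.

```python
# Sketch: write fresh recommendation scores to a low-latency store
# so an online API can read them with millisecond latency.
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Imagine these scores were just emitted by the streaming job.
scores = {"user:1001": 0.87, "user:1002": 0.12}

for key, score in scores.items():
    r.set(f"reco:{key}", json.dumps({"score": score}), ex=3600)  # expire after 1 hour

print(r.get("reco:user:1001"))
```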
Security & Governance
Access control, auditing, encryption, and data‑masking are essential; solutions include cloud IAM, Auth0, Immuta, and network security services (VPC, VPN).
Service‑Layer Products
Metric stores and modeling layers such as LookML, Transform, and Metlo expose unified data services.
Orchestration & ETL
Workflow engines schedule pipelines, handle dependencies, retries, and monitoring.
Orchestration
Tools include Airflow, Cloud Composer, Dagster, Prefect, Flyte, Argo (KubeFlow Pipelines), and DolphinScheduler.
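A minimal Airflow DAG sketch with a schedule, retries, and one task dependency; the DAG id and task bodies are placeholders.

```python
# Sketch: a daily pipeline where the transform step runs only after extraction succeeds.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling source data")

def transform():
    print("building the daily aggregate")

with DAG(
    dag_id="daily_orders_pipeline",           # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task            # dependency: extract before transform
```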
ETL
Low‑code/no‑code platforms (SmartETL, Fivetran, Stitch, Airbyte) simplify transformation; dbt brings software‑engineering practices (version control, testing, modularity) to data modeling.
Best Practices
Data Layering
Adopt layered modeling to separate raw, cleaned, and production‑ready data: classic warehouse layers (staging, integration, marts) or their lakehouse medallion equivalents (bronze, silver, gold).
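A PySpark sketch of the medallion layering idea, with hypothetical paths, columns, and cleaning rules:

```python
# Sketch: raw bronze data is cleaned into silver, then aggregated into gold.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("layering").getOrCreate()

# Bronze: raw, as-ingested records.
bronze = spark.read.json("s3a://my-data-lake/bronze/orders")

# Silver: deduplicated, typed, and filtered.
silver = (
    bronze.dropDuplicates(["order_id"])
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("amount") > 0)
)
silver.write.mode("overwrite").parquet("s3a://my-data-lake/silver/orders")

# Gold: business-level aggregate ready for BI consumption.
gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("lifetime_value"))
gold.write.mode("overwrite").parquet("s3a://my-data-lake/gold/customer_ltv")
```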
Separate Streaming Ingestion from Streaming Analytics
Use streaming ingestion to feed warehouses for near‑real‑time dashboards; use dedicated streaming analytics for high‑frequency, low‑latency use cases (e.g., gaming).
Control Cloud Costs
Monitor resource usage, design hot‑cold storage, and optimize data partitioning to avoid unnecessary expenses.
Avoid Tight Coupling
Design loosely coupled interfaces, encapsulate vendor‑specific APIs, and keep components replaceable.
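One way to keep components replaceable is a thin interface over the vendor API; the sketch below (with hypothetical class and function names) shows the pattern in Python.

```python
# Sketch: pipeline code depends on a small ObjectStore interface,
# never on boto3 directly, so the backend stays swappable.
from typing import Protocol

class ObjectStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class S3Store:
    """AWS-specific implementation, kept behind the ObjectStore interface."""
    def __init__(self, bucket: str):
        import boto3
        self._bucket = bucket
        self._client = boto3.client("s3")

    def put(self, key: str, data: bytes) -> None:
        self._client.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._client.get_object(Bucket=self._bucket, Key=key)["Body"].read()

class LocalStore:
    """Drop-in replacement for tests or an on-prem deployment."""
    def __init__(self):
        self._data = {}

    def put(self, key: str, data: bytes) -> None:
        self._data[key] = data

    def get(self, key: str) -> bytes:
        return self._data[key]

def archive_report(store: ObjectStore, report: bytes) -> None:
    # Caller chooses the backend; this function never sees vendor details.
    store.put("reports/latest.bin", report)
```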
Data Platform Construction
Business Value
Data platforms drive efficiency, revenue growth, innovation, and compliance.
Construction Path
Follow an agile, scenario‑driven roadmap: self‑service BI → metric layers → automated pipelines → AI‑enhanced insights → action‑oriented APIs.