What Is a Cloud‑Native Data Platform? Architecture, Components, and Best Practices
This article traces the evolution and architecture of cloud‑native data platforms: their historical roots, the modern components (storage, ingestion, processing, metadata, and consumption), and practical guidance on selecting tools, designing pipelines, and applying best practices for scalable, flexible data infrastructure.
Development History
Early data platforms were built on relational databases (RDBMS) for OLTP and later added OLAP workloads for analysis. The data warehouse emerged in the late 1980s, and Bill Inmon's "Building the Data Warehouse" (1992) and Ralph Kimball's "The Data Warehouse Toolkit" (1996) established its foundational methodologies. Through the 1990s, commercial RDBMS such as IBM DB2, SQL Server, Teradata, and Oracle dominated the market.
With the rise of the internet in the early 2000s, "big data" challenges appeared. Google's MapReduce, GFS, and Bigtable papers described a distributed data stack distinct from traditional RDBMS. Hadoop popularized this stack (HDFS, YARN, HBase, Hive) and sparked a decade of ecosystem growth, including data lakes, NoSQL, and SQL‑on‑Hadoop solutions.
Cloud computing, led by AWS, exposed Hadoop's limitations—tight coupling of storage and compute, high operational cost, and poor support for streaming. The cloud‑native era pushes for managed services, elasticity, and pay‑per‑query models, prompting the emergence of new cloud‑native data platforms.
Data Platform Architecture
Classic Data Warehouse Architecture
Traditional data platforms rely on ETL tools to load source data into a warehouse, then expose services via SQL.
Data Lake Architecture
Data lakes store diverse semi‑structured and unstructured data (JSON, Avro, ProtoBuf, text, images, audio, video). They require more flexible processing beyond pure SQL.
Cloud‑Native Architecture
Cloud‑native platforms combine SaaS components, managed services, and pay‑as‑you‑go storage. They integrate data‑lake and warehouse concepts (lakehouse) and support both batch and streaming workloads.
a16z's unified data architecture diagram illustrates the flow from sources through ingestion, storage, query processing, and transformation to analysis, and shows how individual components can be adopted selectively.
Data Acquisition
Platforms must support both batch and streaming ingestion.
Batch Acquisition
Typical sources include files, FTP, or APIs; data is fetched on a schedule, providing near‑real‑time updates if the schedule is frequent.
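As a concrete illustration, the sketch below (Python, assuming a hypothetical REST endpoint, bucket name, and key layout) fetches one day of data and lands the raw payload in object storage; a scheduler would drive it periodically.

```python
# Minimal batch-acquisition sketch: pull one day's records from a
# hypothetical REST endpoint and land the raw payload in object storage.
# The endpoint URL, bucket name, and key layout are illustrative assumptions.
import datetime
import json

import boto3
import requests

def ingest_daily(run_date: datetime.date) -> str:
    resp = requests.get(
        "https://api.example.com/orders",          # hypothetical source API
        params={"date": run_date.isoformat()},
        timeout=60,
    )
    resp.raise_for_status()

    key = f"raw/orders/dt={run_date.isoformat()}/orders.json"
    boto3.client("s3").put_object(
        Bucket="my-data-lake",                     # hypothetical bucket
        Key=key,
        Body=json.dumps(resp.json()).encode("utf-8"),
    )
    return key

if __name__ == "__main__":
    # In practice a scheduler (cron, Airflow, etc.) would call this on a cadence.
    print(ingest_daily(datetime.date.today()))
```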
Streaming Acquisition
Streaming is essential for low‑latency use cases such as recommendation or fraud detection. CDC captures real‑time changes from databases, avoiding loss of intermediate state.
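A minimal sketch of the consuming side, assuming a Debezium topic on Kafka and the kafka-python client; the topic name and broker address are placeholders, while the "op"/"before"/"after" fields follow Debezium's standard change-event envelope.

```python
# Sketch: consume Debezium change-data-capture events from a Kafka topic.
import json

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "dbserver1.inventory.customers",          # typical Debezium topic naming
    bootstrap_servers="localhost:9092",       # placeholder broker
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    event = message.value
    payload = event.get("payload", event)     # envelope may be unwrapped
    op = payload.get("op")                    # c=create, u=update, d=delete, r=snapshot
    if op in ("c", "u", "r"):
        print("upsert", payload.get("after"))
    elif op == "d":
        print("delete", payload.get("before"))
```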
Recommended path: first build stable batch ingestion, then add streaming acquisition, and finally streaming processing (e.g., Flink).
Requirements & Products
Plugin architecture for many source types.
Operational observability and error handling.
Performance and stability for large volumes.
Products include cloud services (AWS Glue, Google Cloud Data Fusion, Azure Data Factory), SaaS (Fivetran, Stitch, Airbyte), open‑source (Apache NiFi, Kafka, Pulsar, Debezium), and serverless custom solutions.
Data Storage
Storage is split into slow (object‑store‑based lake) and fast (real‑time stores).
Slow Storage
Object storage (S3, GCS, Azure Blob Storage) forms the data lake; a lakehouse layer adds metadata, schema evolution, and transactions (Delta Lake, Iceberg).
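A minimal PySpark sketch of a lakehouse write with schema evolution enabled, assuming the delta-spark package is available and using a hypothetical table path and columns:

```python
# Sketch: append to a Delta Lake table on object storage with schema
# evolution enabled. Requires Spark with the delta-spark package;
# the path and column names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-write")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

df = spark.createDataFrame(
    [("o-1001", 42.5, "2024-05-01")],
    ["order_id", "amount", "order_date"],
)

(
    df.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")              # let new columns evolve the schema
    .save("s3a://my-data-lake/silver/orders")   # hypothetical table path
)
```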
Fast Storage
Fast stores support low‑latency queries (ClickHouse, KV stores) or streaming system storage (Kafka tiered storage).
Requirements & Products
Reliability (no data loss).
Scalability.
Performance (high throughput for slow, low latency for fast).
Typical choices: cloud vendor services, or open‑source projects like lakeFS, JuiceFS, SeaweedFS.
Data Processing
Batch Processing
Apache Spark is the dominant engine; Hive, Presto, Dremio also serve SQL workloads.
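A minimal batch SQL job on Spark, the kind of workload Hive or Presto could equally serve; the table path and query are illustrative assumptions.

```python
# Sketch: a batch SQL aggregation over lake data with Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-sql").getOrCreate()

# Register a hypothetical Parquet table from slow storage as a temp view.
spark.read.parquet("s3a://my-data-lake/silver/orders").createOrReplaceTempView("orders")

daily_revenue = spark.sql("""
    SELECT order_date, SUM(amount) AS revenue
    FROM orders
    GROUP BY order_date
""")

daily_revenue.write.mode("overwrite").parquet("s3a://my-data-lake/gold/daily_revenue")
```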
Streaming Processing
Apache Flink, Spark Streaming, and Kafka Streams handle real‑time pipelines; results are often written to analytical databases (ClickHouse, Pinot) or search stores.
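A sketch of such a pipeline using Spark Structured Streaming (Flink or Kafka Streams would fill the same role with their own APIs); the topic, brokers, and sink paths are assumptions, and the Kafka connector package must be on the classpath.

```python
# Sketch: read events from Kafka and maintain per-minute counts.
# Requires the spark-sql-kafka connector package.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("stream-counts").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")   # placeholder brokers
    .option("subscribe", "clickstream")                     # hypothetical topic
    .load()
    .selectExpr("CAST(value AS STRING) AS body", "timestamp")
)

counts = (
    events.withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "1 minute"))
    .count()
)

query = (
    counts.writeStream.outputMode("append")
    .format("parquet")
    .option("path", "s3a://my-data-lake/streaming/click_counts")
    .option("checkpointLocation", "s3a://my-data-lake/checkpoints/click_counts")
    .start()
)
query.awaitTermination()
```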
Unified Batch and Stream Processing
Flink treats batch as a special case of streaming; Apache Beam provides a unified DSL that can run on Spark or Flink.
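A minimal Beam sketch of the unified model: the same pipeline definition runs on the local DirectRunner here, and can be submitted to the Spark or Flink runner by changing pipeline options. The input and output paths are placeholders.

```python
# Sketch: a word-count-style Beam pipeline; the runner, not the pipeline
# code, determines whether it executes on Spark, Flink, or locally.
import apache_beam as beam

with beam.Pipeline() as pipeline:   # defaults to the local DirectRunner
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("/data/events-*.txt")   # placeholder input
        | "Parse" >> beam.Map(lambda line: line.strip())
        | "PairWithOne" >> beam.Map(lambda event: (event, 1))
        | "CountPerKey" >> beam.CombinePerKey(sum)
        | "Write" >> beam.io.WriteToText("/data/output/counts")  # placeholder output
    )
```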
Requirements & Products
Horizontal scalability.
Stability and failover.
Open APIs (SQL, SDKs).
Managed services include AWS EMR, Google Dataproc, Azure Databricks, Kinesis Data Analytics, Cloud Dataflow, Confluent, Upsolver, Materialize.
Metadata
Metadata includes platform configuration, lineage, schema, statistics, and operational logs, enabling monitoring, debugging, and governance.
Platform Metadata
Tracks data source configs, job status, schema evolution, and resource usage.
Business Metadata
Supports data catalog, tagging, discovery, and compliance (privacy, governance).
Open‑source solutions include Marquez, Apache Atlas, Amundsen, and DataHub; commercial catalogs include Atlan and Alation. Cloud catalogs: AWS Glue Data Catalog, Google Data Catalog, Azure Data Catalog.
Data Consumption
Consumption includes BI queries, data‑science notebooks, real‑time APIs, and metric stores.
Analytical Queries
Modern warehouses (BigQuery, Redshift, Snowflake) and lakehouse engines (Photon, Presto, Dremio) provide fast, interactive SQL.
Data Science
Python notebooks read raw files from slow storage for model training; feature stores and MLOps tooling integrate batch and real‑time features into that workflow.
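A notebook-style sketch of this consumption path, assuming a hypothetical feature table in Parquet on the lake, the s3fs package for S3 access, and illustrative column names:

```python
# Sketch: load a feature table from slow storage and fit a simple model.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Reading s3:// paths with pandas requires the s3fs package.
df = pd.read_parquet("s3://my-data-lake/gold/churn_features.parquet")

X = df[["tenure_days", "monthly_spend", "support_tickets"]]
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))
```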
Real‑Time Consumption
Results can be pushed to relational databases, KV stores, caches (Redis), or search engines (Elasticsearch); Flink can also expose internal job state directly for real‑time lookups.
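A small sketch of serving a precomputed streaming result through Redis; the key layout, Redis location, and scores are illustrative assumptions.

```python
# Sketch: write fresh recommendation scores to a low-latency store
# so an online API can read them with millisecond latency.
import json

import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

# Imagine these scores were just emitted by the streaming job.
scores = {"user:1001": 0.87, "user:1002": 0.12}

for key, score in scores.items():
    r.set(f"reco:{key}", json.dumps({"score": score}), ex=3600)  # expire after 1 hour

print(r.get("reco:user:1001"))
```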
Security & Governance
Access control, auditing, encryption, and data‑masking are essential; solutions include cloud IAM, Auth0, Immuta, and network security services (VPC, VPN).
Service‑Layer Products
Metric stores and modeling layers such as LookML, Transform, and Metlo expose unified data services.
Orchestration & ETL
Workflow engines schedule pipelines, handle dependencies, retries, and monitoring.
Orchestration
Tools include Airflow, Cloud Composer, Dagster, Prefect, Flyte, Argo (KubeFlow Pipelines), and DolphinScheduler.
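A minimal Airflow DAG sketch with a schedule, retries, and one task dependency; the DAG id and task bodies are placeholders.

```python
# Sketch: a daily pipeline where the transform step runs only after extraction succeeds.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling source data")

def transform():
    print("building the daily aggregate")

with DAG(
    dag_id="daily_orders_pipeline",           # hypothetical DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)

    extract_task >> transform_task            # dependency: extract before transform
```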
ETL
Low‑code/no‑code platforms (SmartETL, Fivetran, Stitch, Airbyte) simplify transformation; dbt brings software‑engineering practices (version control, testing, modularity) to data modeling.
Best Practices
Data Layering
Adopt layered modeling to separate raw, cleaned, and production‑ready data: classic warehouse layers (staging, integration, marts) or their lakehouse medallion equivalents (bronze, silver, gold).
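A PySpark sketch of the medallion layering idea, with hypothetical paths, columns, and cleaning rules:

```python
# Sketch: raw bronze data is cleaned into silver, then aggregated into gold.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("layering").getOrCreate()

# Bronze: raw, as-ingested records.
bronze = spark.read.json("s3a://my-data-lake/bronze/orders")

# Silver: deduplicated, typed, and filtered.
silver = (
    bronze.dropDuplicates(["order_id"])
    .withColumn("amount", F.col("amount").cast("double"))
    .filter(F.col("amount") > 0)
)
silver.write.mode("overwrite").parquet("s3a://my-data-lake/silver/orders")

# Gold: business-level aggregate ready for BI consumption.
gold = silver.groupBy("customer_id").agg(F.sum("amount").alias("lifetime_value"))
gold.write.mode("overwrite").parquet("s3a://my-data-lake/gold/customer_ltv")
```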
Separate Streaming Ingestion from Streaming Analytics
Use streaming ingestion to feed warehouses for near‑real‑time dashboards; use dedicated streaming analytics for high‑frequency, low‑latency use cases (e.g., gaming).
Control Cloud Costs
Monitor resource usage, design hot‑cold storage, and optimize data partitioning to avoid unnecessary expenses.
Avoid Tight Coupling
Design loosely coupled interfaces, encapsulate vendor‑specific APIs, and keep components replaceable.
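One way to keep components replaceable is a thin interface over the vendor API; the sketch below (with hypothetical class and function names) shows the pattern in Python.

```python
# Sketch: pipeline code depends on a small ObjectStore interface,
# never on boto3 directly, so the backend stays swappable.
from typing import Protocol

class ObjectStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class S3Store:
    """AWS-specific implementation, kept behind the ObjectStore interface."""
    def __init__(self, bucket: str):
        import boto3
        self._bucket = bucket
        self._client = boto3.client("s3")

    def put(self, key: str, data: bytes) -> None:
        self._client.put_object(Bucket=self._bucket, Key=key, Body=data)

    def get(self, key: str) -> bytes:
        return self._client.get_object(Bucket=self._bucket, Key=key)["Body"].read()

class LocalStore:
    """Drop-in replacement for tests or an on-prem deployment."""
    def __init__(self):
        self._data = {}

    def put(self, key: str, data: bytes) -> None:
        self._data[key] = data

    def get(self, key: str) -> bytes:
        return self._data[key]

def archive_report(store: ObjectStore, report: bytes) -> None:
    # Caller chooses the backend; this function never sees vendor details.
    store.put("reports/latest.bin", report)
```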
Data Platform Construction
Business Value
Data platforms drive efficiency, revenue growth, innovation, and compliance.
Construction Path
Follow an agile, scenario‑driven roadmap: self‑service BI → metric layers → automated pipelines → AI‑enhanced insights → action‑oriented APIs.