Designing a Real‑time Data Platform for Modern Data Warehouses
This article traces the evolution from traditional to modern data warehouses, outlines the four key capabilities of a real‑time data platform (real‑time data, virtualization, democratization, and collaboration), and presents an architecture built on unified collection, stream processing, compute, and visualization layers. It also discusses functional, quality, stability, cost, agility, and management considerations.
1. Background and Related Concepts
This section introduces the real‑time data platform from the perspective of modern data‑warehouse architecture. Traditional data warehouses (Fig 1) rely on batch ETL and deliver data with T+1 (next‑day) latency, while modern data warehouses (Fig 2) add diversified data sources, real‑time processing (T+0), and varied consumption patterns.
Figure 1 Traditional Data Warehouse
Figure 2 Modern Data Warehouse
The modern warehouse adds four core capabilities: real‑time data (synchronization and streaming), data virtualization (virtual compute and unified services), data democratization (visual, self‑service access), and data collaboration (multi‑tenant, cooperative workflows).
(1) Real‑time Data
Real‑time means that end‑to‑end latency, from source to consumption, is measured in milliseconds, seconds, or minutes. It covers real‑time extraction, streaming, in‑flight computation, and real‑time loading.
(2) Data Virtualization
Virtualization provides a unified query interface regardless of underlying heterogeneous databases, enabling transparent mixing of data sources.
Figure 4 Data Virtualization
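As a minimal sketch of the idea, the snippet below attaches two independent in‑memory SQLite databases to one connection and joins across them with a single query. SQLite only stands in for genuinely heterogeneous back‑ends here; the database, table, and function names are assumptions made for illustration and do not come from the article.

```python
import sqlite3


def build_virtual_layer() -> sqlite3.Connection:
    """Attach two separate in-memory databases behind one connection."""
    conn = sqlite3.connect(":memory:")
    # Each ATTACH creates an independent database; together they play the
    # role of two different physical sources (an assumption of this sketch).
    conn.execute("ATTACH DATABASE ':memory:' AS orders_db")
    conn.execute("ATTACH DATABASE ':memory:' AS crm_db")
    conn.execute(
        "CREATE TABLE orders_db.orders "
        "(order_id INTEGER, customer_id INTEGER, amount REAL)"
    )
    conn.execute("CREATE TABLE crm_db.customers (customer_id INTEGER, name TEXT)")
    conn.executemany(
        "INSERT INTO orders_db.orders VALUES (?, ?, ?)",
        [(1, 10, 25.0), (2, 11, 40.0), (3, 10, 15.0)],
    )
    conn.executemany(
        "INSERT INTO crm_db.customers VALUES (?, ?)",
        [(10, "alice"), (11, "bob")],
    )
    return conn


def revenue_by_customer(conn: sqlite3.Connection) -> list:
    """One query spans both sources; the caller never addresses them separately."""
    sql = """
        SELECT c.name, SUM(o.amount)
        FROM orders_db.orders AS o
        JOIN crm_db.customers AS c ON c.customer_id = o.customer_id
        GROUP BY c.name
        ORDER BY c.name
    """
    return conn.execute(sql).fetchall()
```

The consumer writes one SQL statement and receives one result set, which is the property a virtualization layer is meant to preserve when the sources are different engines.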
(3) Data Democratization
Ordinary users can use visual interfaces or self‑service SQL to access data without deep technical knowledge, leveraging cloud‑based compute and virtualization.
Typical supporting technologies include data‑virtualization software, data‑federation software, cloud storage, and self‑service BI applications.
(4) Data Collaboration
Both technical and business users can work on the same platform with multi‑tenant isolation, enabling cooperative BI activities.
2. Architecture Design
2.1 Positioning and Goals
The Real‑time Data Platform (RTDP) aims to provide end‑to‑end real‑time processing (millisecond/second/minute latency), multi‑source ingestion, and support for the four capabilities above, lowering development barriers and improving reliability.
2.2 Overall Design Architecture
The conceptual module diagram (Fig 6) shows four unified layers: data collection, streaming processing, compute services, and data visualization, with an open storage layer allowing heterogeneous back‑ends.
Figure 6 RTDP Overall Conceptual Architecture
2.3 Overall Design Idea
The unified abstraction (Fig 7) includes: unified data collection platform, unified streaming processing platform, unified compute service platform, and unified data visualization platform.
Figure 7 Overall Design Idea
(1) Unified Data Collection Platform
Supports full‑load and incremental extraction (e.g., reading database logs), normalizes data into a Unified Message Schema (UMS) that carries namespace and schema information, decoupling messages from physical transports such as Kafka topics.
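To make the idea of a self‑describing message concrete, here is a hedged sketch of an envelope that carries its namespace and schema alongside the data. The article does not specify the UMS layout, so the class and field names below (`UnifiedMessage`, `Field`, `rows`) are hypothetical and should not be read as the actual UMS format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class Field:
    """One column in the message schema (illustrative layout)."""
    name: str
    type: str
    nullable: bool = False


@dataclass
class UnifiedMessage:
    """Self-describing envelope: namespace + schema + rows.

    Because the namespace travels inside the message, a consumer can tell
    which logical table a row belongs to without inspecting the transport
    (for example, the Kafka topic it arrived on).
    """
    namespace: str   # hypothetical format: "<system>.<database>.<table>"
    fields: list     # list of Field objects
    rows: list       # each row is a list aligned with `fields`

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    def records(self) -> list:
        """Expand positional rows into dicts keyed by field name."""
        names = [f.name for f in self.fields]
        return [dict(zip(names, row)) for row in self.rows]
```

A consumer that receives such a message can route on `namespace` and decode `rows` using `fields`, which is what allows several logical tables to share one physical topic.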
(2) Unified Streaming Processing Platform
Consumes UMS or JSON messages, offers visual/configurable/SQL‑based development, supports idempotent writes to multiple heterogeneous targets, and provides multi‑tenant isolation.
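One common way to obtain idempotent writes is a versioned upsert: each record carries a key and a monotonically increasing version, and the sink applies a write only if it is newer than what is already stored. The sketch below assumes that scheme; the article does not say which mechanism the platform uses, and `IdempotentSink` is a hypothetical in‑memory stand‑in for a real target store.

```python
from typing import Any, Dict, Tuple


class IdempotentSink:
    """In-memory stand-in for a target store with versioned upserts."""

    def __init__(self) -> None:
        # key -> (version, value)
        self.rows: Dict[str, Tuple[int, Any]] = {}

    def upsert(self, key: str, version: int, value: Any) -> bool:
        """Apply the write only if it is newer than the stored version.

        Returns True when the row changed, False when the write was a
        replay or arrived out of order and was therefore ignored.
        """
        current = self.rows.get(key)
        if current is not None and current[0] >= version:
            return False
        self.rows[key] = (version, value)
        return True
```

With this rule, replaying a batch after a failure or receiving messages out of order leaves the sink in the same final state, so the stream job can retry without creating duplicates.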
(3) Unified Compute Service Platform
Implements data virtualization/federation, supports push‑down computation across heterogeneous sources, exposes unified JDBC/REST interfaces and a common SQL dialect, and enables metadata, quality, and security services.
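Push‑down computation can be illustrated with a small sketch: instead of pulling every row into the federation layer, the filter and a partial aggregate are sent to each source, and only one number per source is merged. Two separate in‑memory SQLite connections stand in for heterogeneous sources; the table layout and function names are assumptions of this example.

```python
import sqlite3
from typing import Iterable, List, Tuple


def make_source(rows: List[Tuple[int, float]]) -> sqlite3.Connection:
    """Create an independent source holding an `events` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (user_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO events VALUES (?, ?)", rows)
    return conn


def federated_sum(sources: Iterable[sqlite3.Connection], min_amount: float) -> float:
    """Sum `amount` across sources for rows with amount >= min_amount.

    The WHERE clause and SUM run inside each source (the push-down),
    so each source returns a single partial result rather than raw rows.
    """
    partials = []
    for conn in sources:
        (partial,) = conn.execute(
            "SELECT COALESCE(SUM(amount), 0) FROM events WHERE amount >= ?",
            (min_amount,),
        ).fetchone()
        partials.append(partial)
    # Only the final merge happens in the federation layer.
    return sum(partials)
```

SUM decomposes cleanly into per‑source partial sums; operators that do not decompose this way (a median, for example) would need a different plan.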
(4) Unified Data Visualization Platform
Combines multi‑tenant user/permission management with visual analytics, facilitating cross‑department collaboration and the “last‑mile” data application.
3. Specific Issues and Considerations
(1) Functional Considerations
Real‑time pipelines can handle certain ETL operators (left join, inner join, union, filter, map, project) but not full‑table aggregations. A hybrid approach that mixes streaming with periodic batch compute can cover all complex ETL logic.
Figure 8 Data Processing Architecture Evolution
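The hybrid split described above can be sketched as two stages: a streaming stage that applies per‑record operators (filter, map, project) as events arrive, and a batch stage that periodically runs the full‑table aggregation over the landed data. The event fields and function names are hypothetical, and plain Python generators stand in for a real stream engine.

```python
from typing import Dict, Iterable, Iterator, List


def stream_stage(events: Iterable[dict]) -> Iterator[dict]:
    """Per-record operators only; no state spanning the whole table."""
    for event in events:
        if event["status"] != "paid":          # filter
            continue
        yield {                                 # map + project
            "user": event["user"],
            "amount_cents": round(event["amount"] * 100),
        }


def batch_stage(table: List[dict]) -> Dict[str, int]:
    """Full-table aggregation, run periodically over the landed rows."""
    totals: Dict[str, int] = {}
    for row in table:
        totals[row["user"]] = totals.get(row["user"], 0) + row["amount_cents"]
    return totals
```

In use, the output of `stream_stage` would be landed in storage continuously, while `batch_stage` would be scheduled at whatever interval the business logic tolerates.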
(2) Quality Considerations
Both the Lambda and Kappa architectures ensure eventual consistency; further discussion of hybrid models is deferred to future articles.
(3) Stability Considerations
High availability, SLA guarantees, elastic resilience, comprehensive monitoring, automated operations, and upstream metadata change tolerance are essential.
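Tolerance of upstream metadata changes can take many forms. One simple approach, sketched here as an assumption rather than as the article's design, is to conform every incoming record to the target schema: unknown columns added upstream are ignored, and columns missing from a record fall back to a default. The function and schema names are hypothetical.

```python
from typing import Any, Dict


def conform(record: Dict[str, Any], target_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Project a record onto the target schema.

    `target_schema` maps each expected column to its default value.
    Extra upstream columns are dropped; missing ones take the default,
    so an upstream ALTER TABLE does not break the pipeline.
    """
    return {col: record.get(col, default) for col, default in target_schema.items()}
```

This keeps the job running through additive or subtractive schema changes; type changes to an existing column would still need explicit handling.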
(4) Cost Considerations
The goal is to reduce labor, resource, operations, and trial‑and‑error costs through democratization, dynamic resource utilization, and agile development.
(5) Agility Considerations
Agile big‑data development emphasizes configuration‑driven, SQL‑driven, and democratized workflows.
(6) Management Considerations
Management focuses on unified metadata management and data security across heterogeneous storage, with optional integration with external governance platforms.
In summary, the article presents a complete design blueprint for a Real‑time Data Platform that integrates real‑time, virtualization, democratization, and collaboration capabilities, and outlines the next steps for concrete technology selection and open‑source implementations.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
