Mastering Offline and Real-Time Data Warehouses: A Backend Engineer’s Guide
Backend developers need to understand both offline and real-time data warehouses. This guide explains data collection, layering, partitioning, typical use cases, archiving strategies, and how to build a real-time warehouse with Flink, covering practical steps, examples, and key considerations for efficient data processing.
Backend developers often encounter data warehouses alongside business code, middleware, and storage. This article summarizes key concepts of offline and real-time data warehouses from a backend perspective.
Offline Data Warehouse
Offline warehouses handle historical data for statistical analysis. Business data, monitoring events, and logs are collected offline, stored in Hive tables, and queried with SQL.
Data Collection
Business data: usually stored in DB or HBase, imported into Hive tables as batch loads.
Monitoring events: backend services emit events, which are consumed, parsed, and loaded into Hive.
Logs: collected via Filebeat, parsed, and stored in Hive.
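Each of the three collection paths ends the same way: a raw line or event is parsed into a flat record whose fields map onto Hive table columns. A minimal sketch of the log path, assuming a hypothetical `<ISO timestamp> <level> <json payload>` line format (the field names are illustrative, not from any specific Filebeat pipeline):

```python
import json

def parse_log_line(line: str) -> dict:
    """Parse one raw service log line into a flat record ready for a Hive load.

    Assumed format: '<ISO timestamp> <level> <json payload>'.
    """
    ts, level, payload = line.split(" ", 2)
    body = json.loads(payload)
    return {
        "event_time": ts,
        "level": level,
        "user_id": body.get("user_id"),
        "action": body.get("action"),
        "dt": ts[:10],  # partition key: date portion of the timestamp
    }

line = '2024-05-01T12:00:00Z INFO {"user_id": 42, "action": "login"}'
rec = parse_log_line(line)
```

The `dt` field is worth calling out: deriving the partition key at parse time is what lets the load step write each record straight into the correct daily partition.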
Data Layering
ODS (Operational Data Store): raw data, no processing.
DWD (Data Warehouse Detail): cleaned and flattened data, e.g., JSON parsing.
DWS (Data Warehouse Summary): aggregated wide tables for business consumption.
ADS (Application Data Service): ready‑to‑use reporting data.
This layering mirrors typical backend code architecture (interface, service, repository) and isolates changes to each layer, reducing impact on downstream consumers.
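The ODS → DWD → DWS flow can be sketched in miniature. This is a plain-Python stand-in for what Hive SQL jobs do at each layer (the event fields and aggregation are illustrative, not a real schema):

```python
import json

# ODS: raw events exactly as they arrived (JSON strings, untouched)
ods = [
    '{"user_id": 1, "action": "buy", "amount": 30}',
    '{"user_id": 1, "action": "buy", "amount": 20}',
    '{"user_id": 2, "action": "view"}',
]

# DWD: cleaned, flattened detail rows (JSON parsed, missing fields defaulted)
dwd = [{**{"amount": 0}, **json.loads(e)} for e in ods]

# DWS: aggregated wide table keyed by user, ready for business consumption
dws = {}
for row in dwd:
    agg = dws.setdefault(row["user_id"], {"orders": 0, "total": 0})
    if row["action"] == "buy":
        agg["orders"] += 1
        agg["total"] += row["amount"]
```

Note how each layer only reads the one above it; if the raw JSON format changes, only the ODS→DWD step needs fixing, which is exactly the isolation the article describes.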
Data Partitioning
Daily incremental tables: records changed on a specific day.
Daily full tables: snapshots of all records for a day.
Hourly incremental tables: records changed within an hour.
Because of large volumes, not all offline tables are retained indefinitely; for example, ODS daily increments may be kept only for the last N days, while DWD daily full tables merge increments to provide a complete history.
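The merge step mentioned above (building a DWD daily full table from yesterday's snapshot plus today's increment) is essentially an upsert keyed by primary key. A minimal sketch, with rows modeled as dicts:

```python
def merge_partition(previous_full: dict, daily_increment: list) -> dict:
    """Build today's full snapshot from yesterday's snapshot plus today's changes.

    Rows are keyed by primary key; an incremental row overwrites the old version,
    and brand-new rows are simply added.
    """
    full = dict(previous_full)
    for row in daily_increment:
        full[row["id"]] = row
    return full

yesterday = {1: {"id": 1, "status": "new"}, 2: {"id": 2, "status": "paid"}}
increment = [{"id": 1, "status": "paid"}, {"id": 3, "status": "new"}]
today = merge_partition(yesterday, increment)
```

Because each day's full partition is derived this way, the short-retention ODS increments can be dropped without losing history.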
Offline Use Cases
Offline statistics: complex joins and aggregations written in Hive SQL, which Hive compiles into MapReduce jobs over data stored on HDFS.
Data reconciliation: verify consistency between business parties (see the linked article on real‑time vs. offline reconciliation).
Back‑end historical data refresh: extract IDs from offline analysis to drive back‑end data updates.
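For the reconciliation case, the core check is a three-way diff between two sides' records keyed by a shared ID. A hedged sketch (the field names and sample amounts are made up for illustration):

```python
def reconcile(ours: dict, theirs: dict) -> dict:
    """Diff two sides' records keyed by order id.

    Returns ids missing on each side plus ids present on both
    sides but with mismatched values.
    """
    return {
        "missing_theirs": sorted(ours.keys() - theirs.keys()),
        "missing_ours": sorted(theirs.keys() - ours.keys()),
        "mismatched": sorted(
            k for k in ours.keys() & theirs.keys() if ours[k] != theirs[k]
        ),
    }

ours = {"o1": 100, "o2": 200, "o3": 300}
theirs = {"o2": 200, "o3": 999, "o4": 50}
diff = reconcile(ours, theirs)
```

In practice both sides would be Hive tables and the diff a SQL full outer join, but the logic is the same.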
Archiving Online Data
When online data grows large and is no longer accessed, it should be archived and then deleted. Always capture the data offline before deleting it. A common pattern is adding an archive_status column (default 0, set to 1 once the row has been written to Hive); online rows with archive_status = 1 can then be deleted safely.
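The ordering constraint here is the whole point: flip the flag only after the warehouse write succeeds. A minimal sketch of the two phases, with the online table and the Hive sink modeled as plain Python collections:

```python
def archive_batch(online_rows: list, hive_sink: list) -> None:
    """Mark rows archived only after they are safely captured offline."""
    for row in online_rows:
        if row["archive_status"] == 0:
            hive_sink.append(dict(row))   # step 1: capture offline first
            row["archive_status"] = 1     # step 2: only then flag as archived

def purge_archived(online_rows: list) -> list:
    """Delete online rows whose data is already in Hive (archive_status == 1)."""
    return [r for r in online_rows if r["archive_status"] != 1]

online = [{"id": 1, "archive_status": 0}, {"id": 2, "archive_status": 0}]
hive = []
archive_batch(online, hive)
online = purge_archived(online)
```

If the process crashes between the two steps, the worst case is a row archived twice, never a row deleted without a Hive copy.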
Real-Time Data Warehouse
Real‑time warehouses complement offline ones by providing sub‑second latency, enabling online services to query fresh results directly.
Construction Process
Real‑time ingestion: capture MySQL binlog or business events, stream them into Flink.
Data computation: use Flink to join multiple ODS tables, produce DWD tables, and continuously aggregate into DWS tables.
Export & query interface: write results to real‑time stores such as Doris, Hologres, MySQL, HBase, or Redis, and expose RPC/SQL interfaces for backend services.
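The computation step above can be sketched conceptually. This is a plain-Python stand-in for a Flink keyed aggregation (a real job would use Flink's DataStream API with keyed state; the event shape here is illustrative): each incoming event updates a running count per key, and each update is emitted downstream to the real-time store.

```python
from collections import defaultdict

def keyed_count(stream):
    """Consume an event stream and emit a running count per key after each
    event — a toy model of a continuously updated DWS aggregate."""
    counts = defaultdict(int)
    for event in stream:
        counts[event["item_id"]] += 1
        yield event["item_id"], counts[event["item_id"]]

events = [{"item_id": "a"}, {"item_id": "b"}, {"item_id": "a"}]
updates = list(keyed_count(events))
```

The key difference from the offline world is visible even in the toy: results are emitted per event rather than per daily batch, which is what makes sub-second freshness possible.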
Reference architecture (adapted from a cloud provider's diagram): real‑time ingestion → Flink computation → real‑time store → query interface.
Real-Time Use Cases
Internal reporting queries.
External statistical queries such as leaderboards, view counts, add‑to‑cart, favorites, and purchase metrics.
User personalization and recommendation.
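Taking the leaderboard case as an example: a backend typically reads a top-N view over the continuously updated counts. A minimal sketch of that read path (in production this would be a Redis sorted set or a Doris query; the item names and counts are made up):

```python
import heapq
from collections import Counter

# Continuously updated view counts, as maintained by the real-time pipeline
view_counts = Counter({"item_a": 120, "item_b": 340, "item_c": 95})

def top_n(counts: Counter, n: int) -> list:
    """Return the n highest-counted items — the leaderboard a backend serves."""
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])

board = top_n(view_counts, 2)
```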
Understanding these concepts equips backend engineers with the knowledge to design, implement, and maintain both offline and real‑time data warehouses effectively.
Java Baker
Java architect and Raspberry Pi enthusiast, dedicated to writing high-quality technical articles; the same name is used across major platforms.
