Building a Real‑Time Data Warehouse with Flink: Architecture, Implementation and Lessons Learned
This article describes how a fast‑growing company built a layered real‑time data warehouse on Flink, detailing the evolution from a simple 1.0 pipeline to a 2.0 architecture with ODS, DWD and ADS layers, dimension joins, exactly‑once sinks, HDFS partitioning, monitoring, and future improvements.
As user‑growth and product operations demand increasingly real‑time data, the company migrated from a simple 1.0 real‑time architecture to a more robust 2.0 design built on Apache Flink.
Real‑time Warehouse 1.0 relied on a single pipeline that quickly satisfied early needs but suffered from high development cost, missing data models, and limited scalability as use cases multiplied.
Real‑time Warehouse 2.0 introduces a three‑layer model inspired by offline warehouses:
ODS layer : stores raw app event logs and other raw source logs.
DWD (detail) layer : splits data into a public dimension layer and business‑specific layers, producing detailed behavior tables for each business (e.g., "Tongzhen", "Buluo").
ADS layer : provides ad‑hoc query capability by sinking wide tables into ClickHouse and serves real‑time dashboards from aggregated metrics stored in a proprietary KV store (wtable).
The technical stack includes Flink for stream processing, Kafka for messaging, ClickHouse for ad‑hoc queries, HDFS for batch storage, and a custom KV system (wtable) for fast look‑ups.
Implementation details are organized into five parts:
DWD construction : six functions – formatting, anti‑fraud filtering, deduplication, IMEI‑ID service, Kafka exactly‑once sink, and stream splitting.
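Two of the six DWD functions, deduplication and stream splitting, can be sketched as a plain-Java model. This is only an illustration under assumptions: the `Event` type and its `eventId` and `business` fields are hypothetical, and a real Flink job would more likely keep the seen-ID set in keyed state with a TTL and route records with side outputs rather than an in-memory map.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java model of two DWD steps: deduplication, then stream splitting.
// Event, eventId and business are hypothetical names, not taken from the article.
final class DwdSplitter {

    static final class Event {
        final String eventId;   // assumed to be unique per logical event
        final String business;  // business line, e.g. "Tongzhen" or "Buluo"

        Event(String eventId, String business) {
            this.eventId = eventId;
            this.business = business;
        }
    }

    // IDs seen so far; unbounded here, whereas a real job would expire old IDs
    private final Set<String> seen = new HashSet<>();

    // Drops events whose ID was already seen, then groups the rest by business line.
    Map<String, List<Event>> process(List<Event> input) {
        Map<String, List<Event>> byBusiness = new LinkedHashMap<>();
        for (Event e : input) {
            if (!seen.add(e.eventId)) {
                continue; // Set.add returns false when the ID is already present
            }
            byBusiness.computeIfAbsent(e.business, k -> new ArrayList<>()).add(e);
        }
        return byBusiness;
    }
}
```

Each per-business list stands in for one downstream stream, so a consumer interested only in one business line reads only its own records.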
Dimension join : three approaches – loading dimension tables in RichFlatMapFunction, caching dimensions in external stores (Redis, wtable, HBase), and using broadcast state for incremental updates, each with pros and cons.
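The second approach, looking dimensions up in an external store, is commonly paired with a small local cache to cut round trips. The sketch below models that idea in plain Java; the `Function` argument is a stand-in for a Redis, wtable, or HBase client, and the class name, the `UNKNOWN` placeholder, and the `|` output format are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Plain-Java model of a dimension join against an external store with a local LRU cache.
// All names here are hypothetical; externalLookup stands in for a real store client.
final class CachedDimensionJoin {
    private final Function<String, String> externalLookup;
    private final Map<String, String> cache;
    int externalCalls = 0; // exposed only to show how many lookups the cache saved

    CachedDimensionJoin(Function<String, String> externalLookup, final int capacity) {
        this.externalLookup = externalLookup;
        // accessOrder=true keeps entries ordered from least to most recently used
        this.cache = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > capacity; // evict the least recently used entry
            }
        };
    }

    // Enriches one fact record with the dimension value for its key.
    String join(String factKey, String factPayload) {
        String dim = cache.get(factKey);
        if (dim == null) {
            dim = externalLookup.apply(factKey);
            externalCalls++;
            if (dim == null) {
                dim = "UNKNOWN"; // missing dimension; cached too, so it can go stale
            }
            cache.put(factKey, dim);
        }
        return factPayload + "|" + dim;
    }
}
```

The cache illustrates the trade-off of this approach: fewer external calls, at the cost of serving stale values until an entry is evicted.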
Flink sink to ClickHouse : a JDBC‑based sink with checkpoint‑driven exactly‑once semantics.
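A checkpoint-driven sink typically buffers rows and flushes them both when the batch fills and when a checkpoint is taken. The plain-Java model below shows only that buffering logic; the `Consumer` stands in for a JDBC batch insert, and `invoke` and `onCheckpoint` are hypothetical names that loosely mirror where the calls would sit in a Flink sink implementing `CheckpointedFunction`. Flushing on checkpoint alone usually yields at-least-once delivery; exactly-once generally also needs idempotent or transactional writes, which this sketch does not model.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Plain-Java model of a sink that flushes on batch size and on checkpoint.
// batchWriter is a stand-in for a JDBC batch insert into ClickHouse.
final class CheckpointFlushingSink {
    private final Consumer<List<String>> batchWriter;
    private final int maxBatchSize;
    private final List<String> buffer = new ArrayList<>();

    CheckpointFlushingSink(Consumer<List<String>> batchWriter, int maxBatchSize) {
        this.batchWriter = batchWriter;
        this.maxBatchSize = maxBatchSize;
    }

    // Called once per record; flushes early when the batch is full.
    void invoke(String row) {
        buffer.add(row);
        if (buffer.size() >= maxBatchSize) {
            flush();
        }
    }

    // Called when a checkpoint is taken, so no buffered row outlives the checkpoint.
    void onCheckpoint() {
        flush();
    }

    private void flush() {
        if (buffer.isEmpty()) {
            return; // nothing to write
        }
        batchWriter.accept(new ArrayList<>(buffer)); // hand over a copy, then reset
        buffer.clear();
    }
}
```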
Asynchronous batch load to HDFS : uses BucketingSink (Hadoop 2.6) with 10‑minute partitions, configurable batch rollover interval and size.
sink.setBatchRolloverInterval(1 * 60 * 1000L); // roll to a new part file after 1 minute (milliseconds)
sink.setBatchSize(1024 * 1024 * 256L); // or once the part file reaches 256 MB (bytes)
Task and data‑quality monitoring : real‑time metrics (QPS, GC, CPU, lag, memory) are visualized in the internal Wstream platform; data quality monitors integrity, accuracy, consistency, and timeliness at one‑minute granularity.
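One way to picture the integrity check at one-minute granularity is to count records per minute at two points in the pipeline and flag the minutes where the counts differ. The plain-Java sketch below does exactly that and nothing more; the class, the choice of event time as the bucketing key, and the upstream/downstream naming are assumptions, not details from the article.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of a per-minute integrity check between two pipeline stages.
// All names are hypothetical; minutes are bucketed by event time in epoch milliseconds.
final class MinuteIntegrityCheck {
    private final Map<Long, Long> upstream = new TreeMap<>();
    private final Map<Long, Long> downstream = new TreeMap<>();

    // Maps a timestamp to its minute index since the epoch.
    static long minuteOf(long epochMillis) {
        return epochMillis / 60_000L;
    }

    void recordUpstream(long eventTimeMillis) {
        upstream.merge(minuteOf(eventTimeMillis), 1L, Long::sum);
    }

    void recordDownstream(long eventTimeMillis) {
        downstream.merge(minuteOf(eventTimeMillis), 1L, Long::sum);
    }

    // Returns, in ascending order, the upstream minutes whose downstream count differs.
    List<Long> mismatchedMinutes() {
        List<Long> bad = new ArrayList<>();
        for (Map.Entry<Long, Long> e : upstream.entrySet()) {
            long down = downstream.getOrDefault(e.getKey(), 0L);
            if (down != e.getValue()) {
                bad.add(e.getKey());
            }
        }
        return bad;
    }
}
```

The check only inspects minutes that appear upstream, so it detects loss and duplication for those minutes but not records that show up downstream in a minute the upstream never saw.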
Key outcomes include unified ODS/DWD/DWS layers that align offline and online warehouses, reduced Flink resource consumption through stream splitting, faster ad‑hoc analysis via ClickHouse, and shortened development cycles from weeks to days.
Future work will focus on improving lineage tracking, expanding streaming‑SQL adoption, enhancing checkpoint monitoring, strengthening job robustness, and further automating performance testing.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.
