Big Data 11 min read

Real-Time Data Warehouse Architecture Using Flink: Design, Implementation, and Challenges

This article details the design and implementation of a real‑time data warehouse for an advertising platform, covering business background, challenges, a Lambda‑based architecture, Flink stream processing setup, ETL logic, sink handling, and performance results, concluding with future improvement directions.

Zhuanzhuan Tech

Aug 24, 2022

Real-Time Data Warehouse Architecture Using Flink: Design, Implementation, and Challenges

1. Business Background

Data Center is a core module of the commercial platform; advertisers care about ad effectiveness, which for a second‑hand e‑commerce platform means helping merchants sell goods effectively. Data warehouse construction is the foundation of the Data Center, enabling real‑time, accurate, and stable output of effectiveness data.

Advertising data flow:

The entire chain spans multiple business lines. From a data perspective, advertisers directly experience increased exposure, followed by product views that meet user needs, then inquiries (e.g., shipping, price), negotiation, order confirmation, and payment. This generates six core metrics: exposure, click, billing, consultation, order, and payment.

Before Flink, real‑time effectiveness metrics were calculated offline with hour granularity, incurring a two‑hour latency that was unfriendly to the business.

2. Challenges in Data Warehouse Construction

How to conveniently aggregate data from multiple sources. Advertising drives revenue; how to ensure data accuracy, stability, and real‑time capability. Need framework‑level reusability and easy extensibility. When anomalies occur, the system must fail fast, alert, allow easy re‑run, and maintain data consistency. Must support both real‑time and offline query scenarios.

3. Data Warehouse Implementation Plan

3.1 Overall Data Warehouse Solution

3.1.1 Architecture Selection

With the evolution of big‑data applications, mature data‑warehouse architectures such as Lambda and Kappa have emerged. Considering the high cost of historical data replay in Kappa and the platform's capabilities, the Lambda architecture was chosen. Lambda consists of three layers:

Batch layer: Generates views from full historical data; slow but can fix almost all problems.

Speed layer: Processes new big data in real time; low latency, results usable immediately, though may be less complete or accurate than batch results, which replace the view after batch finishes.

Serving layer: Responds to queries and returns results.

With this layered separation, the overall warehouse has a clear structure: the offline warehouse handles the batch layer, the real‑time warehouse handles the speed layer, and the data service provides the query interface.

3.1.2 Implementation Choice

Reports are divided into internal decision‑making and ToB (advertiser) scenarios. Initially, both were considered for OLAP engines, but due to operational cost and instability, two solutions were adopted: internal decisions use an OLAP analyzer, while ToB uses custom logic with MySQL and Redis storage.

Additionally, ToB queries are handled flexibly: after a day‑cross, yesterday's offline data output continues to use yesterday's real‑time data, with a reminder shown on the page.

3.2 Application of Flink in Real‑Time Data Warehouse

This section focuses on the Flink implementation. Flink, as a third‑generation stream‑processing framework, offers higher throughput than Storm, lower latency than Spark Streaming (milliseconds vs. seconds), supports terabyte‑scale state management, and flexible time windows.

3.2.1 Create Execution Environment

Configure checkpoint interval, execution mode, timeout, and state backend (using HDFS).

3.2.2 Define Data Source (DataSource)

The business mainly processes two kinds of metrics: basic metrics and effect metrics.

Basic metrics include exposure count, click count, and cost.

Effect metrics represent conversions for the client, i.e., post‑click downstream metrics such as number of messages (direct/indirect), private messages (direct/indirect), orders (direct/indirect), payments (direct/indirect), and GMV (direct/indirect).

Basic metrics are calculated directly from the stream data. Effect metrics require attribution rules and cross‑stream computation. Attribution has two types:

Direct effect: For the same product, a click and conversion occurring on the same day.

Indirect effect: A conversion occurring after a click on the merchant’s ad within the past 14 days.

1) Basic metric data source – click and billing share a single topic in one stream.

2) Effect metric and click data source are merged; the merged stream shares state for computation.

3.2.3 ETL

According to the dimensions required by the serving layer, key bucketing and reduction are built. The focus here is on UV and effect‑metric calculations.

3.2.3.1 UV Calculation

Using Flink’s state management, results are kept in state to ensure exactly‑once semantics even when checkpoints are re‑run.

3.2.3.2 Effect‑Metric Calculation

1) Custom dual‑stream processing function (Flink supports joining up to two streams). Two states are declared: click state (retained for 14 days to compute indirect effects) and effect‑metric state (retained for 3 hours to handle delayed effect data).

2) Process click data stream.

3) Process effect data stream.

3.2.4 Define Sink Time Window

A tumbling window based on processing time is used, emitting results every 10 seconds.

3.2.5 Sink

1) Custom sink processor.

2) Since all computation stays in Flink memory, the output does not require additional aggregation; Redis storage uses Hash.hmset to overwrite existing entries.

3.2.6 Overall DAG

The job processes 27 million records per second, maintains a state size of 6 GB, and has run stably for one and a half years.

4. Conclusion

Flink perfectly addresses the business’s real‑time pain points, reducing latency from two hours to two minutes, and its powerful API is highly extensible. Data‑warehouse construction is an ongoing iterative process: the real‑time warehouse solves timeliness, while the offline warehouse ensures data completeness and accuracy. The current solution fits the existing environment, though double computation remains; future work may explore a hybrid mode, applying Kappa for non‑core metrics.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

Flink data processing Streaming ETL Real-Time Data Warehouse Lambda architecture

Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.