
How Real‑Time Data Warehouses Power Advertising: Architecture, Standards, and Best Practices

This article summarizes Liu Chong's DTCC2022 talk on building a real‑time advertising data warehouse, covering business context, layered model design, development technologies such as Flink and Kafka, full‑link quality assurance, practical implementation details, and future architectural directions.

ITPUB

Background and Motivation

In the era of real‑time data, advertising data warehouses are shifting from offline batch processing to streaming pipelines, enabling managers and analysts to monitor KPI fluctuations instantly, adjust operations based on market hotspots, and feed real‑time algorithms for better user service.

Advertising Ecosystem Overview

The internet advertising ecosystem consists of three parties: the ad platform, B‑side advertisers, and C‑side users. Mobile app channels dominate revenue (≈89%), while PC and mini‑programs account for the remaining share. Core systems include CRM, delivery, billing, operation, finance, and marketing platforms.

Value of Real‑Time Advertising Data

Real‑time visualization: KPI dashboards show live metrics such as CTR and CPR, crucial during high‑traffic events like Double 11.

Monitoring and diagnosis: Immediate detection of anomalies (e.g., unexpected revenue drops) allows rapid troubleshooting and loss mitigation.

Algorithmic decision‑making: Real‑time bidding, creative optimization, and pricing adjustments empower advertisers lacking dedicated marketing teams.
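
To make the dashboard metric concrete, here is a minimal sketch of a per‑minute CTR computed over tumbling one‑minute windows. It is plain Python rather than a streaming job, and the event shape (a timestamp in seconds plus an event type) and the function name minute_ctr are assumptions for illustration, not details from the talk.

```python
from collections import defaultdict


def minute_ctr(events):
    """Aggregate (timestamp_seconds, event_type) pairs into per-minute CTR."""
    impressions = defaultdict(int)
    clicks = defaultdict(int)
    for ts, event_type in events:
        window_start = ts - ts % 60  # start of the tumbling one-minute window
        if event_type == "impression":
            impressions[window_start] += 1
        elif event_type == "click":
            clicks[window_start] += 1
    # CTR = clicks / impressions; windows without impressions are skipped
    return {w: clicks[w] / n for w, n in sorted(impressions.items()) if n > 0}


# Hypothetical sample events: (timestamp in seconds, event type)
events = [
    (0, "impression"), (10, "impression"), (20, "click"),
    (30, "impression"), (45, "impression"),
    (61, "impression"), (62, "click"), (90, "impression"),
]
ctr_by_minute = minute_ctr(events)
```

In a production pipeline the same grouping would typically be expressed as a windowed aggregation in the streaming engine; the sketch only shows the arithmetic behind the number on the dashboard.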

Real‑Time Data Warehouse Layering

The warehouse follows the same four‑layer structure as offline warehouses:

ODS: Raw event logs from MySQL binlog and Kafka streams.

DWD: Detailed layer for cleaning, filtering, and dimension expansion.

DWS: Summary wide‑table layer improving data usability and metric consistency.

DIM: Real‑time dimension tables that capture minute‑level state changes (e.g., bid status, creative status) unavailable in offline snapshots.
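
The way the four layers hand data to each other can be sketched in a few lines of plain Python. Everything here is hypothetical: the field names (creative_id, advertiser_id, cost), the in‑memory dictionary standing in for a DIM table, and the function names. A real implementation would run these steps as streaming jobs against external storage.

```python
from collections import defaultdict

# DIM: latest known state per creative (hypothetical sample data)
dim_creative = {
    "c1": {"advertiser_id": "a1", "status": "active"},
    "c2": {"advertiser_id": "a2", "status": "active"},
}


def apply_dim_change(dim, change):
    """Apply one binlog-style change event to the in-memory dimension table."""
    dim.setdefault(change["creative_id"], {}).update(change["fields"])


def ods_to_dwd(raw_events, dim):
    """DWD step: drop malformed rows, then widen each row with DIM attributes."""
    for event in raw_events:
        if event.get("creative_id") is None or event.get("cost") is None:
            continue  # cleaning: skip incomplete records
        attrs = dim.get(event["creative_id"])
        if attrs is None:
            continue  # filtering: unknown creative
        yield {
            **event,
            "advertiser_id": attrs["advertiser_id"],
            "creative_status": attrs["status"],
        }


def dwd_to_dws(detail_rows):
    """DWS step: summarize detail rows into one wide row per advertiser."""
    summary = defaultdict(lambda: {"events": 0, "cost": 0.0})
    for row in detail_rows:
        agg = summary[row["advertiser_id"]]
        agg["events"] += 1
        agg["cost"] += row["cost"]
    return dict(summary)


# ODS: raw events as they arrive (hypothetical sample data)
ods_events = [
    {"creative_id": "c1", "cost": 0.5},
    {"creative_id": "c2", "cost": 1.0},
    {"creative_id": None, "cost": 0.2},  # malformed, removed in DWD
    {"creative_id": "c1", "cost": 0.25},
]

# A status change arrives before the events are processed
apply_dim_change(dim_creative, {"creative_id": "c2", "fields": {"status": "paused"}})
dwd_rows = list(ods_to_dwd(ods_events, dim_creative))
dws_summary = dwd_to_dws(dwd_rows)
```

The point of the DIM lookup is that the DWD row carries the creative's status at processing time, which an offline daily snapshot cannot provide.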

Real‑time warehouse layer diagram

Development Technologies

ODS ingestion relies on Kafka for user‑behavior streams and MySQL binlog for transactional updates. The computation layer primarily uses Flink with SQL support, offering higher productivity than low‑level API development. Additional engines such as Storm (SQL‑enabled), Hive, Spark, and storage solutions like Doris, Blade, and Meituan's internal KV engine Tair are also employed depending on latency and scale requirements.

Quality Assurance Framework

Quality is ensured through three stages:

Pre‑development prevention: Technical stack selection, naming conventions, and task‑level operation standards (grading, fault‑handling procedures).

Mid‑development testing: Extensive test suites covering edge cases, backup link isolation, and peak‑load stress tests (e.g., Double 11 traffic spikes).

Post‑deployment monitoring: Latency SLAs per task priority, anomaly detection for data drops or spikes, and automated alerts for KPI deviations.
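
The post‑deployment checks can be illustrated with a small sketch: a latency SLA lookup keyed by task priority, and a drop/spike test against a trailing average. The priority labels, SLA values, task names, and the 50% tolerance are all assumptions chosen for the example; the talk does not specify them.

```python
# Assumed latency budgets in seconds per task priority (illustrative values)
SLA_SECONDS = {"P0": 60, "P1": 300, "P2": 900}


def sla_violations(task_latencies):
    """Return names of tasks whose latency exceeds the SLA for their priority.

    task_latencies: iterable of (task_name, priority, latency_seconds).
    """
    return [
        name
        for name, priority, latency in task_latencies
        if latency > SLA_SECONDS[priority]
    ]


def detect_anomaly(history, current, tolerance=0.5):
    """Classify `current` as 'drop', 'spike', or None versus the trailing mean."""
    if not history:
        return None  # nothing to compare against yet
    baseline = sum(history) / len(history)
    if baseline == 0:
        return "spike" if current > 0 else None
    change = (current - baseline) / baseline
    if change <= -tolerance:
        return "drop"
    if change >= tolerance:
        return "spike"
    return None


# Hypothetical task names and measurements
late_tasks = sla_violations([
    ("revenue_dws", "P0", 95),
    ("creative_dim", "P1", 120),
])
drop_status = detect_anomaly([100, 110, 90], 40)
normal_status = detect_anomaly([100, 110, 90], 105)
spike_status = detect_anomaly([100, 110, 90], 160)
```

A trailing mean is only one possible baseline; comparing against the same minute on the previous day or week is a common alternative when traffic has strong daily patterns.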

End‑to‑End Development Process

The typical workflow includes metric definition, technical solution review, code implementation, logical testing, performance testing, and continuous monitoring. Key deliverables are model specifications, metric naming and alignment with offline counterparts, compute and storage choices (Flink+Kafka, Doris, Blade, etc.), and defined anomaly thresholds (e.g., revenue variance >10%).
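
One way to read the alignment and threshold deliverables is as a reconciliation check between real‑time metrics and their offline counterparts. The sketch below assumes the 10% threshold is a relative deviation from the offline value, which is an interpretation rather than something the talk states; the metric names and numbers are hypothetical.

```python
def relative_deviation(realtime, offline):
    """Absolute deviation of the real-time value, relative to the offline value."""
    if offline == 0:
        raise ValueError("offline baseline is zero; relative deviation undefined")
    return abs(realtime - offline) / offline


def check_alignment(realtime_by_metric, offline_by_metric, threshold=0.10):
    """Return {metric: reason} for metrics that are missing or deviate too much."""
    alerts = {}
    for metric, offline_value in offline_by_metric.items():
        realtime_value = realtime_by_metric.get(metric)
        if realtime_value is None:
            alerts[metric] = "missing in real-time output"
            continue
        deviation = relative_deviation(realtime_value, offline_value)
        if deviation > threshold:
            alerts[metric] = f"deviation {deviation:.1%} exceeds {threshold:.0%}"
    return alerts


# Hypothetical daily totals from the two pipelines
realtime_totals = {"revenue": 880.0, "clicks": 1010}
offline_totals = {"revenue": 1000.0, "clicks": 1000, "impressions": 50000}
alerts = check_alignment(realtime_totals, offline_totals)
```

Keeping metric names identical across the real‑time and offline models is what makes a check like this a simple dictionary comparison.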

Practical Insights from Meituan

Meituan built an internal real‑time warehouse platform offering code editing, job management, debugging, and parameter tuning (concurrency, memory). The platform also provides data lineage visualization, showing upstream dimension tables and downstream storage formats (JSON, etc.). Visualization dashboards operate at minute‑level granularity, enabling fine‑grained traffic anomaly detection and consumption diagnostics. Intelligent pricing leverages real‑time signals to auto‑adjust bids for millions of small advertisers, reducing budget waste.

Real‑time task platform screenshot

Future Directions

Current implementations often adopt a Lambda architecture, separating streaming and batch pipelines, which leads to duplicated logic and costly testing. Emerging approaches explore Kappa or hybrid Lambda+Hudi designs to unify storage and computation, improve consistency, and lower operational overhead.
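
The duplicated‑logic problem, and the Kappa‑style answer to it, can be shown in miniature. In the sketch below a single function handles every event, and recomputation is done by replaying the retained log through that same function instead of maintaining a separate batch job. This is a generic illustration of the idea, not the design presented in the talk, and all names and data are hypothetical.

```python
def apply_event(state, event):
    """The single processing function shared by live consumption and replay."""
    campaign = event["campaign"]
    state[campaign] = state.get(campaign, 0.0) + event["cost"]
    return state


def run(log, start_offset=0, state=None):
    """Fold events from `start_offset` onward into `state` and return it."""
    state = {} if state is None else state
    for event in log[start_offset:]:
        apply_event(state, event)
    return state


# Hypothetical retained event log
log = [
    {"campaign": "x", "cost": 1.0},
    {"campaign": "y", "cost": 2.0},
    {"campaign": "x", "cost": 0.5},
]

# Live path: two events processed first, the third arrives later
live_state = run(log[:2])
live_state = run(log, start_offset=2, state=live_state)

# Recomputation: replay the full log through the same code path
replayed_state = run(log)
```

Because both results come from one function, there is no second implementation to keep in sync or to test separately, which is the cost the Lambda approach incurs.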

