How Autohome Built a Flink‑StarRocks Real‑Time Ad Data Warehouse
This article details Autohome's transition from an hourly offline ad data warehouse to a Flink‑StarRocks real‑time architecture, covering background, engine and storage selection, multi‑layer design, implementation steps, encountered issues, monitoring strategies, and future roadmap to achieve second‑level data freshness and high accuracy.
Background
Autohome, a leading automotive website, generates massive volumes of advertising effect data every day (requests, impressions, viewable impressions, clicks). The existing offline data warehouse, in place since 2015, could no longer meet the growing need for the low‑latency, high‑frequency analytics that risk control and fine‑grained operations require.
Current Offline Warehouse Architecture
The offline warehouse standardizes ad data and provides OLAP access for business analysts. Initially, only hourly data was produced because real‑time demand was low and streaming technologies were not mature.
Real‑Time Warehouse Technical Architecture
After evaluating Storm, Spark Streaming, and Flink, the team chose Flink for its exactly‑once guarantees and native stream processing model. Four storage engine candidates (ClickHouse, StarRocks, TiDB, Iceberg) were compared; StarRocks was selected for its performance, SQL support, and lower operational cost.
Storage Engine Selection
Iceberg was discarded due to minute‑level latency.
TiDB lacked pre‑aggregation and had weaker indexing, leading to higher cluster load.
ClickHouse lacked full SQL support and had higher maintenance overhead.
StarRocks offered second‑level latency, robust SQL, and efficient aggregation, making it the final choice.
Real‑Time Warehouse Layer Design
The OneData methodology was applied, dividing the warehouse into four layers:
ODS (Source Layer): Raw logs and MySQL binlog are ingested into Kafka.
DWD (Detail Layer): Flink enriches, joins, and aggregates data, writing results to StarRocks.
DWA (Aggregation Layer): ETL creates wide tables or metric aggregates from DWD.
APP (Application Layer): Business dashboards consume the aggregated data for reporting.
Advertising Effect Application
The end‑to‑end flow includes server‑side request logs, client‑side viewable impression and click logs, Kafka ingestion, Flink cleaning and aggregation, and finally storage in StarRocks for OLAP visualization.
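To make the aggregation step of this flow concrete, here is a minimal Python sketch. It assumes each cleaned event carries a hypothetical `slot_id` and an `event_type` of `request`, `impression`, `viewable`, or `click`; these field names and the in-memory counters are illustrative assumptions. In the pipeline described here, Flink does this work and the results are stored in StarRocks.

```python
from collections import Counter, defaultdict

# Assumed event types, mirroring the four kinds of ad effect data in the text.
EVENT_TYPES = ("request", "impression", "viewable", "click")


def aggregate_by_slot(events):
    """Count each known event type per ad slot; skip unknown event types."""
    counts = defaultdict(Counter)
    for event in events:
        if event["event_type"] in EVENT_TYPES:
            counts[event["slot_id"]][event["event_type"]] += 1
    return counts


def click_through_rate(slot_counts):
    """Clicks divided by impressions; None when there are no impressions."""
    impressions = slot_counts["impression"]
    if impressions == 0:
        return None
    return slot_counts["click"] / impressions


# Hypothetical sample events for two ad slots.
events = [
    {"slot_id": "s1", "event_type": "request"},
    {"slot_id": "s1", "event_type": "impression"},
    {"slot_id": "s1", "event_type": "impression"},
    {"slot_id": "s1", "event_type": "viewable"},
    {"slot_id": "s1", "event_type": "click"},
    {"slot_id": "s2", "event_type": "request"},
]
counts = aggregate_by_slot(events)
```

Returning `None` for a slot with no impressions avoids a division by zero and keeps "no data" distinct from a genuine rate of zero.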
Flink Development Detailed Process
ODS Development: Define Kafka tables in Flink DDL for click, exposure, and viewable exposure events.
DWD Development: Define StarRocks tables via Flink DDL, enable dynamic partitioning on the dt field, and configure sorting keys.
Data Cleaning Rules: Filter out invalid records based on pv_id and filter, merge the click and viewable‑exposure tables, then join the result with the exposure data to produce a complete detail table that is written to StarRocks.
DWA Development: Create materialized views in StarRocks to automatically maintain aggregated results, reducing query latency.
APP Development: Build ad‑plan and ad‑slot analysis datasets in the OLAP platform, exposing metrics such as request volume, click‑through rate, and conversion rate.
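The data cleaning rule in the DWD step can be sketched in plain Python as follows. This is an illustration under assumptions, since the article does not show the schema: the `pv_id` join key comes from the text, while the validity rule (a non‑empty `pv_id`), the record layout, and the function names are hypothetical. The real job expresses this logic in Flink and writes the result to StarRocks.

```python
def is_valid(record):
    """Assumed rule: a record is valid when it has a non-empty pv_id."""
    return bool(record.get("pv_id"))


def merge_click_and_viewable(clicks, viewables):
    """Merge the two client-side streams into one flag record per pv_id."""
    merged = {}
    for record in viewables:
        if is_valid(record):
            flags = merged.setdefault(record["pv_id"], {"viewable": 0, "click": 0})
            flags["viewable"] = 1
    for record in clicks:
        if is_valid(record):
            flags = merged.setdefault(record["pv_id"], {"viewable": 0, "click": 0})
            flags["click"] = 1
    return merged


def build_detail_rows(exposures, clicks, viewables):
    """Left-join valid exposures with the merged click/viewable flags."""
    merged = merge_click_and_viewable(clicks, viewables)
    rows = []
    for exposure in exposures:
        if not is_valid(exposure):
            continue  # drop exposures that cannot be joined on pv_id
        flags = merged.get(exposure["pv_id"], {"viewable": 0, "click": 0})
        rows.append({**exposure, **flags})
    return rows


# Hypothetical sample data: one exposure and one viewable record are invalid.
exposures = [
    {"pv_id": "a", "slot_id": "s1"},
    {"pv_id": "b", "slot_id": "s1"},
    {"pv_id": "", "slot_id": "s2"},
]
clicks = [{"pv_id": "a"}]
viewables = [{"pv_id": "a"}, {"pv_id": "b"}, {"pv_id": None}]
detail_rows = build_detail_rows(exposures, clicks, viewables)
```

A left join from the exposure side keeps every valid exposure in the detail table, even when no click or viewable record has been seen for it.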
Issues and Solutions
Writing to the sink in JSON format under high concurrency caused failures; switching to CSV reduced the payload size.
Complex StarRocks views sometimes failed; rebuilding the views or upgrading StarRocks to version 2.1.8 resolved the instability.
Click logs can arrive long after the matching exposure logs, so the join between click and exposure data required a 4‑hour cache to achieve >95% accuracy.
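The 4‑hour join window can be illustrated with a small event‑time cache. The sketch below is a simplified, single‑process stand‑in and not the team's Flink job: exposures are kept by `pv_id` for a retention period, a click that arrives within that period is matched, and older exposures are evicted. The class and field names are hypothetical; only the 4‑hour figure comes from the text.

```python
from datetime import datetime, timedelta


class ExposureCache:
    """Keep exposures for a retention period so late clicks can still join."""

    def __init__(self, retention=timedelta(hours=4)):
        self.retention = retention
        self._exposures = {}  # pv_id -> (exposure_time, exposure_record)

    def add_exposure(self, pv_id, exposure_time, record):
        self._exposures[pv_id] = (exposure_time, record)

    def evict_older_than(self, now):
        """Drop exposures whose timestamp is older than now - retention."""
        cutoff = now - self.retention
        expired = [p for p, (ts, _) in self._exposures.items() if ts < cutoff]
        for pv_id in expired:
            del self._exposures[pv_id]

    def match_click(self, pv_id, click_time):
        """Return the joined row, or None if the exposure is missing or expired."""
        # Simplification: the click's own timestamp is used as the current time.
        self.evict_older_than(click_time)
        cached = self._exposures.get(pv_id)
        if cached is None:
            return None
        exposure_time, record = cached
        return {**record, "pv_id": pv_id,
                "exposure_time": exposure_time, "click_time": click_time}


# Hypothetical timeline: one click arrives after 3 hours, another after 5.
cache = ExposureCache()
t0 = datetime(2022, 1, 1, 8, 0, 0)
cache.add_exposure("a", t0, {"slot_id": "s1"})
cache.add_exposure("b", t0, {"slot_id": "s2"})
on_time = cache.match_click("a", t0 + timedelta(hours=3))
too_late = cache.match_click("b", t0 + timedelta(hours=5))
```

The retention period is a trade‑off: a longer window matches more late clicks and raises accuracy, at the cost of holding more state. This sketch also ignores clicks that arrive before their exposure, which a production join would need to handle.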
Service Stability Assurance
Monitoring includes:
Kafka‑connectors consumption rate and lag (Prometheus + Grafana).
Flink job latency, restarts, checkpoint failures, and health checks.
StarRocks server memory, disk I/O, CPU usage, and cluster status.
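As an illustration of the first check, consumer lag per partition is the gap between the latest offset in the topic and the offset the consumer group has committed. The sketch below computes it from two plain dictionaries and flags partitions above a threshold. The offsets, the threshold, and the function names are made‑up assumptions; in practice the numbers would come from the Prometheus metrics mentioned above, not from hand‑built dictionaries.

```python
def partition_lag(end_offsets, committed_offsets):
    """Lag per partition: latest offset minus committed offset, floored at 0."""
    lag = {}
    for partition, end in end_offsets.items():
        # Assumption: a partition with no committed offset counts from 0.
        committed = committed_offsets.get(partition, 0)
        lag[partition] = max(end - committed, 0)
    return lag


def partitions_over_threshold(lag, threshold):
    """Return the sorted partition ids whose lag exceeds the threshold."""
    return sorted(p for p, value in lag.items() if value > threshold)


# Hypothetical offsets for a three-partition topic.
end_offsets = {0: 1_000, 1: 5_000, 2: 300}
committed_offsets = {0: 990, 1: 1_000}
lag = partition_lag(end_offsets, committed_offsets)
alerts = partitions_over_threshold(lag, threshold=500)
```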
Summary and Future Plans
With the Flink‑StarRocks real‑time framework, data freshness improved from hourly to second‑level, and the pipeline processes over 100k records per second with >95% accuracy. Future work will explore StarRocks external tables for queries across heterogeneous data sources and continue community engagement to further accelerate the ad data pipeline.
dbaplus Community
