
How Mogu’s Advertising Platform Built a Real‑Time Data Pipeline with Storm, Flink, and Kylin

This article explains how Mogu’s advertising system designs and evolves a real‑time data pipeline—covering merchant and operation needs, data collection, cleaning, processing with Storm, Flink, and Kylin, and service guarantees—to enable high‑quality, low‑latency analytics for advertisers and the platform.

21CTO

Aug 20, 2019


Merchant Side Requirements

Merchants need highly accurate and consistent data.

Provide rich tools for merchants to select high‑performing products for placement and decision‑making.

Advertising is a real‑time bidding system; merchants require ultra‑low latency data for commercial decisions.

Operation Side Requirements

Rapidly satisfy diverse business‑level data analysis needs.

Offer rich tools for operators to understand site‑wide and merchant data from multiple dimensions, enhancing data‑driven operations.

Data System Overview

The advertising data system focuses on four core tasks: data collection, data cleaning/layering, data processing, and data service.

1. Data Collection

Advertising involves transaction, user, and merchant behaviors generated by various systems with heterogeneous storage and schemas. The system must integrate these data sources to serve both merchants and the platform.

Key data categories:

System data: Collect change logs from business systems (e.g., binlog/Kafka) and normalize them for merchant behavior analysis.

Metric data: Capture clicks, impressions, and other user actions via a unified SDK; also ingest external data (e.g., add‑to‑cart, transactions) through Kafka and Corgi.

Monitoring data: Unified scene‑based reporting across the site, with Sentry‑based traffic monitoring and real‑time alerts.

Performance data: Gather third‑party call performance metrics (KV, Elasticsearch, HBase, etc.) using a shared component library.
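
One way to picture the collection step is as a set of adapters that map each source's records onto a single shared envelope. The sketch below is illustrative only: the `UnifiedEvent` fields, the binlog-style record layout (`op`, `ts`, `row`), and the SDK payload layout (`action`, `adId`, `clientTimeMs`) are all assumptions made for the example, not Mogu's actual schema.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class UnifiedEvent:
    """Hypothetical common envelope shared by all collected sources."""
    source: str               # e.g. "binlog" or "sdk"
    event_type: str           # e.g. "update", "click", "impression"
    entity_id: str            # ad, product, or merchant identifier
    ts_ms: int                # event time in epoch milliseconds
    payload: Dict[str, Any]   # remaining source-specific fields


def from_binlog(record: Dict[str, Any]) -> UnifiedEvent:
    """Map an assumed binlog-style change record onto the envelope."""
    return UnifiedEvent(
        source="binlog",
        event_type=record["op"].lower(),
        entity_id=str(record["row"]["id"]),
        ts_ms=int(record["ts"]) * 1000,  # assumed to arrive in seconds
        payload=dict(record["row"]),
    )


def from_sdk(record: Dict[str, Any]) -> UnifiedEvent:
    """Map an assumed SDK tracking record onto the envelope."""
    reserved = ("action", "adId", "clientTimeMs")
    return UnifiedEvent(
        source="sdk",
        event_type=record["action"],
        entity_id=str(record["adId"]),
        ts_ms=int(record["clientTimeMs"]),
        payload={k: v for k, v in record.items() if k not in reserved},
    )


change = from_binlog(
    {"op": "UPDATE", "ts": 1566259200, "row": {"id": 42, "budget": 100}}
)
click = from_sdk(
    {"action": "click", "adId": 7, "clientTimeMs": 1566259200123, "page": "home"}
)
```

Downstream stages then depend only on the envelope, so adding a new source means adding one adapter rather than touching every consumer.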

2. Data Cleaning / Layering

During collection, data from numerous external systems have varied structures and integrity constraints. The challenge is to ensure semantic consistency across processing stages and to support business‑driven data models.

Two main problems were identified:

Early‑stage systems parsed and structured raw data per business need, leading to tight coupling and low reusability when granularity requirements changed.

Original pipelines could not accommodate new dimensions such as product add‑to‑cart, likes, or favorites, requiring a more flexible architecture.
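
A layered design addresses both problems by parsing raw data once into a shared cleaned layer and letting each business model choose its own dimensions on top of it. The following is a minimal sketch under stated assumptions: the `key=value|key=value` log format and the `clean` and `aggregate` helpers are hypothetical and only illustrate the decoupling.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

# Hypothetical cleaned-layer record: parsed once, reused by every model.
CleanEvent = Dict[str, str]


def clean(raw_line: str) -> CleanEvent:
    """Parse one assumed 'key=value|key=value' log line into the shared layer."""
    fields = (f for f in raw_line.strip().split("|") if f)
    return dict(f.split("=", 1) for f in fields)


def aggregate(events: Iterable[CleanEvent], dims: List[str]) -> Counter:
    """Count events per combination of the requested dimensions."""
    counts: Counter = Counter()
    for ev in events:
        key: Tuple[str, ...] = tuple(ev.get(d, "unknown") for d in dims)
        counts[key] += 1
    return counts


raw = [
    "ad=1|action=click|item=9",
    "ad=1|action=add_to_cart|item=9",
    "ad=2|action=click|item=5",
]
events = [clean(line) for line in raw]

# Two models at different granularity, built from the same cleaned layer.
by_ad = aggregate(events, ["ad"])
by_ad_action = aggregate(events, ["ad", "action"])
```

Because the dimensions are passed in as data, a new action such as add-to-cart or a finer granularity requires no change to the parsing code.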

3. Data Processing

The processing stack evolved through four phases:

Phase 1: Simple Java jobs and Hive-based T+1 (next-day) batch jobs served CPC/CPS ads; they suffered from unexpected restarts, out-of-memory (OOM) errors, and difficulty scaling.

Phase 2: Introduced an APP monitor agent for health checks; data volume remained manageable.

Phase 3: Added diverse ad formats and commercial logic, leading to many single‑node stream apps with inconsistent disaster‑recovery logic.

Phase 4: Storm was selected as the first stream-processing framework for its ease of use and the team's familiarity with the JStorm source code. Flink was adopted later for its SQL support, exactly-once semantics, and unified batch and stream processing.

To unify development, the "Anqila" project was created on top of Storm and Flink, providing a rich component library, unified monitoring, and data model definitions for rapid data production.

Later, Kylin was introduced to build near-real-time OLAP cubes, enabling multi-dimensional queries with second-level response times; Storm and Flink continue to produce the low-latency data for the merchant side.

4. Data Service

Integrity: Ensure every record is processed, supporting at‑least‑once semantics.

Accuracy: Apply guarantees along the full pipeline; some components provide exactly-once processing so that downstream consumers receive precise results.

Consistency: Maintain uniform field definitions across the pipeline via a shared public layer.

Timeliness: Deliver data with second-level latency for merchant decision-making; exposure (impression) data is the exception, at minute-level latency.
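
The integrity and accuracy guarantees interact: at-least-once delivery means a record can arrive more than once after a retry or replay, so a consumer that needs exact counts has to apply each record idempotently. The sketch below shows one common way to do that, deduplicating on a unique event ID. It is an illustration under assumptions, not the mechanism the article describes: the `IdempotentCounter` class and the event IDs are hypothetical, and a production version would bound the seen-ID set (for example with a TTL) and persist it together with the counts.

```python
from typing import Dict, List, Set, Tuple


class IdempotentCounter:
    """Counts clicks per ad while ignoring redelivered events."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()     # event IDs already applied
        self.clicks: Dict[str, int] = {}

    def apply(self, event_id: str, ad_id: str) -> bool:
        """Apply one event; return False if it was a duplicate delivery."""
        if event_id in self._seen:
            return False
        self._seen.add(event_id)
        self.clicks[ad_id] = self.clicks.get(ad_id, 0) + 1
        return True


# Simulated at-least-once stream: e1 and e2 are each delivered twice.
deliveries: List[Tuple[str, str]] = [
    ("e1", "ad-1"),
    ("e2", "ad-1"),
    ("e1", "ad-1"),
    ("e3", "ad-2"),
    ("e2", "ad-1"),
]

sink = IdempotentCounter()
applied = [sink.apply(event_id, ad_id) for event_id, ad_id in deliveries]
```

With deduplication in place, replaying the same deliveries leaves `sink.clicks` unchanged, which is what lets a downstream consumer treat at-least-once input as if each record had been processed once.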

Conclusion

The article outlines the evolution of Mogu’s advertising data pipeline, from early batch jobs to a sophisticated real‑time streaming architecture using Storm, Flink, and Kylin, and discusses how the system ensures data integrity, accuracy, consistency, and timeliness to support data‑driven advertising operations.
