Architecture and Technical Implementation of the WMDA Data Analytics Platform
The article details WMDA's end-to-end data analytics architecture, covering codeless ("zero-event") data collection, real-time and offline processing pipelines built on Spark Streaming, Druid, Hadoop, Kettle, and TaskServer, and explains how these components work together to deliver comprehensive user behavior analysis.
WMDA is a self-developed user behavior analytics product that supports both codeless (auto-track) and manually instrumented event collection across PC, mobile web, native apps, and mini-programs via SDKs.
The platform’s architecture follows a standard data analysis model divided into five layers: data collection, data transmission, data modeling/storage, data statistics/analysis, and data visualization.
Data collection uses SDKs for various front‑ends; transmission enriches, filters, and formats data before sending it to Flume and Kafka for real‑time and batch buses.
In the modeling/storage layer, ETL processes clean and format data, storing it in HDFS while streaming data is fed to Spark Streaming for real‑time analysis.
Statistical analysis combines Spark Streaming for real‑time insights and an offline pipeline where Hive, Kettle, and a suite of sub‑systems (OLAP, Bitmap, clustering, intelligent path) compute dashboards, funnels, retention, and segmentation.
The real‑time analysis system employs Spark Streaming with a 5‑second batch interval and Druid for fast OLAP queries, broadcasting configuration as variables and ingesting results into Druid via Kafka.
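The micro-batch pattern described above can be sketched in plain Java: events are grouped into 5-second windows, filtered against a read-only "broadcast" configuration, and counted per event type. This is a self-contained illustration of the windowing logic, not WMDA's actual Spark Streaming job; the class, method, and field names are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of 5-second micro-batching with a read-only
// "broadcast" config, mirroring the Spark Streaming pattern described
// in the article. All names here are illustrative.
public class MicroBatchSketch {

    /** A collected behavior event: (epoch millis, event type). */
    public record Event(long ts, String type) {}

    static final long BATCH_INTERVAL_MS = 5_000; // 5-second batch interval

    /**
     * Assigns each event to its 5-second window, drops events whose type
     * is absent from the broadcast config, and counts per (window, type).
     */
    public static Map<Long, Map<String, Long>> process(List<Event> events,
                                                       Set<String> broadcastConfig) {
        return events.stream()
            .filter(e -> broadcastConfig.contains(e.type()))
            .collect(Collectors.groupingBy(
                e -> e.ts() / BATCH_INTERVAL_MS,            // window id
                Collectors.groupingBy(Event::type, Collectors.counting())));
    }
}
```

In the real pipeline the per-window aggregates would then be published to Kafka and ingested by Druid, as the article describes.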
The offline system leverages HDFS as a data lake, Hive for core ETL, Spark for event matching and cleaning, and a cluster of sub‑systems (OLAP, Bitmap, clustering, intelligent path) orchestrated by Kettle and executed by TaskServer.
Kettle, a Java‑based open‑source ETL tool, provides visual job and transformation design, supports scheduling, and integrates with TaskServer through JobEntry and JobEntryDialog interfaces for task execution and parameter handling.
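The handoff from a Kettle job entry to TaskServer might look like the sketch below. A real plugin would extend Kettle's `JobEntryBase` and implement its `JobEntryInterface`; to keep this example self-contained, the `TaskServerClient` interface, its `submit`/`isFinished` methods, and the parameter names are all assumptions standing in for the real RPC surface.

```java
import java.util.Map;

// Hedged sketch of a Kettle-style job entry delegating work to TaskServer.
// TaskServerClient and its methods are hypothetical stand-ins for the
// real TaskServer client API.
public class TaskServerJobEntrySketch {

    /** Minimal stand-in for a TaskServer RPC client. */
    public interface TaskServerClient {
        String submit(String taskType, Map<String, String> params); // returns a task id
        boolean isFinished(String taskId);
    }

    private final TaskServerClient client;

    public TaskServerJobEntrySketch(TaskServerClient client) {
        this.client = client;
    }

    /**
     * Mirrors the shape of JobEntry execution: forward the entry's
     * parameters to TaskServer, then block until the remote task finishes.
     */
    public boolean execute(String taskType, Map<String, String> params)
            throws InterruptedException {
        String taskId = client.submit(taskType, params);
        while (!client.isFinished(taskId)) {
            Thread.sleep(100); // poll TaskServer for completion
        }
        return true;
    }
}
```

This blocking-poll design lets Kettle's job graph sequence downstream entries only after the distributed task completes.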
TaskServer is a high‑availability, horizontally scalable distributed execution framework built on a Master‑Slave pattern, comprising JobTracker, TaskTracker, TaskQueue, and Zookeeper for coordination, ensuring reliable offline task processing.
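The Master-Slave dispatch loop can be illustrated as follows: a JobTracker-like master drains a shared TaskQueue and assigns tasks round-robin to TaskTracker-like workers. Zookeeper's role (leader election, worker liveness) is out of scope here, and all class names are hypothetical stand-ins for TaskServer's components.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative Master-Slave sketch: a JobTracker-like master pulls tasks
// from a TaskQueue and dispatches them round-robin to TaskTracker-like
// workers. Names are hypothetical; coordination via Zookeeper is omitted.
public class MasterSlaveSketch {

    /** A worker (TaskTracker stand-in) that records the tasks it runs. */
    public static class Worker {
        final List<String> executed = new CopyOnWriteArrayList<>();
        void run(String task) { executed.add(task); }
    }

    /** Master (JobTracker stand-in): drain the queue, round-robin dispatch. */
    public static void dispatch(Queue<String> taskQueue, List<Worker> workers) {
        int i = 0;
        String task;
        while ((task = taskQueue.poll()) != null) {
            workers.get(i % workers.size()).run(task);
            i++;
        }
    }
}
```

Because dispatch state lives in the shared queue rather than in any one worker, a failed TaskTracker's unstarted tasks remain available for reassignment, which is the property the article's high-availability claim rests on.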
Druid serves as the OLAP engine, with Real‑Time Nodes handling streaming data, Historical Nodes storing segment data, Coordinator Nodes managing segment metadata, and Broker Nodes acting as query gateways, all integrated with HDFS as deep storage.
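A query against such a deployment is POSTed to a Broker node, which fans it out to the Real-Time and Historical nodes holding the relevant segments. The fragment below is a standard Druid native timeseries query; the datasource name, field names, and interval are assumptions for illustration, not WMDA's actual schema.

```json
{
  "queryType": "timeseries",
  "dataSource": "wmda_events",
  "granularity": "minute",
  "intervals": ["2020-01-01T00:00:00/2020-01-02T00:00:00"],
  "aggregations": [
    { "type": "longSum", "name": "pv", "fieldName": "count" },
    { "type": "hyperUnique", "name": "uv", "fieldName": "user_id" }
  ]
}
```

The Broker merges the partial results from each node before returning them, which is why it can serve as a single query gateway over both streaming and historical data.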
Bitmap computation accelerates funnel, retention, and segmentation analysis by generating user‑level bitmaps via MapReduce and querying them through a Bitmap engine.
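The core of the bitmap technique can be shown with `java.util.BitSet`: each event (or day) gets a bitmap whose i-th bit marks whether user i performed it, so funnel and retention reduce to bitwise AND plus a population count instead of a scan over raw logs. Production engines typically use compressed bitmaps (e.g. RoaringBitmap); plain `BitSet` is used here only to keep the sketch self-contained.

```java
import java.util.BitSet;

// Sketch of bitmap-accelerated funnel and retention analysis: one bit per
// user, one bitmap per event/day; set intersection is a bitwise AND.
public class BitmapSketch {

    /** Users who completed every funnel step: AND of the step bitmaps. */
    public static long funnelCount(BitSet... steps) {
        BitSet acc = (BitSet) steps[0].clone();
        for (int i = 1; i < steps.length; i++) {
            acc.and(steps[i]); // keep only users present in every step
        }
        return acc.cardinality();
    }

    /** Day-N retention: fraction of day-0 actives still active on day N. */
    public static double retention(BitSet day0, BitSet dayN) {
        BitSet both = (BitSet) day0.clone();
        both.and(dayN);
        return (double) both.cardinality() / day0.cardinality();
    }
}
```

In the architecture above, MapReduce jobs materialize these per-event bitmaps offline, and the Bitmap engine evaluates queries like these at request time.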
The article concludes that WMDA’s architecture demonstrates a comprehensive big‑data solution, but optimal designs should be tailored to specific business needs and evolve with growth.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.