Architecture and Technical Implementation of the WMDA Data Analytics Platform
The article details WMDA's end-to-end data analytics architecture, covering codeless ("zero-event") data collection, real-time and offline processing pipelines built on Spark Streaming, Druid, Hadoop, Kettle, and TaskServer, and explains how these components work together to deliver comprehensive user behavior analysis.
WMDA is a self-developed user behavior analytics product that supports both codeless (auto-track) and manually instrumented event collection across PC, mobile web, native apps, and mini-programs via SDKs.
The platform’s architecture follows a standard data analysis model divided into five layers: data collection, data transmission, data modeling/storage, data statistics/analysis, and data visualization.
Data collection uses SDKs for various front‑ends; transmission enriches, filters, and formats data before sending it to Flume and Kafka for real‑time and batch buses.
In the modeling/storage layer, ETL processes clean and format data, storing it in HDFS while streaming data is fed to Spark Streaming for real‑time analysis.
Statistical analysis combines Spark Streaming for real‑time insights and an offline pipeline where Hive, Kettle, and a suite of sub‑systems (OLAP, Bitmap, clustering, intelligent path) compute dashboards, funnels, retention, and segmentation.
The real‑time analysis system employs Spark Streaming with a 5‑second batch interval and Druid for fast OLAP queries, broadcasting configuration as variables and ingesting results into Druid via Kafka.
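The micro-batch pattern described above can be sketched in plain Java: events are grouped into 5-second windows, filtered against a read-only "broadcast" configuration, and counted per event type. This is a self-contained illustration of the windowing logic, not WMDA's actual Spark Streaming job; the class, method, and field names are all hypothetical.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch of 5-second micro-batching with a read-only
// "broadcast" config, mirroring the Spark Streaming pattern described
// in the article. All names here are illustrative.
public class MicroBatchSketch {

    /** A collected behavior event: (epoch millis, event type). */
    public record Event(long ts, String type) {}

    static final long BATCH_INTERVAL_MS = 5_000; // 5-second batch interval

    /**
     * Assigns each event to its 5-second window, drops events whose type
     * is absent from the broadcast config, and counts per (window, type).
     */
    public static Map<Long, Map<String, Long>> process(List<Event> events,
                                                       Set<String> broadcastConfig) {
        return events.stream()
            .filter(e -> broadcastConfig.contains(e.type()))
            .collect(Collectors.groupingBy(
                e -> e.ts() / BATCH_INTERVAL_MS,            // window id
                Collectors.groupingBy(Event::type, Collectors.counting())));
    }
}
```

In the real pipeline the per-window aggregates would then be published to Kafka and ingested by Druid, as the article describes.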
The offline system leverages HDFS as a data lake, Hive for core ETL, Spark for event matching and cleaning, and a cluster of sub‑systems (OLAP, Bitmap, clustering, intelligent path) orchestrated by Kettle and executed by TaskServer.
Kettle, a Java‑based open‑source ETL tool, provides visual job and transformation design, supports scheduling, and integrates with TaskServer through JobEntry and JobEntryDialog interfaces for task execution and parameter handling.
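The handoff from a Kettle job entry to TaskServer might look like the sketch below. A real plugin would extend Kettle's `JobEntryBase` and implement its `JobEntryInterface`; to keep this example self-contained, the `TaskServerClient` interface, its `submit`/`isFinished` methods, and the parameter names are all assumptions standing in for the real RPC surface.

```java
import java.util.Map;

// Hedged sketch of a Kettle-style job entry delegating work to TaskServer.
// TaskServerClient and its methods are hypothetical stand-ins for the
// real TaskServer client API.
public class TaskServerJobEntrySketch {

    /** Minimal stand-in for a TaskServer RPC client. */
    public interface TaskServerClient {
        String submit(String taskType, Map<String, String> params); // returns a task id
        boolean isFinished(String taskId);
    }

    private final TaskServerClient client;

    public TaskServerJobEntrySketch(TaskServerClient client) {
        this.client = client;
    }

    /**
     * Mirrors the shape of JobEntry execution: forward the entry's
     * parameters to TaskServer, then block until the remote task finishes.
     */
    public boolean execute(String taskType, Map<String, String> params)
            throws InterruptedException {
        String taskId = client.submit(taskType, params);
        while (!client.isFinished(taskId)) {
            Thread.sleep(100); // poll TaskServer for completion
        }
        return true;
    }
}
```

This blocking-poll design lets Kettle's job graph sequence downstream entries only after the distributed task completes.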
TaskServer is a high‑availability, horizontally scalable distributed execution framework built on a Master‑Slave pattern, comprising JobTracker, TaskTracker, TaskQueue, and Zookeeper for coordination, ensuring reliable offline task processing.
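The Master-Slave dispatch loop can be illustrated as follows: a JobTracker-like master drains a shared TaskQueue and assigns tasks round-robin to TaskTracker-like workers. Zookeeper's role (leader election, worker liveness) is out of scope here, and all class names are hypothetical stand-ins for TaskServer's components.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.CopyOnWriteArrayList;

// Illustrative Master-Slave sketch: a JobTracker-like master pulls tasks
// from a TaskQueue and dispatches them round-robin to TaskTracker-like
// workers. Names are hypothetical; coordination via Zookeeper is omitted.
public class MasterSlaveSketch {

    /** A worker (TaskTracker stand-in) that records the tasks it runs. */
    public static class Worker {
        final List<String> executed = new CopyOnWriteArrayList<>();
        void run(String task) { executed.add(task); }
    }

    /** Master (JobTracker stand-in): drain the queue, round-robin dispatch. */
    public static void dispatch(Queue<String> taskQueue, List<Worker> workers) {
        int i = 0;
        String task;
        while ((task = taskQueue.poll()) != null) {
            workers.get(i % workers.size()).run(task);
            i++;
        }
    }
}
```

Because dispatch state lives in the shared queue rather than in any one worker, a failed TaskTracker's unstarted tasks remain available for reassignment, which is the property the article's high-availability claim rests on.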
Druid serves as the OLAP engine, with Real‑Time Nodes handling streaming data, Historical Nodes storing segment data, Coordinator Nodes managing segment metadata, and Broker Nodes acting as query gateways, all integrated with HDFS as deep storage.
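A query against such a deployment is POSTed to a Broker node, which fans it out to the Real-Time and Historical nodes holding the relevant segments. The fragment below is a standard Druid native timeseries query; the datasource name, field names, and interval are assumptions for illustration, not WMDA's actual schema.

```json
{
  "queryType": "timeseries",
  "dataSource": "wmda_events",
  "granularity": "minute",
  "intervals": ["2020-01-01T00:00:00/2020-01-02T00:00:00"],
  "aggregations": [
    { "type": "longSum", "name": "pv", "fieldName": "count" },
    { "type": "hyperUnique", "name": "uv", "fieldName": "user_id" }
  ]
}
```

The Broker merges the partial results from each node before returning them, which is why it can serve as a single query gateway over both streaming and historical data.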
Bitmap computation accelerates funnel, retention, and segmentation analysis by generating user‑level bitmaps via MapReduce and querying them through a Bitmap engine.
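The core of the bitmap technique can be shown with `java.util.BitSet`: each event (or day) gets a bitmap whose i-th bit marks whether user i performed it, so funnel and retention reduce to bitwise AND plus a population count instead of a scan over raw logs. Production engines typically use compressed bitmaps (e.g. RoaringBitmap); plain `BitSet` is used here only to keep the sketch self-contained.

```java
import java.util.BitSet;

// Sketch of bitmap-accelerated funnel and retention analysis: one bit per
// user, one bitmap per event/day; set intersection is a bitwise AND.
public class BitmapSketch {

    /** Users who completed every funnel step: AND of the step bitmaps. */
    public static long funnelCount(BitSet... steps) {
        BitSet acc = (BitSet) steps[0].clone();
        for (int i = 1; i < steps.length; i++) {
            acc.and(steps[i]); // keep only users present in every step
        }
        return acc.cardinality();
    }

    /** Day-N retention: fraction of day-0 actives still active on day N. */
    public static double retention(BitSet day0, BitSet dayN) {
        BitSet both = (BitSet) day0.clone();
        both.and(dayN);
        return (double) both.cardinality() / day0.cardinality();
    }
}
```

In the architecture above, MapReduce jobs materialize these per-event bitmaps offline, and the Bitmap engine evaluates queries like these at request time.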
The article concludes that WMDA’s architecture demonstrates a comprehensive big‑data solution, but optimal designs should be tailored to specific business needs and evolve with growth.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.