Real-Time Data Architecture, Evolution, and Applications at an Online School
The article details the six‑layer big‑data architecture of an online school, chronicles its migration from Storm to Spark Streaming and finally to Flink, and showcases concrete real‑time applications such as gateway monitoring, user‑profile tagging, renewal reporting, and advertising analysis, while outlining future development directions.
Real‑time data analysis is a core component of the online school’s data platform; over the years the team has adopted Storm, Spark Streaming, and currently Apache Flink as the primary streaming frameworks to support scenarios such as log analysis, trace monitoring, system alerts, business KPI tracking, advertising ROI, and live‑stream analytics.
The platform is organized into six layers: data foundation, data collection, data transmission, storage & computation, analysis & visualization, and data application. The foundation layer aggregates core sources (server metrics, client logs, service logs, component logs, system logs). The collection layer uses Maxwell for relational databases, Filebeat for text data, and open‑falcon/metricbeat for server metrics.
Data transmission relies on Kafka for log streams and RabbitMQ for business‑critical messages, with a Kafka cluster of five groups (~30 nodes) and a single RabbitMQ cluster. The storage & computation layer spans raw, intermediate, and result data and employs the Storm, Spark, and Flink stacks; real‑time computation is gradually shifting from Storm to Flink.
During the Storm era, the school built a gateway‑monitoring pipeline where nginx logs are sent as JSON to Kafka, processed by Storm, and the results written to MySQL and Redis for front‑end dashboards. The Storm topology consists of four bolt stages: filtering/cleaning, core calculation, window aggregation, and data persistence.
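The four bolt stages can be pictured as plain functions chained together. The sketch below is illustrative, not the team's actual topology: the log fields (`host`, `status`, `request_time`, `timestamp`), the 60‑second tumbling window, and the dict standing in for MySQL/Redis are all assumptions.

```python
import json
from collections import defaultdict

def clean(raw_line):
    """Bolt 1: parse the JSON nginx log line and drop malformed records."""
    try:
        rec = json.loads(raw_line)
        return {"host": rec["host"], "status": int(rec["status"]),
                "rt": float(rec["request_time"]), "ts": int(rec["timestamp"])}
    except (ValueError, KeyError):
        return None

def compute(rec):
    """Bolt 2: derive per-record metrics, e.g. an error flag."""
    rec["is_error"] = rec["status"] >= 500
    return rec

def aggregate(records, window_s=60):
    """Bolt 3: tumbling-window aggregation keyed by (window start, host)."""
    windows = defaultdict(lambda: {"pv": 0, "errors": 0, "rt_sum": 0.0})
    for r in records:
        key = (r["ts"] // window_s * window_s, r["host"])
        w = windows[key]
        w["pv"] += 1
        w["errors"] += r["is_error"]
        w["rt_sum"] += r["rt"]
    return windows

def persist(windows, store):
    """Bolt 4: write aggregates to a MySQL/Redis-like store (a dict here)."""
    for (win, host), w in windows.items():
        store[f"{host}:{win}"] = {"pv": w["pv"], "errors": w["errors"],
                                  "avg_rt": w["rt_sum"] / w["pv"]}

lines = [
    '{"host": "api", "status": 200, "request_time": 0.12, "timestamp": 1000}',
    '{"host": "api", "status": 502, "request_time": 1.50, "timestamp": 1010}',
    'not-json',
]
store = {}
records = [compute(r) for r in map(clean, lines) if r]
persist(aggregate(records), store)
print(store["api:960"])  # both timestamps fall in the window starting at 960
```

In the real topology each stage runs as a parallel bolt and the window is driven by tick tuples rather than a batch pass, but the data flow between stages is the same.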
For user‑profile tagging, Spark Streaming is used alongside offline Spark jobs. The pipeline consumes multiple Kafka topics, unions them into a single stream (each micro‑batch an RDD), performs cleaning, aggregation, and field enrichment, and finally writes tags to various databases or back to Kafka for downstream processing, with a rule engine (Drools) providing flexible tag logic.
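The appeal of the rule engine is that tag logic stays declarative and swappable without touching the pipeline code. A minimal sketch of that idea, with Python predicates standing in for the Drools rules (the tag names and thresholds below are invented for illustration):

```python
# Each rule is a (tag_name, predicate) pair over the merged, cleaned user
# record. New tags are added by appending a rule, not by editing the pipeline.
TAG_RULES = [
    ("active_learner", lambda u: u.get("lessons_7d", 0) >= 5),
    ("at_risk",        lambda u: u.get("days_since_login", 0) > 14),
    ("paying_user",    lambda u: u.get("orders", 0) > 0),
]

def tag_user(user):
    """Evaluate every rule; attach the names of the rules that fire."""
    return {**user, "tags": [name for name, rule in TAG_RULES if rule(user)]}

user = {"uid": 42, "lessons_7d": 6, "days_since_login": 2, "orders": 1}
print(tag_user(user)["tags"])  # ['active_learner', 'paying_user']
```

In the real system these rules live in Drools files so analysts can change thresholds without redeploying the Spark job.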
The first Flink‑based project is a real‑time renewal reporting system. It consumes multiple Kafka streams, updates state in Redis and HBase, triggers metric calculations via Flink map operators, and persists results to MySQL. Custom sink functions for Redis, MySQL, and HBase, as well as utility builders (FlinkEnvBuilder, KafkaClientBuilder), were developed to streamline development.
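The core of the renewal pipeline is a stateful update step: each incoming event mutates per‑course state and the metric is re‑derived immediately. A minimal sketch of that step, with a plain dict standing in for the Redis/HBase state and a hypothetical event shape and renewal‑rate formula:

```python
# course_id -> {"eligible": renewals possible, "renewed": renewals completed}
# (in production this state lives in Redis/HBase, not process memory)
state = {}

def on_event(event):
    """Flink-map-style handler: update state, then emit the fresh metric row."""
    s = state.setdefault(event["course_id"], {"eligible": 0, "renewed": 0})
    if event["type"] == "eligible":
        s["eligible"] += 1
    elif event["type"] == "renewal":
        s["renewed"] += 1
    rate = s["renewed"] / s["eligible"] if s["eligible"] else 0.0
    # A custom sink function would persist this row to MySQL.
    return {"course_id": event["course_id"], "renewal_rate": rate}

events = [
    {"course_id": "math-101", "type": "eligible"},
    {"course_id": "math-101", "type": "eligible"},
    {"course_id": "math-101", "type": "renewal"},
]
for e in events:
    row = on_event(e)
print(row)  # after the last event the rate is 1 renewal / 2 eligible = 0.5
```

In Flink this would be a keyed map operator with the state access and the MySQL write handled by the custom sinks mentioned above.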
Another Flink application powers advertising‑placement funnel analysis. It ingests traffic, registration, login, and order data from several Kafka clusters, computes PV/UV, registration counts, login frequencies, and order volumes, stores intermediate results in Pika (a Redis‑compatible store), and pushes aggregated metrics to DingTalk for rapid operational decisions.
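The funnel metrics reduce to two operations per stage: PV is a simple counter, while UV needs per‑stage deduplication of user IDs. A sketch of that computation, with in‑memory dicts and sets standing in for Pika and with invented stage names and event shape:

```python
STAGES = ["traffic", "register", "login", "order"]

def funnel(events):
    """Count PV (every hit) and UV (distinct users) for each funnel stage."""
    pv = {s: 0 for s in STAGES}
    uv = {s: set() for s in STAGES}
    for e in events:
        pv[e["stage"]] += 1
        uv[e["stage"]].add(e["uid"])
    return {s: {"pv": pv[s], "uv": len(uv[s])} for s in STAGES}

events = [
    {"stage": "traffic", "uid": "a"}, {"stage": "traffic", "uid": "a"},
    {"stage": "traffic", "uid": "b"}, {"stage": "register", "uid": "a"},
    {"stage": "login", "uid": "a"},   {"stage": "order", "uid": "a"},
]
print(funnel(events)["traffic"])  # pv counts every hit; uv deduplicates users
```

At production scale the per‑stage user sets would be kept in Pika (e.g. as set structures) so UV survives restarts and fits memory, with the aggregated rows pushed on to DingTalk.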
Looking ahead, the team plans to deepen business integration, standardize platform processes, introduce SQL‑based self‑service analytics, and build a real‑time data warehouse to further reduce latency and improve development efficiency.
Xueersi Online School Tech Team
The Xueersi Online School Tech Team is dedicated to innovating and promoting internet education technology.