
Real-Time Data Architecture, Evolution, and Applications at an Online School

The article details the six‑layer big‑data architecture of an online school, chronicles its migration from Storm to Spark Streaming and finally to Flink, and showcases concrete real‑time applications such as gateway monitoring, user‑profile tagging, renewal reporting, and advertising analysis, while outlining future development directions.

Xueersi Online School Tech Team

Real‑time data analysis is a core component of the online school’s data platform. Over the years the team has moved from Storm to Spark Streaming and now to Apache Flink as its primary streaming framework, supporting scenarios such as log analysis, trace monitoring, system alerts, business KPI tracking, advertising ROI, and live‑stream analytics.

The platform is organized into six layers: data foundation, data collection, data transmission, storage & computation, analysis & visualization, and data application. The foundation layer aggregates core sources (server metrics, client logs, service logs, component logs, system logs). The collection layer uses Maxwell for relational databases, Filebeat for text data, and open‑falcon/metricbeat for server metrics.

Data transmission relies on Kafka for log streams and RabbitMQ for business‑critical messages, with five Kafka clusters (roughly 30 nodes in total) and a single RabbitMQ cluster. Storage and computation span raw, intermediate, and result data, employing Storm, Spark, and Flink stacks; real‑time computation is gradually shifting from Storm to Flink.

During the Storm era, the school built a gateway‑monitoring pipeline where nginx logs are sent as JSON to Kafka, processed by Storm, and the results written to MySQL and Redis for front‑end dashboards. The Storm topology consists of four bolt stages: filtering/cleaning, core calculation, window aggregation, and data persistence.
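The four bolt stages can be illustrated with a minimal pure-Python sketch; the function names, log fields, and thresholds below are hypothetical stand-ins for the real Storm bolts, and a plain dict stands in for the MySQL/Redis sink:

```python
from collections import defaultdict

def filter_bolt(record):
    """Stage 1: drop malformed records, keep only the fields we need."""
    if "uri" not in record or "status" not in record:
        return None
    return {"uri": record["uri"], "status": int(record["status"]),
            "rt_ms": float(record.get("request_time", 0)) * 1000}

def compute_bolt(record):
    """Stage 2: derive per-request metrics (here: a server-error flag)."""
    record["is_error"] = record["status"] >= 500
    return record

def window_bolt(records):
    """Stage 3: aggregate one window of records per URI."""
    agg = defaultdict(lambda: {"pv": 0, "errors": 0, "rt_sum": 0.0})
    for r in records:
        a = agg[r["uri"]]
        a["pv"] += 1
        a["errors"] += r["is_error"]
        a["rt_sum"] += r["rt_ms"]
    return {uri: {"pv": a["pv"], "error_rate": a["errors"] / a["pv"],
                  "avg_rt_ms": a["rt_sum"] / a["pv"]}
            for uri, a in agg.items()}

def persist_bolt(aggregates, sink):
    """Stage 4: write window results to the dashboard store."""
    sink.update(aggregates)

logs = [{"uri": "/api/login", "status": "200", "request_time": "0.120"},
        {"uri": "/api/login", "status": "502", "request_time": "1.500"},
        {"bad": "record"}]
cleaned = [compute_bolt(r) for r in (filter_bolt(x) for x in logs) if r]
sink = {}
persist_bolt(window_bolt(cleaned), sink)
print(sink["/api/login"])
```

In the real topology each stage runs as a separate bolt with its own parallelism, and the window is managed by Storm rather than by collecting records in memory.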

For user‑profile tagging, Spark Streaming is used alongside offline Spark jobs. The pipeline consumes multiple Kafka topics, merges them into an RDD, performs cleaning, aggregation, field enrichment, and finally writes tags to various databases or back to Kafka for downstream processing, supporting a rule‑engine (Drools) for flexible tag logic.
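A simplified sketch of the merge-then-tag step, with Python predicates standing in for Drools rules; the tag names, fields, and thresholds are illustrative, not the production rule set:

```python
# Each rule is a (tag, predicate) pair over the merged user profile.
TAG_RULES = [
    ("active_learner", lambda u: u.get("lessons_7d", 0) >= 5),
    ("at_risk",        lambda u: u.get("days_since_login", 0) > 14),
    ("high_value",     lambda u: u.get("total_paid", 0) >= 1000),
]

def merge_events(events):
    """Merge per-topic events for one user into a single profile dict."""
    profile = {}
    for e in events:
        profile.update(e)
    return profile

def tag_user(profile):
    """Apply every rule; a user may carry multiple tags."""
    return [name for name, rule in TAG_RULES if rule(profile)]

events = [{"user_id": 42, "lessons_7d": 6},
          {"user_id": 42, "total_paid": 1500},
          {"user_id": 42, "days_since_login": 2}]
profile = merge_events(events)
tags = tag_user(profile)
print(tags)
```

Keeping the rules as data rather than code is what lets a rule engine like Drools change tag logic without redeploying the streaming job.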

The first Flink‑based project is a real‑time renewal reporting system. It consumes multiple Kafka streams, updates state in Redis and HBase, triggers metric calculations via Flink map operators, and persists results to MySQL. Custom sink functions for Redis, MySQL, and HBase, as well as utility builders (FlinkEnvBuilder, KafkaClientBuilder), were developed to streamline development.
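The state-update-then-recompute pattern can be sketched in a few lines of Python, with in-memory dicts standing in for the Redis/HBase state stores and the MySQL result sink; the event schema and metric are hypothetical:

```python
state = {}    # user_id -> latest order event (stands in for Redis/HBase state)
results = {}  # class_id -> renewal metrics (stands in for the MySQL sink)

def update_state(event):
    """Keyed-state step: remember each user's most recent order."""
    state[event["user_id"]] = event

def compute_metrics(class_id):
    """Map-side step: recompute the renewal rate for one class."""
    orders = [e for e in state.values() if e["class_id"] == class_id]
    renewed = sum(1 for e in orders if e["status"] == "renewed")
    results[class_id] = {"total": len(orders),
                         "renewed": renewed,
                         "rate": renewed / len(orders) if orders else 0.0}

for event in [
    {"user_id": 1, "class_id": "math-101", "status": "renewed"},
    {"user_id": 2, "class_id": "math-101", "status": "expired"},
    {"user_id": 1, "class_id": "math-101", "status": "renewed"},  # same key, overwrites
]:
    update_state(event)
    compute_metrics(event["class_id"])

print(results["math-101"])
```

Keying state by user id gives the dedup-on-update behavior for free; in Flink the same effect comes from keyed state, with the custom sink functions handling the writes to Redis, MySQL, and HBase.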

Another Flink application powers advertising‑placement funnel analysis. It ingests traffic, registration, login, and order data from several Kafka clusters, computes PV/UV, registration counts, login frequencies, and order volumes, stores intermediate results in Pika (a Redis‑compatible store), and pushes aggregated metrics to DingTalk for rapid operational decisions.
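The core PV/UV counting is simple to sketch: PV is an event count, UV a distinct-user count per funnel stage. Here a dict stands in for Pika, and the stage names and events are illustrative:

```python
FUNNEL = ["traffic", "registration", "login", "order"]
pv = {stage: 0 for stage in FUNNEL}        # page views: every event counts
uv = {stage: set() for stage in FUNNEL}    # unique visitors: distinct user ids

def ingest(event):
    stage, user = event["stage"], event["user_id"]
    pv[stage] += 1
    uv[stage].add(user)

for e in [{"stage": "traffic", "user_id": "u1"},
          {"stage": "traffic", "user_id": "u1"},   # repeat visit: PV +1, UV unchanged
          {"stage": "traffic", "user_id": "u2"},
          {"stage": "registration", "user_id": "u1"},
          {"stage": "order", "user_id": "u1"}]:
    ingest(e)

report = {s: {"pv": pv[s], "uv": len(uv[s])} for s in FUNNEL}
print(report)
```

At production scale the exact sets would not fit in memory per key, which is one reason to push the distinct-user state out to a store like Pika (or to use approximate structures such as HyperLogLog) before pushing the aggregated funnel to DingTalk.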

Looking ahead, the team plans to deepen business integration, standardize platform processes, introduce SQL‑based self‑service analytics, and build a real‑time data warehouse to further reduce latency and improve development efficiency.

Tags: analytics, data pipeline, Flink, real-time streaming, big data architecture, Spark Streaming, Storm
Written by

Xueersi Online School Tech Team

The Xueersi Online School Tech Team, dedicated to innovating and promoting internet education technology.