How Zhejiang Mobile Scaled Billion‑Level Real‑Time Stream Processing with Storm
This article details Zhejiang Mobile's architecture and practical experience in building a billion‑scale real‑time stream computing platform using Storm, Kafka, Flume, and Redis, covering use cases, system design, performance bottlenecks, optimization techniques, and monitoring strategies.
Overview of Real‑Time Stream Computing at Zhejiang Mobile
In a presentation at the 2016 DAMS China Data Asset Management Summit, Kang Zuling, a big‑data architect at Zhejiang Mobile, shared the motivations, scale, and architecture of their real‑time stream processing platform, which handles up to 1200 billion records (≈50 TB) across 20 nodes.
Why Operators Need Stream Computing
Two primary use cases were highlighted:
Network monitoring: Detecting rapid changes in base‑station traffic to predict faults or capacity overloads.
Location‑based services: Providing near‑real‑time population density maps by leveraging the operator’s mandatory device‑to‑tower reporting, even when users disable mobile data.
Key Big‑Data Applications Demonstrated
Three concrete examples illustrate the platform’s capabilities:
Bidirectional fault boundary detection.
High‑speed rail network simulation to pinpoint weak coverage segments.
Virus‑SMS detection using data analysis and machine‑learning techniques.
Platform Architecture
The core architecture is a generic Storm‑based pipeline:
Data is collected at the front end, distributed via a Kafka cluster, and then split into two streams: one feeding the Storm computation layer and the other stored directly in HDFS.
Benefits of Kafka in the Pipeline
Kafka serves as a high‑throughput data‑distribution hub, supporting seamless integration with various stream‑processing frameworks.
Its low latency (near‑zero) can be achieved with as few as three to six Kafka nodes, depending on scale.
Minimal changes are required to integrate with existing Hadoop components such as Flume and HDFS.
Scale and Performance Metrics
The deployed cluster consists of 20 nodes, each with 24 CPU cores, handling roughly 1.2 trillion records (≈50 TB). This volume is comparable to Tencent’s Spark Streaming workload, confirming the architecture’s ability to support the largest data‑intensive applications.
Technical Choices and Architecture Details
Storm’s master‑worker model (Nimbus and Supervisors) relies on Zookeeper for state coordination. Data enters through Spout components, is processed by Bolt s, and emitted as tuples. The platform adopts an “at‑least‑once” guarantee, but switches to incremental updates when hardware constraints prevent the extra 30 % overhead.
Common Pitfalls and Solutions
Flume vs. sink bottleneck: By inspecting Channel error logs, the team identified upstream congestion and resolved it by persisting data to disk and parallelizing processing.
Kafka traffic trap: Monitoring revealed network‑card saturation; moving from a dual‑NIC to a four‑NIC configuration raised throughput from ~190 MB/s to 250‑300 MB/s.
Result update strategy: Incremental updates reduced potential data loss from 50 % (full overwrite) to about 5 % when a Bolt failed.
Data skew: Fine‑grained aggregation (e.g., per‑cell instead of per‑city) mitigated uneven load and prevented partial result loss.
Redis throughput limits: When update rates exceeded ~500 k per minute, the single‑threaded Redis became a bottleneck; sharding the data across multiple Redis instances solved the issue.
Latency Measurement
Two latency components are tracked: (1) the delay from Flume ingestion to Kafka entry, and (2) end‑to‑end latency from Flume input to final Storm output. The team injects special test records at the Flume edge, reads them back via Storm, and computes a statistical distribution (baseline ≈15 s, with tails up to 35 s).
Data Cleaning and Filtering
Rather than discarding malformed records outright, the pipeline tolerates minor data quality issues, allowing imperfect records to pass if they can still contribute to the final result.
Active User Counting per Cell
To avoid memory explosion when tracking per‑user IDs, the solution employs probabilistic data structures. After evaluating Bloom Filters and HyperLogLog, the team selected Bloom Filters for their lower error rate despite higher resource consumption.
Monitoring the Stream Computing System
Beyond traditional host‑level metrics, the monitoring stack focuses on data backlog detection, using custom dashboards and alerts to surface accumulation in Kafka, Storm, or Redis pipelines.
Signed-in readers can open the original source through BestHub's protected redirect.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
