Big Data 20 min read

How Zhejiang Mobile Scaled Billion‑Level Real‑Time Stream Processing with Storm

This article details Zhejiang Mobile's architecture and practical experience in building a billion‑scale real‑time stream computing platform using Storm, Kafka, Flume, and Redis, covering use cases, system design, performance bottlenecks, optimization techniques, and monitoring strategies.

dbaplus Community
dbaplus Community
dbaplus Community
How Zhejiang Mobile Scaled Billion‑Level Real‑Time Stream Processing with Storm

Overview of Real‑Time Stream Computing at Zhejiang Mobile

In a presentation at the 2016 DAMS China Data Asset Management Summit, Kang Zuling, a big‑data architect at Zhejiang Mobile, shared the motivations, scale, and architecture of their real‑time stream processing platform, which handles up to 1200 billion records (≈50 TB) across 20 nodes.

Why Operators Need Stream Computing

Two primary use cases were highlighted:

Network monitoring: Detecting rapid changes in base‑station traffic to predict faults or capacity overloads.

Location‑based services: Providing near‑real‑time population density maps by leveraging the operator’s mandatory device‑to‑tower reporting, even when users disable mobile data.

Key Big‑Data Applications Demonstrated

Three concrete examples illustrate the platform’s capabilities:

Bidirectional fault boundary detection.

High‑speed rail network simulation to pinpoint weak coverage segments.

Virus‑SMS detection using data analysis and machine‑learning techniques.

Platform Architecture

The core architecture is a generic Storm‑based pipeline:

Data is collected at the front end, distributed via a Kafka cluster, and then split into two streams: one feeding the Storm computation layer and the other stored directly in HDFS.

Benefits of Kafka in the Pipeline

Kafka serves as a high‑throughput data‑distribution hub, supporting seamless integration with various stream‑processing frameworks.

Its low latency (near‑zero) can be achieved with as few as three to six Kafka nodes, depending on scale.

Minimal changes are required to integrate with existing Hadoop components such as Flume and HDFS.

Scale and Performance Metrics

The deployed cluster consists of 20 nodes, each with 24 CPU cores, handling roughly 1.2 trillion records (≈50 TB). This volume is comparable to Tencent’s Spark Streaming workload, confirming the architecture’s ability to support the largest data‑intensive applications.

Technical Choices and Architecture Details

Storm’s master‑worker model (Nimbus and Supervisors) relies on Zookeeper for state coordination. Data enters through Spout components, is processed by Bolt s, and emitted as tuples. The platform adopts an “at‑least‑once” guarantee, but switches to incremental updates when hardware constraints prevent the extra 30 % overhead.

Common Pitfalls and Solutions

Flume vs. sink bottleneck: By inspecting Channel error logs, the team identified upstream congestion and resolved it by persisting data to disk and parallelizing processing.

Kafka traffic trap: Monitoring revealed network‑card saturation; moving from a dual‑NIC to a four‑NIC configuration raised throughput from ~190 MB/s to 250‑300 MB/s.

Result update strategy: Incremental updates reduced potential data loss from 50 % (full overwrite) to about 5 % when a Bolt failed.

Data skew: Fine‑grained aggregation (e.g., per‑cell instead of per‑city) mitigated uneven load and prevented partial result loss.

Redis throughput limits: When update rates exceeded ~500 k per minute, the single‑threaded Redis became a bottleneck; sharding the data across multiple Redis instances solved the issue.

Latency Measurement

Two latency components are tracked: (1) the delay from Flume ingestion to Kafka entry, and (2) end‑to‑end latency from Flume input to final Storm output. The team injects special test records at the Flume edge, reads them back via Storm, and computes a statistical distribution (baseline ≈15 s, with tails up to 35 s).

Data Cleaning and Filtering

Rather than discarding malformed records outright, the pipeline tolerates minor data quality issues, allowing imperfect records to pass if they can still contribute to the final result.

Active User Counting per Cell

To avoid memory explosion when tracking per‑user IDs, the solution employs probabilistic data structures. After evaluating Bloom Filters and HyperLogLog, the team selected Bloom Filters for their lower error rate despite higher resource consumption.

Monitoring the Stream Computing System

Beyond traditional host‑level metrics, the monitoring stack focuses on data backlog detection, using custom dashboards and alerts to surface accumulation in Kafka, Storm, or Redis pipelines.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

stream processingReal-time analyticsRedisKafkaBig Data ArchitectureFlumeApache Storm
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.