How China Mobile’s Real‑Time Computing Platform Scales Billions of Events with Flink

This article describes how China Mobile (Suzhou) Software Technology moved its real‑time computing from Storm to Flink. It covers the platform's multi‑version engine and log‑retrieval designs, the optimization of the signalling data pipeline, stability practices around ZooKeeper, and future work on resource scaling and data‑lake integration.

Alibaba Cloud Developer

Mar 7, 2022

1. Real‑Time Computing Platform Overview

China Mobile (Suzhou) Software Technology Co., Ltd., a wholly‑owned subsidiary of China Mobile, builds cloud infrastructure, provides cloud services, and develops the cloud ecosystem. Its products, centered on Mobile Cloud, serve telecom, government, finance, transportation and other sectors.

The real‑time computing engine evolved through several stages:

2015‑2016: first‑generation engine Apache Storm.

2017: research on Apache Spark Streaming, integrated with a self‑developed framework to reduce operations cost.

2018: Flink adopted after studying streaming literature, meeting growing cloud demands.

2019‑2020: cloud services launched on public and private clouds.

2020‑2021: real‑time data warehouse (LakeHouse) introduced on Mobile Cloud.

Flink now handles signalling data processing, real‑time user profiling, data warehousing, operational monitoring, recommendation and data‑pipeline services.

The platform consists of three major parts:

Service management – task lifecycle hosting and multi‑version support for Flink, SQL, and Spark Streaming jobs.

SQL support – online notebook, syntax checking, UDF and metadata management.

Task operation – log retrieval, performance metrics, message delay and back‑pressure alerts.

Two core designs are highlighted: multi‑version engine support and real‑time task log retrieval.

Multi‑Version Engine Design

To reduce debugging cost and simplify engine upgrades, a task is submitted to the RTP service, its artifacts are uploaded to HDFS, and the job is then launched on a YARN cluster. User JARs often bundle the Apache Flink core package, which can conflict with the engine version the platform provides, so the platform detects such core packages during JAR upload and blocks the submission. This makes bugs cheaper to locate, makes version roll‑back easier, and improves stability.
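The upload check described above can be sketched as a scan of the JAR's entries. This is a minimal illustration, not the platform's actual rule: the package prefixes in `CORE_PREFIXES` and the function names `find_bundled_core_classes` and `validate_upload` are assumptions made for the example.

```python
import io
import zipfile
from typing import List

# Assumption: a JAR "contains the Flink core package" if it bundles classes
# under these paths. The article does not state the platform's real rule.
CORE_PREFIXES = (
    "org/apache/flink/runtime/",
    "org/apache/flink/streaming/api/",
)


def find_bundled_core_classes(jar_bytes: bytes, limit: int = 5) -> List[str]:
    """Return up to `limit` entries that look like bundled Flink core classes."""
    hits: List[str] = []
    # A JAR is a ZIP archive, so the standard zipfile module can list it.
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        for name in jar.namelist():
            if name.endswith(".class") and name.startswith(CORE_PREFIXES):
                hits.append(name)
                if len(hits) >= limit:
                    break
    return hits


def validate_upload(jar_bytes: bytes) -> None:
    """Reject the upload (hypothetical hook) when core classes are bundled."""
    hits = find_bundled_core_classes(jar_bytes)
    if hits:
        raise ValueError(
            "JAR bundles Flink core classes (for example %s); "
            "declare Flink dependencies as provided instead" % hits[0]
        )
```

Rejecting at upload time, rather than at job start, means the conflict surfaces before any cluster resources are allocated.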

Real‑Time Task Log Retrieval

Log retrieval is needed to verify complex business logic. The design uses a push model: AOP via AspectJ Weaver intercepts Log4j log events, a RateLimiter applies flow control, and the logs are forwarded to Kafka and then indexed in Elasticsearch for search.

With this mechanism, developers can query logs without modifying business code, verify logic efficiently, and avoid storage bottlenecks.
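The push model with flow control can be illustrated with a small sketch. It is not the platform's implementation: a Python `logging.Handler` stands in for the AspectJ interception of Log4j, a simple token bucket stands in for the RateLimiter, and `sink` is a hypothetical callable where a Kafka producer would sit. The class names are invented for the example.

```python
import logging
import time


class TokenBucket:
    """Token bucket: refills at `rate_per_sec`, holds at most `capacity` tokens."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = float(rate_per_sec)
        self.capacity = float(capacity)
        self.tokens = float(capacity)
        self.clock = clock
        self.last = clock()

    def try_acquire(self):
        # Refill according to elapsed time, then try to take one token.
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PushLogHandler(logging.Handler):
    """Pushes formatted records to `sink`; drops records over the rate limit."""

    def __init__(self, sink, limiter):
        super().__init__()
        self.sink = sink          # hypothetical: e.g. a function that sends to Kafka
        self.limiter = limiter
        self.dropped = 0

    def emit(self, record):
        if self.limiter.try_acquire():
            self.sink(self.format(record))
        else:
            # Dropping instead of blocking keeps logging from slowing the job.
            self.dropped += 1
```

Because the handler is attached to the logger rather than called from job code, business logic stays unchanged, which mirrors the non-intrusive goal of the AOP approach.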

2. Optimization of China Mobile Signalling Business

The signalling business aggregates massive user‑location data (≈10 PB per day) for government planning, traffic analysis, tourism, etc. Data passes through Flume to Hadoop, but several issues arose:

Flume channel full alerts.

Firewall limits.

Kafka write timeouts.

Unstable Spark Streaming processing.

These problems fall into two groups: write‑performance bottlenecks and architectural complexity.

Optimizations included tuning firewall ports, Kafka server parameters, and client‑side settings (batch.size = 256 MB, buffer.memory = 128 MB, concurrency = 4) to approach network‑card limits. Compression gains were limited due to Kafka version constraints.

Ultimately Flink replaced Flume, improving ingestion performance, stabilizing processing, clarifying component responsibilities, and reducing development and O&M costs, achieving roughly a one‑third performance boost.

3. Stability Practices

Job stability concerns include failures, latency, OOM, and unexpected restarts. Mitigations involve physical isolation, service degradation, enhanced monitoring, and service splitting.

ZooKeeper network interruptions can cause mass job restarts, because the Curator 2.0 client used by Flink gives up leadership as soon as the connection enters the Suspended state. The issue was fixed in Flink 1.14; earlier versions required custom LeaderLatch handling and modifications to ZooKeeperCheckpointIDCounter.
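The difference between revoking leadership on Suspended and tolerating it can be shown with a small state machine. This is only a sketch of the idea, not the code in Flink or Curator: `LeaderLatchSketch` and its methods are hypothetical, and only the connection state names (CONNECTED, SUSPENDED, RECONNECTED, LOST) follow Curator's terminology.

```python
from enum import Enum


class ConnState(Enum):
    CONNECTED = "CONNECTED"
    SUSPENDED = "SUSPENDED"
    RECONNECTED = "RECONNECTED"
    LOST = "LOST"


class LeaderLatchSketch:
    """Hypothetical latch that reacts to ZooKeeper connection state changes."""

    def __init__(self, on_revoke, revoke_on_suspend=False):
        self.on_revoke = on_revoke
        self.revoke_on_suspend = revoke_on_suspend  # True models the old behavior
        self.is_leader = False
        self.suspended = False

    def grant(self):
        self.is_leader = True

    def _revoke(self):
        if self.is_leader:
            self.is_leader = False
            self.on_revoke()  # in a real system this is what triggers job restarts

    def state_changed(self, state):
        if state is ConnState.SUSPENDED:
            self.suspended = True
            if self.revoke_on_suspend:
                self._revoke()
            # Otherwise keep leadership and wait for RECONNECTED or LOST.
        elif state in (ConnState.CONNECTED, ConnState.RECONNECTED):
            self.suspended = False
        elif state is ConnState.LOST:
            self.suspended = False
            self._revoke()
```

With `revoke_on_suspend=False`, a brief network blip produces SUSPENDED followed by RECONNECTED and no restart, while a truly lost connection still revokes leadership. The trade‑off is a window in which the leader's status is uncertain, so the tolerant mode depends on LOST being reported reliably.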

4. Future Exploration

Future work focuses on:

Resource utilization – research on elastic scaling, and queue management with Apache YuniKorn on Kubernetes for Flink on cloud.

Data lake – unified stream‑batch gateway, data lineage, asset, and quality services.
