How to Build a Scalable Distributed Message Governance Platform for High Availability
This article shares Haro's practical experience in designing and operating a distributed message governance platform that unifies RocketMQ, Kafka, and other middleware, covering metrics, monitoring, alerting, scenario‑based controls, and high‑availability strategies to keep microservices reliable under sudden traffic spikes.
Background
Haro has evolved into a comprehensive mobility platform covering two‑wheel and four‑wheel services, and its rapid business growth brings continuously increasing traffic. Sudden traffic bursts often cause major incidents, making traffic governance and system high‑availability essential.
Challenges
The team identified several pain points:
Message‑queue overload and flow‑control failures (e.g., RabbitMQ cluster throttling).
Service‑level database overload during peak traffic.
Insufficient monitoring of client usage, version compliance, and cluster health.
Design Principles
The platform aims to hide the complexities of underlying middleware (RocketMQ, Kafka) by providing a unified API that dynamically routes messages based on a unique identifier. It integrates resource control, search, monitoring, alerting, inspection, disaster recovery, and visual operations into a single solution.
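To make the routing idea concrete, here is a minimal sketch of a unified producer that looks up the backend by a resource identifier. The article does not show the real SDK, so every name here (`UnifiedProducer`, `Route`, the `order.created` identifier, the cluster and topic names) is a hypothetical assumption, and the senders are stand-ins that record calls instead of talking to RocketMQ or Kafka.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical backend sender signature: (physical_topic, payload) -> None.
Sender = Callable[[str, bytes], None]


@dataclass
class Route:
    backend: str  # e.g. "rocketmq" or "kafka"
    cluster: str  # hypothetical cluster name
    topic: str    # physical topic on that cluster


@dataclass
class UnifiedProducer:
    senders: Dict[str, Sender]
    routes: Dict[str, Route] = field(default_factory=dict)

    def update_route(self, resource_id: str, route: Route) -> None:
        """Swap the route for a resource without changing caller code."""
        if route.backend not in self.senders:
            raise ValueError(f"unknown backend: {route.backend}")
        self.routes[resource_id] = route

    def send(self, resource_id: str, payload: bytes) -> Route:
        """Send through whichever backend the resource currently maps to."""
        route = self.routes.get(resource_id)
        if route is None:
            raise KeyError(f"no route for resource {resource_id!r}")
        self.senders[route.backend](route.topic, payload)
        return route


# Stand-in senders that record calls instead of contacting a real broker.
sent: List[Tuple[str, str, bytes]] = []
producer = UnifiedProducer(senders={
    "rocketmq": lambda topic, body: sent.append(("rocketmq", topic, body)),
    "kafka": lambda topic, body: sent.append(("kafka", topic, body)),
})

producer.update_route("order.created", Route("rocketmq", "cluster-a", "ORDER_CREATED"))
producer.send("order.created", b"{}")

# A configuration change moves the same resource to Kafka; callers are unchanged.
producer.update_route("order.created", Route("kafka", "cluster-b", "order-created"))
producer.send("order.created", b"{}")
```

Because callers only ever name the resource identifier, moving a resource between clusters or middleware becomes a configuration change rather than a code change.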
Key Metrics
Core indicators include:
Message send/consume rate (TPS).
Message send/consume latency.
Message size.
Node health (heartbeat response time).
Link identifiers (traceId/rpcId).
Client SDK version.
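One way to collect the indicators above is a per-message sample plus a sliding window that derives rate, latency, and size. This is only an illustrative sketch: the field names, the `WindowStats` class, and the 10-second window are assumptions, not the platform's actual data model.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional, Set


@dataclass(frozen=True)
class MessageSample:
    timestamp: float   # seconds, when the send or consume completed
    latency_ms: float
    size_bytes: int
    trace_id: str
    sdk_version: str


class WindowStats:
    """Keeps only the samples that fall inside the most recent window."""

    def __init__(self, window_seconds: float) -> None:
        self.window_seconds = window_seconds
        self.samples: Deque[MessageSample] = deque()

    def add(self, sample: MessageSample) -> None:
        self.samples.append(sample)
        cutoff = sample.timestamp - self.window_seconds
        # Samples are assumed to arrive in timestamp order.
        while self.samples and self.samples[0].timestamp <= cutoff:
            self.samples.popleft()

    def tps(self) -> float:
        return len(self.samples) / self.window_seconds

    def max_latency_ms(self) -> Optional[float]:
        return max((s.latency_ms for s in self.samples), default=None)

    def max_size_bytes(self) -> Optional[int]:
        return max((s.size_bytes for s in self.samples), default=None)

    def sdk_versions(self) -> Set[str]:
        return {s.sdk_version for s in self.samples}


stats = WindowStats(window_seconds=10.0)
stats.add(MessageSample(0.0, 5.0, 512, "trace-1", "1.2.0"))
stats.add(MessageSample(5.0, 20.0, 2048, "trace-2", "1.2.0"))
stats.add(MessageSample(9.0, 8.0, 1024, "trace-3", "1.3.0"))
stats.add(MessageSample(12.0, 12.0, 4096, "trace-4", "1.3.0"))  # evicts the t=0 sample
```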
Scenario‑Based Governance
Typical scenarios and corresponding controls:
Instant traffic spikes: Detect TPS surge, smooth traffic via pre‑warming.
Large messages: Flag messages >10 KB, encourage compression.
Outdated client versions: Report SDK version, push upgrades.
Removing consumers from traffic and restoring them: Listen for removal and recovery events, then pause or resume consumption accordingly.
Send/consume latency: Alert when latency exceeds thresholds.
Improving troubleshooting efficiency: Index messages by msgId, embed traceId in headers for end‑to‑end tracing.
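Three of the controls above reduce to small checks, sketched here under stated assumptions. Only the 10 KB threshold comes from the text. The growth factor, the baseline floor, and the linear warm-up ramp are hypothetical choices used for illustration, not the platform's actual rules.

```python
LARGE_MESSAGE_BYTES = 10 * 1024  # from the text: flag messages larger than 10 KB


def is_large_message(payload: bytes, threshold: int = LARGE_MESSAGE_BYTES) -> bool:
    """Flag payloads that should be compressed or split."""
    return len(payload) > threshold


def is_tps_surge(previous_tps: float, current_tps: float,
                 max_growth: float = 2.0, floor_tps: float = 1.0) -> bool:
    """Report a surge when TPS grows by more than max_growth between windows.

    floor_tps (an assumed parameter) keeps near-idle topics from alerting
    on their first few messages.
    """
    baseline = max(previous_tps, floor_tps)
    return current_tps > baseline * max_growth


def warmup_limit(elapsed_s: float, warmup_s: float,
                 cold_tps: float, target_tps: float) -> float:
    """Allowed TPS during pre-warming: a linear ramp from cold_tps to target_tps."""
    if elapsed_s >= warmup_s:
        return target_tps
    fraction = max(elapsed_s, 0.0) / warmup_s
    return cold_tps + (target_tps - cold_tps) * fraction


# Halfway through a 60-second warm-up from 100 to 1000 TPS.
halfway_limit = warmup_limit(30.0, 60.0, 100.0, 1000.0)
```

A ramp like this lets a consumer group absorb a burst gradually instead of hitting downstream databases at full rate the moment traffic arrives.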
RocketMQ Specific Issues and Fixes
Issue 1 – CPU spikes on RocketMQ nodes
2020-03-16T17:56:07.505715+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505717+08:00 VECS0xxxx kernel: java: page allocation failure. order:0, mode:0x20
2020-03-16T17:56:07.505719+08:00 VECS0xxxx kernel: Pid: 12845, comm: java Not tainted 2.6.32-754.17.1.el6.x86_64 #1
...
Solution: Upgrade the OS from CentOS 6 (kernel 2.6) to CentOS 7 (kernel 3.10), which eliminated the spikes.
Issue 2 – Delayed‑message loss
Delayed messages stopped being consumed because delayOffset.json and the SCHEDULE_TOPIC_XXXX files were corrupted. The fix was to delete these files, restart each broker, and verify that delayed messages work correctly again.
Service Tiering and Deployment
Applications are classified into four tiers (S1–S4) based on business and user impact. Core services (S1) are deployed in two isolated environments, Stable and Standalone. Calls from non‑core services are routed to the Standalone environment, which shields the Stable environment from non‑core traffic, and calls from S1 to non‑core services are guarded by circuit‑breaker policies.
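The circuit-breaker guard on calls from S1 to non-core services can be illustrated with a simple count-based breaker. This is a generic sketch of the pattern, not the policy engine the team uses. The class name, thresholds, and injectable clock are all assumptions.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitBreaker:
    """Opens after consecutive failures, then allows a trial call after a timeout."""

    def __init__(self, failure_threshold: int = 3, reset_timeout_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], T], fallback: Callable[[], T]) -> T:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()  # open: skip the non-core dependency entirely
            # Timeout elapsed: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            return fallback()
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Demonstration with a fake clock so the timeout can be advanced by hand.
now = [0.0]
attempts = []


def failing_non_core_call() -> str:
    attempts.append(now[0])
    raise RuntimeError("non-core service unavailable")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=10.0, clock=lambda: now[0])
first = breaker.call(failing_non_core_call, lambda: "fallback")
second = breaker.call(failing_non_core_call, lambda: "fallback")  # breaker opens here
third = breaker.call(failing_non_core_call, lambda: "fallback")   # short-circuited
now[0] = 11.0
recovered = breaker.call(lambda: "ok", lambda: "fallback")         # trial call succeeds
```

The point of the fallback path is that a struggling non-core dependency costs the S1 service a cheap default response rather than blocked threads.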
High‑Availability Platform Architecture
The platform provides:
Unified SDK covering RocketMQ and Kafka.
Dynamic configuration that takes effect in real time.
Fine‑grained traffic statistics per resource and IP node.
Comprehensive health checks: node count, heartbeat latency, write/consume TPS watermarks, TPS change rates.
Automatic alerts for consumption backlog, speed drops, node offline, and partition imbalance.
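The four automatic alerts listed above could be evaluated from a periodic cluster snapshot roughly as follows. The snapshot fields, alert names, and threshold values are hypothetical. The article names the alert types but not how they are computed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ClusterSnapshot:
    expected_nodes: int
    alive_nodes: int
    backlog: int                      # messages not yet consumed
    consume_tps: float
    previous_consume_tps: float
    partition_load: Dict[int, float]  # per-partition (or per-queue) message rate


@dataclass
class Thresholds:
    max_backlog: int = 100_000        # assumed value
    max_tps_drop_ratio: float = 0.5   # alert when consume TPS falls by more than 50%
    max_partition_skew: float = 2.0   # busiest partition vs. the mean


def evaluate(snapshot: ClusterSnapshot, limits: Thresholds) -> List[str]:
    alerts: List[str] = []
    if snapshot.backlog > limits.max_backlog:
        alerts.append("consumption_backlog")
    if snapshot.previous_consume_tps > 0:
        drop = (snapshot.previous_consume_tps - snapshot.consume_tps) / snapshot.previous_consume_tps
        if drop > limits.max_tps_drop_ratio:
            alerts.append("consume_speed_drop")
    if snapshot.alive_nodes < snapshot.expected_nodes:
        alerts.append("node_offline")
    loads = list(snapshot.partition_load.values())
    if loads:
        mean = sum(loads) / len(loads)
        if mean > 0 and max(loads) / mean > limits.max_partition_skew:
            alerts.append("partition_imbalance")
    return alerts


unhealthy = ClusterSnapshot(
    expected_nodes=3, alive_nodes=2, backlog=250_000,
    consume_tps=100.0, previous_consume_tps=400.0,
    partition_load={0: 900.0, 1: 50.0, 2: 50.0},
)
healthy = ClusterSnapshot(
    expected_nodes=3, alive_nodes=3, backlog=1_200,
    consume_tps=380.0, previous_consume_tps=400.0,
    partition_load={0: 130.0, 1: 120.0, 2: 130.0},
)
unhealthy_alerts = evaluate(unhealthy, Thresholds())
healthy_alerts = evaluate(healthy, Thresholds())
```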
Visual dashboards illustrate cluster health, service‑level metrics, and traffic distribution.
Conclusion
Effective message and service governance requires clear key metrics, tiered service classification, scenario‑driven controls, and a unified platform that abstracts middleware complexity while providing real‑time monitoring, alerting, and high‑availability mechanisms.
Alibaba Cloud Native
