How to Build a Scalable Distributed Message Governance Platform for High Availability

This article shares Haro's practical experience in designing and operating a distributed message governance platform that unifies RocketMQ, Kafka, and other middleware, covering metrics, monitoring, alerting, scenario‑based controls, and high‑availability strategies to keep microservices reliable under sudden traffic spikes.

Alibaba Cloud Native

Jun 16, 2021

Background

Haro has evolved into a comprehensive mobility platform covering two‑wheel and four‑wheel services, and its rapid business growth brings continuously increasing traffic. Sudden traffic bursts often cause major incidents, making traffic governance and system high‑availability essential.

Challenges

The team identified several pain points:

Message‑queue overload and flow‑control failures (e.g., RabbitMQ cluster throttling).

Service‑level database overload during peak traffic.

Insufficient monitoring of client usage, version compliance, and cluster health.

Design Principles

The platform aims to hide the complexities of underlying middleware (RocketMQ, Kafka) by providing a unified API that dynamically routes messages based on a unique identifier. It integrates resource control, search, monitoring, alerting, inspection, disaster recovery, and visual operations into a single solution.
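
As a rough illustration of routing by a unique identifier, the sketch below binds a resource identifier to a backend and topic, then re-binds it at runtime without touching the caller. The class name `UnifiedProducer`, the `bind`/`send` methods, and the in-memory senders are all hypothetical stand-ins, not the platform's actual SDK.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical backend sender: (topic, payload) -> None. A real SDK would wrap
# the RocketMQ or Kafka client here.
Sender = Callable[[str, bytes], None]


@dataclass
class UnifiedProducer:
    """Routes each send by a resource identifier instead of a concrete cluster."""
    backends: Dict[str, Sender]
    routes: Dict[str, Tuple[str, str]] = field(default_factory=dict)

    def bind(self, resource_id: str, backend: str, topic: str) -> None:
        # Rebinding at runtime moves a resource to another cluster or middleware.
        if backend not in self.backends:
            raise ValueError(f"unknown backend: {backend}")
        self.routes[resource_id] = (backend, topic)

    def send(self, resource_id: str, payload: bytes) -> str:
        # Look up the current route, hand the payload to that backend,
        # and return the backend name so callers can log where it went.
        backend, topic = self.routes[resource_id]
        self.backends[backend](topic, payload)
        return backend


# In-memory stand-ins so the sketch runs without any broker.
sent: List[Tuple[str, str, bytes]] = []
producer = UnifiedProducer(backends={
    "rocketmq": lambda topic, body: sent.append(("rocketmq", topic, body)),
    "kafka": lambda topic, body: sent.append(("kafka", topic, body)),
})
producer.bind("order-created", "rocketmq", "ORDER_TOPIC")
producer.send("order-created", b"{}")
producer.bind("order-created", "kafka", "order_topic")  # re-route, caller unchanged
producer.send("order-created", b"{}")
```

The business code only ever refers to `"order-created"`; which middleware carries the message is a configuration decision.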

Key Metrics

Core indicators include:

Message send/consume speed.

Message send/consume latency.

Message size.

Node health (heartbeat response time).

Link identifiers (traceId/rpcId).

Client SDK version.
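
One way to picture these indicators is as a single record reported by the client SDK and aggregated per reporting window. The schema below (`MessageSample`, its field names, and the `summarize` helper) is an assumption made for illustration; the article does not describe the actual reporting format.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MessageSample:
    """One send or consume event as a client SDK might report it (hypothetical schema)."""
    resource_id: str
    trace_id: str        # link identifier for end-to-end tracing
    sdk_version: str
    size_bytes: int
    latency_ms: float
    timestamp_s: float   # epoch seconds when the event completed


def summarize(samples: List[MessageSample], window_s: float) -> dict:
    """Aggregate samples from one reporting window into speed, latency, and size."""
    if not samples or window_s <= 0:
        return {"tps": 0.0, "avg_latency_ms": 0.0,
                "max_latency_ms": 0.0, "max_size_bytes": 0}
    latencies = [s.latency_ms for s in samples]
    return {
        "tps": len(samples) / window_s,
        "avg_latency_ms": sum(latencies) / len(latencies),
        "max_latency_ms": max(latencies),
        "max_size_bytes": max(s.size_bytes for s in samples),
    }
```

Node health (heartbeat response time) is a property of brokers rather than of individual messages, so it is left out of this per-message record.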

Scenario‑Based Governance

Typical scenarios and corresponding controls:

Instant traffic spikes: Detect TPS surge, smooth traffic via pre‑warming.

Large messages: Flag messages >10 KB, encourage compression.

Outdated client versions: Report SDK version, push upgrades.

Removing consumers from traffic and restoring them: Listen for removal events, then pause or resume consumption on the affected instance.

Send/consume latency: Alert when latency exceeds thresholds.

Improving troubleshooting efficiency: Index messages by msgId, embed traceId in headers for end‑to‑end tracing.
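
Several of the scenarios above reduce to threshold checks, which could be sketched as follows. Only the 10 KB size limit comes from the list; the surge ratio, latency limit, minimum SDK version, and the linear warm-up ramp are assumed values and an assumed smoothing strategy chosen for illustration.

```python
from typing import List, Tuple

LARGE_MESSAGE_BYTES = 10 * 1024      # ">10 KB" threshold from the list above
SURGE_RATIO = 3.0                    # assumed: current TPS relative to baseline TPS
LATENCY_LIMIT_MS = 500.0             # assumed alert threshold
MIN_SDK_VERSION = (1, 4, 0)          # assumed minimum supported SDK version


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn '1.4.0' into (1, 4, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def evaluate(tps: float, baseline_tps: float, size_bytes: int,
             latency_ms: float, sdk_version: str) -> List[str]:
    """Return the names of the governance rules this observation violates."""
    findings = []
    if baseline_tps > 0 and tps / baseline_tps >= SURGE_RATIO:
        findings.append("tps_surge")
    if size_bytes > LARGE_MESSAGE_BYTES:
        findings.append("large_message")
    if latency_ms > LATENCY_LIMIT_MS:
        findings.append("high_latency")
    if parse_version(sdk_version) < MIN_SDK_VERSION:
        findings.append("outdated_sdk")
    return findings


def warmup_limit(elapsed_s: float, target_tps: float, warmup_s: float,
                 cold_fraction: float = 0.3) -> float:
    """Linearly raise the permitted TPS from a cold fraction to the full target."""
    if warmup_s <= 0 or elapsed_s >= warmup_s:
        return target_tps
    progress = max(elapsed_s, 0.0) / warmup_s
    return target_tps * (cold_fraction + (1 - cold_fraction) * progress)
```

For example, `evaluate(900, 200, 20 * 1024, 50, "1.3.9")` reports a TPS surge, a large message, and an outdated SDK, while `warmup_limit` caps throughput lower at the start of a ramp and releases it gradually.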

RocketMQ Specific Issues and Fixes

Issue 1 – CPU spikes on RocketMQ nodes

Kernel logs from the affected nodes showed page allocation failures in the java process:

2020-03-16T17:56:07.505715+08:00 VECS0xxxx kernel: [] ? __alloc_pages_nodemask+0x7e1/0x960
2020-03-16T17:56:07.505717+08:00 VECS0xxxx kernel: java: page allocation failure. order:0, mode:0x20
2020-03-16T17:56:07.505719+08:00 VECS0xxxx kernel: Pid: 12845, comm: java Not tainted 2.6.32-754.17.1.el6.x86_64 #1
...

Solution: Upgrade the OS from CentOS 6 (kernel 2.6) to CentOS 7 (kernel 3.10), which eliminated the spikes.

Issue 2 – Delayed‑message loss

Delayed messages stopped being consumed because the delayOffset.json and SCHEDULE_TOPIC_XXXX files had become corrupted. The fix was to delete these files, restart each broker, and then verify that delayed messages were delivered correctly again.

Service Tiering and Deployment

Applications are classified into four tiers (S1–S4) by business and user impact. Core services (S1) are deployed in two isolated environments, Stable and Standalone, so that non‑core traffic cannot affect them: traffic from non‑core services is routed to the Standalone environment, and calls from S1 services to non‑core services are guarded by circuit‑breaker policies.
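
The circuit‑breaker guard on calls from S1 services to non‑core services can be pictured with a minimal breaker like the one below. This is a generic sketch of the pattern, not the policy engine the team uses; the class name, the consecutive-failure rule, and the default limits are assumptions.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Minimal breaker: opens after consecutive failures, retries after a cooldown."""

    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.clock = clock                   # injectable so tests can control time
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], object], fallback: object = None) -> object:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown_s:
                return fallback              # open: skip the non-core call entirely
            self.opened_at = None            # cooldown over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback
        self.failures = 0                    # any success closes the breaker
        return result
```

A core service would wrap each non‑core dependency in its own breaker, for example `breaker.call(lambda: fetch_recommendations(user), fallback=[])`, where `fetch_recommendations` is a hypothetical non‑core call. When the dependency fails repeatedly, the core path degrades to the fallback instead of waiting on it.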

High‑Availability Platform Architecture

The platform provides:

Unified SDK covering RocketMQ and Kafka.

Dynamic configuration that takes effect in real time.

Fine‑grained traffic statistics per resource and IP node.

Comprehensive health checks: node count, heartbeat latency, write/consume TPS watermarks, TPS change rates.

Automatic alerts for consumption backlog, speed drops, node offline, and partition imbalance.
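
The health checks and alerts listed above could be combined into a single inspection pass along these lines. All thresholds are assumed values, the function and alert names are hypothetical, and partition imbalance is approximated here by comparing the largest per-partition backlog with the mean, which may differ from how the platform defines it.

```python
from typing import Dict, List

HEARTBEAT_LIMIT_MS = 200.0     # assumed heartbeat latency limit
TPS_CHANGE_LIMIT = 0.5         # assumed: alert on a >50% change between intervals
BACKLOG_LIMIT = 100_000        # assumed total unconsumed-message limit
IMBALANCE_RATIO = 2.0          # assumed: max partition backlog vs. mean backlog


def check_cluster(expected_nodes: int, heartbeats_ms: Dict[str, float],
                  tps_now: float, tps_prev: float, tps_high_watermark: float,
                  partition_lag: List[int]) -> List[str]:
    """Return alert names for one inspection of a cluster or resource."""
    alerts = []
    if len(heartbeats_ms) < expected_nodes:
        alerts.append("node_offline")
    if any(latency > HEARTBEAT_LIMIT_MS for latency in heartbeats_ms.values()):
        alerts.append("slow_heartbeat")
    if tps_now > tps_high_watermark:
        alerts.append("tps_above_watermark")
    if tps_prev > 0 and abs(tps_now - tps_prev) / tps_prev > TPS_CHANGE_LIMIT:
        alerts.append("tps_change_rate")
    total_lag = sum(partition_lag)
    if total_lag > BACKLOG_LIMIT:
        alerts.append("consumption_backlog")
    if partition_lag and total_lag > 0:
        mean_lag = total_lag / len(partition_lag)
        if max(partition_lag) > IMBALANCE_RATIO * mean_lag:
            alerts.append("partition_imbalance")
    return alerts
```

Because the change-rate check uses the absolute difference, it fires on sudden drops in consumption speed as well as on surges.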

Visual dashboards illustrate cluster health, service‑level metrics, and traffic distribution.

Conclusion

Effective message and service governance requires clear key metrics, tiered service classification, scenario‑driven controls, and a unified platform that abstracts middleware complexity while providing real‑time monitoring, alerting, and high‑availability mechanisms.
