Scaling Millions of IoT Vehicles: RocketMQ & Cloud‑Native Architecture in Action

Facing more than a million concurrently connected vehicles, Chinese IoT leader ZhongRui chose RocketMQ over Kafka and used Alibaba Cloud’s managed service to achieve low‑latency, high‑throughput, fault‑tolerant messaging. The team also adopted cloud‑native microservices, containerization, and serverless techniques to streamline operations and reduce costs.

Alibaba Cloud Native

Overview

The company operates a large‑scale vehicle‑IoT platform that collects sensor data from millions of connected terminals in real time. The platform must forward each data record to multiple downstream systems (online services, offline analytics, third‑party APIs) while guaranteeing low latency, reliable delivery, and the ability to buffer when consumers lag.

Core Messaging Requirements

Distribute incoming messages to several downstream pipelines without duplicating the payload.

Maintain sub‑millisecond latency for online command‑and‑control paths (e.g., real‑time vehicle instructions).

Provide durable buffering so that, if a consumer cannot keep up, messages are persisted and later replayed without loss.
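The first and third requirements can be pictured with a small in-memory model: one stored copy of each message, plus a separate read offset per consumer group. This is only a sketch under stated assumptions, not RocketMQ's API; `TopicLog`, its methods, and the group names are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TopicLog:
    """Append-only log shared by every consumer group (illustrative only)."""
    messages: list = field(default_factory=list)
    offsets: dict = field(default_factory=dict)  # group name -> next index to read

    def publish(self, payload: bytes) -> int:
        # The payload is stored once, no matter how many groups read it.
        self.messages.append(payload)
        return len(self.messages) - 1

    def subscribe(self, group: str) -> None:
        # A new group starts reading from the beginning of the log.
        self.offsets.setdefault(group, 0)

    def poll(self, group: str, max_batch: int = 10) -> list:
        # Each group advances its own offset independently of the others.
        start = self.offsets[group]
        batch = self.messages[start:start + max_batch]
        self.offsets[group] = start + len(batch)
        return batch

    def lag(self, group: str) -> int:
        # Messages stored but not yet read by this group.
        return len(self.messages) - self.offsets[group]


log = TopicLog()
for group in ("online", "offline", "third_party"):
    log.subscribe(group)
for i in range(5):
    log.publish(f"record-{i}".encode())

online_batch = log.poll("online")                  # fast consumer reads everything
offline_batch = log.poll("offline", max_batch=2)   # slow consumer reads a little
```

A lagging group simply has a smaller offset; the messages it has not read stay in the log until it catches up, which is the buffering behavior the third requirement asks for.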

Message‑Queue Landscape Assessment

Initial candidates included traditional brokers such as ActiveMQ and RabbitMQ. Both rely on a non‑distributed core design, so scaling depends on vertical hardware upgrades, which cannot meet the projected million‑messages‑per‑second load.

Kafka Prototype

Kafka’s native partitioned, replicated architecture appeared to satisfy the horizontal‑scalability requirement. However, a series of load tests revealed two critical deficiencies for the target use case:

No protocol‑level delivery guarantee. During a simulated network‑jitter event in the team’s tests, a batch of messages was permanently lost, violating the strict “no‑loss” requirement of financial‑risk‑control workflows.

Extended performance dip on cluster expansion. Adding a new broker triggered the ISR (in‑sync replica) re‑synchronization process, causing a throughput reduction that persisted for more than one hour. No reliable mitigation strategy was identified.

Adoption of RocketMQ

RocketMQ, originally developed by Alibaba and later donated to the Apache Software Foundation, where it is now a top‑level project, shares Kafka’s distributed model but adds several optimizations for transaction‑type workloads:

Protocol‑level guaranteed delivery. The producer receives an explicit acknowledgment from the broker only after the message is durably stored and replicated. A producer that does not receive the acknowledgment can resend, so a network disturbance delays a message rather than silently dropping it.

Low latency and high throughput. Benchmarks from Alibaba’s Double‑11 shopping events demonstrate sub‑millisecond end‑to‑end latency at tens of millions of messages per second.

Built‑in high availability. NameServer clusters provide service discovery; Broker clusters maintain master‑slave replication with automatic failover.
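The acknowledgment-based delivery described in the first point can be sketched as a producer that keeps resending until the broker confirms the write. The code below is a minimal model under stated assumptions, not the RocketMQ client: `FlakyBroker` and `send_with_retry` are hypothetical, the status string is loosely modeled on RocketMQ's `SEND_OK` send status, and the message key and payload are made up.

```python
class FlakyBroker:
    """Hypothetical stand-in for a broker whose first few replies are lost."""

    def __init__(self, fail_first: int):
        self._remaining_failures = fail_first
        self.stored = {}  # message key -> payload

    def send(self, key: str, payload: bytes) -> str:
        if self._remaining_failures > 0:
            self._remaining_failures -= 1
            # Models network jitter: the producer never sees an acknowledgment.
            raise TimeoutError("no acknowledgment received")
        # Storing by key makes a resend of the same message harmless.
        self.stored[key] = payload
        return "SEND_OK"


def send_with_retry(broker, key: str, payload: bytes, max_attempts: int = 5) -> int:
    """Resend until the broker acknowledges; return the number of attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            if broker.send(key, payload) == "SEND_OK":
                return attempt
        except TimeoutError:
            continue  # no ack: treat the message as unsent and try again
    raise RuntimeError(f"message {key!r} not acknowledged after {max_attempts} attempts")


broker = FlakyBroker(fail_first=2)
attempts = send_with_retry(broker, "vin123-000001", b'{"speed_kmh": 62}')
```

The producer never assumes success without an acknowledgment, so a dropped reply results in a resend rather than a lost message. The cost is possible duplicates, which is why the consuming side is usually written to tolerate them.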

Both the open‑source self‑managed version and the Alibaba Cloud managed service were evaluated. The managed offering was selected because it eliminates the operational burden of maintaining NameServer and master/slave failover mechanisms, and it provides instant horizontal scaling.

Managed RocketMQ Deployment Architecture

Key components of the production stack:

Message ingestion layer. Vehicle terminals publish JSON‑encoded telemetry to a dedicated RocketMQ topic via the rocketmq-client SDK.

Message distribution. A single RocketMQ broker instance fans out each record to multiple consumer groups (online command service, offline analytics pipeline, third‑party API gateway) without requiring application‑level duplication.

Buffering and replay. If a consumer group lags, RocketMQ retains the messages in its persisted log until the consumer’s offset catches up, guaranteeing at‑least‑once delivery.

Observability. Alibaba Cloud ARMS is integrated for end‑to‑end latency tracing, broker health metrics, and consumer lag monitoring.

Serverless off‑loading. Non‑critical batch jobs are migrated to Function Compute (FC) to reduce baseline compute costs.
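Because at‑least‑once delivery means a consumer may occasionally see the same record twice, consumers are commonly written to be idempotent. The sketch below shows one way to do that for JSON‑encoded telemetry. It is an illustration only: the field names (`vehicle_id`, `seq`, `speed_kmh`) and the `IdempotentConsumer` class are assumptions, not the platform's actual schema or code.

```python
import json


def encode_telemetry(vehicle_id: str, seq: int, speed_kmh: float) -> bytes:
    """Serialize one telemetry record as UTF-8 JSON (hypothetical schema)."""
    record = {"vehicle_id": vehicle_id, "seq": seq, "speed_kmh": speed_kmh}
    return json.dumps(record, sort_keys=True).encode("utf-8")


class IdempotentConsumer:
    """Processes each (vehicle_id, seq) pair once, even if it is redelivered."""

    def __init__(self):
        self._seen = set()
        self.processed = []

    def handle(self, raw: bytes) -> bool:
        record = json.loads(raw)
        key = (record["vehicle_id"], record["seq"])
        if key in self._seen:
            return False  # duplicate delivery: acknowledge it but skip the work
        self._seen.add(key)
        self.processed.append(record)
        return True


consumer = IdempotentConsumer()
first = encode_telemetry("vin123", 1, 62.5)
second = encode_telemetry("vin123", 2, 63.0)

# The first record is delivered twice, as can happen after a retry or replay.
results = [consumer.handle(m) for m in (first, first, second)]
```

In a real deployment the set of seen keys would need to be bounded or kept in external storage; an unbounded in-memory set is used here only to keep the example short.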

Production Scalability and Reliability

During peak traffic, more than 1 million vehicle terminals are online simultaneously, generating upwards of 1 million messages per second. The managed RocketMQ service automatically scales broker instances and queues, balancing load across the cluster. Since deployment, the system has recorded zero message‑loss incidents and no service‑level failures, which is consistent with the built‑in replication and auto‑scaling mechanisms working as intended.
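One simple way to picture how load stays balanced as brokers are added is round‑robin selection over all available queues. The sketch below is an assumption-laden model rather than a description of how the managed service works internally: broker names, queue counts, and the `make_selector` and `spread` helpers are all hypothetical.

```python
from collections import Counter
from itertools import cycle


def make_selector(queues):
    """Return a function that picks queues in strict rotation."""
    rotation = cycle(queues)
    return lambda: next(rotation)


def spread(queues, message_count: int) -> Counter:
    """Count how many messages each broker receives under round-robin."""
    select = make_selector(queues)
    return Counter(select()[0] for _ in range(message_count))


# Each queue is identified by (broker name, queue id); two queues per broker.
two_brokers = [(b, q) for b in ("broker-a", "broker-b") for q in range(2)]
three_brokers = two_brokers + [("broker-c", q) for q in range(2)]

before = spread(two_brokers, 12)    # load split across two brokers
after = spread(three_brokers, 12)   # same traffic after a third broker joins
```

With the third broker's queues added to the rotation, the same twelve messages land four per broker instead of six, so per-broker load drops without any change to the producers' logic.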

Operational Insights and Future Work

Adopting the cloud‑native managed service freed the engineering team from low‑level HA configuration (e.g., manual master‑slave promotion, DLedger tuning). Ongoing initiatives include:

Full containerization of microservices using Docker/Kubernetes to standardize deployment pipelines.

Deepening integration with ARMS for full‑stack performance dashboards and automated anomaly detection.

Expanding serverless adoption via Function Compute for bursty workloads, further reducing idle resource costs.

Continuous horizontal scaling tests to validate readiness for future growth beyond the current million‑messages‑per‑second baseline.
