Practices of Distributed Message Middleware at vivo: From RocketMQ to Kafka and Pulsar
vivo’s Internet Storage team details how it operates RocketMQ for low‑latency online services and Kafka for massive big‑data pipelines, outlines resource isolation, traffic balancing, intelligent throttling, and governance practices, and describes its migration from RabbitMQ and planned shift from Kafka to cloud‑native Pulsar.
Author: vivo Internet Storage Technology Team - Luo Mingbo, Middleware Team - Liu Runyun
This article is compiled from the "2022 vivo Developer Conference" presentation and introduces the application practice of distributed message middleware in ultra‑large data scale scenarios at vivo.
1. Current Operational Status of Distributed Message Middleware at vivo
1.1 Technology Selection
For online services, vivo chose RocketMQ as the core messaging platform due to its rich feature set, high throughput, and ability to handle peak‑shaving, decoupling, and asynchronous communication. For big‑data workloads, Kafka was selected for its high concurrency, high availability, low latency, and massive throughput, serving as the unified data ingestion and real‑time warehouse service.
1.2 Scale Overview
Big‑data side: Kafka clusters handle hundreds of projects, tens of thousands of topics, daily processing of tens of trillions of messages, 99.99% availability, and per‑node processing of hundreds of billions of messages per day. Online side: RocketMQ clusters serve hundreds of projects, thousands of services, processing hundreds of billions of messages daily with 100% availability and average send latency <1 ms.
2. Big‑Data Side Best Practices
2.1 Kafka Overview
Kafka is a high‑throughput distributed publish‑subscribe messaging system originally developed at LinkedIn, open‑sourced in early 2011, and graduated to a top‑level Apache project in 2012.
2.2 Challenges in Ultra‑Large Scale Scenarios
Resource isolation among core, high‑priority, and general business workloads.
Ensuring intra‑cluster traffic balance to avoid resource waste.
Dynamic throttling to maintain stability while minimizing impact on business availability.
Maintaining high availability with diverse client versions over long‑term operation.
2.3 Resource Isolation
vivo combines physical isolation (dedicated clusters for commercial, monitoring, logging, etc.) with logical isolation via resource groups within a cluster, allowing independent service groups to coexist without interference.
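The logical‑isolation idea can be sketched as broker tagging: each broker carries a resource‑group label, and a topic's partitions are only placed on brokers of its own group. The names below (`BROKER_GROUPS`, `assign_partitions`) are illustrative, not vivo's actual implementation.

```python
# Sketch of logical resource-group isolation inside one Kafka cluster.
# Broker ids and group names are hypothetical.

BROKER_GROUPS = {
    1001: "commercial", 1002: "commercial",
    1003: "monitoring", 1004: "monitoring",
    1005: "general",    1006: "general",
}

def brokers_for_group(group):
    """Return the broker ids belonging to one logical resource group."""
    return sorted(b for b, g in BROKER_GROUPS.items() if g == group)

def assign_partitions(topic_group, num_partitions):
    """Round-robin a topic's partitions onto its own group's brokers only,
    so workloads from other groups never share these nodes."""
    brokers = brokers_for_group(topic_group)
    return {p: brokers[p % len(brokers)] for p in range(num_partitions)}
```

Because placement never crosses group boundaries, a traffic spike in the general group cannot degrade the commercial or monitoring brokers, while all groups still share one cluster's metadata and operations tooling.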
2.4 Traffic Balancing
Two‑phase implementation: Phase 1 introduced real‑time traffic, CPU, and disk metrics as load factors for partition migration, reducing intra‑group traffic variance from hundreds of MB/s to tens of MB/s. Phase 2 added partition, leader, replica, and disk balancing, further lowering variance to under ten MB/s, improving resource utilization by ~75%.
2.5 Intelligent Dynamic Throttling
A three‑step process: (1) Multi‑platform diagnosis to decide if throttling adjustment is needed; (2) Intelligent analysis of cluster load to compute optimal thresholds; (3) Automatic real‑time adjustment of throttling limits. This improves burst handling, resource utilization, and reduces operational cost.
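The three steps can be sketched as a diagnose/compute/apply loop. The thresholds, field names, and safety factor below are illustrative assumptions, not vivo's production values.

```python
def diagnose(cluster):
    """Step 1: decide whether the throttle needs adjusting, e.g. when
    a broker is near saturation or has large unused headroom."""
    return cluster["net_util_pct"] > 80 or cluster["net_util_pct"] < 40

def compute_threshold(cluster, safety=0.8):
    """Step 2: derive a limit from current headroom, reserving capacity
    for background work such as replication."""
    capacity = cluster["net_capacity_mb_s"]
    background = cluster["replication_mb_s"]
    return (capacity - background) * safety

def adjust(cluster):
    """Step 3: apply the new limit in real time (modeled here as a
    field update on the cluster state)."""
    if diagnose(cluster):
        cluster["throttle_mb_s"] = compute_threshold(cluster)
    return cluster["throttle_mb_s"]
```

Running this loop continuously lets the throttle track actual load instead of a fixed static limit, which is what improves burst handling without manual intervention.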
2.6 Cluster Governance
Unified metadata management via ZooKeeper, comprehensive governance of node traffic, topic metadata, partition skew, oversized partitions, and consumer lag, ensuring high availability.
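Two of the governance checks named above, consumer lag and partition skew, reduce to simple offset and size arithmetic. A minimal sketch, with made-up thresholds:

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Total lag for one consumer group: log-end offset minus committed
    offset, summed over partitions. Large lag triggers a governance alert."""
    return sum(max(log_end_offsets[p] - committed_offsets.get(p, 0), 0)
               for p in log_end_offsets)

def skewed_partitions(partition_sizes, factor=2.0):
    """Flag oversized partitions: any partition larger than `factor`
    times the mean partition size of the topic."""
    mean = sum(partition_sizes.values()) / len(partition_sizes)
    return sorted(p for p, s in partition_sizes.items() if s > factor * mean)
```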
2.7 Experience Accumulation
Three years of practice yielded capabilities such as rack‑aware placement, elastic scaling, data compression, multi‑platform alerting (user throttling, topic traffic spikes, consumer lag, leader health), and fault perception.
2.8 Architectural Deficiencies of Kafka
Low resource utilization, slow response to business growth, long recovery time, and high failure rate for historical data consumption due to tight coupling of partitions to disks and lack of storage‑compute separation.
2.9 Future Planning – Pulsar
Since 2021, vivo has evaluated Pulsar, a cloud‑native, compute‑storage separated messaging system. Pulsar offers multi‑tenant support, persistent storage, cross‑region replication, and high concurrency. The roadmap includes four stages: project initiation, stability building, capability advancement, and stable operation, targeting daily processing of trillions of messages by 2024.
3. Online Business Side Best Practices
3.1 RocketMQ Overview
RocketMQ, open‑sourced by Alibaba in 2012, provides low latency, high concurrency, high availability, and reliable retry mechanisms.
3.2 Deployment and High‑Availability
Dual‑data‑center hot‑standby architecture with broker deployment in both sites, automatic cross‑site traffic switching on failure, and a BrokerController module for master‑slave failover.
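The automatic cross-site switching can be sketched as a health-check-driven router: traffic stays on the active site while it is healthy and fails over to the standby otherwise. Site names and the class shape are illustrative only.

```python
class DualDCRouter:
    """Minimal sketch of hot-standby traffic routing between two sites."""

    SITES = {"dc-a", "dc-b"}

    def __init__(self):
        self.active = "dc-a"  # primary site serves by default

    def on_health_check(self, healthy_sites):
        """Switch traffic to the standby site when the active one fails."""
        if self.active not in healthy_sites:
            available = (self.SITES - {self.active}) & set(healthy_sites)
            if available:
                self.active = available.pop()
        return self.active
```

Note the router is sticky: once failed over, it keeps serving from the standby rather than flapping back on the first healthy probe, which mirrors the deliberate, controlled switchover a BrokerController-style component performs for master-slave failover.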
3.3 Platform System Architecture
Modules include mq‑rebalance (load balancing), mq‑monitor (metrics collection), mq‑recover (traffic degradation/recovery), and mq‑live (health checks). A RabbitMQ‑connector provides global routing, and future work includes a gRPC‑based unified message gateway.
3.4 Platformized Operations
Configuration management, one‑click topic scaling, broker traffic isolation/attachment, and centralized cluster information improve operational efficiency.
3.5 Monitoring and Alerting
Comprehensive dashboards display production/consumption traffic, latency, and other key metrics. Alerts cover host, cluster, topic/group, and client dimensions, enabling rapid issue detection and resolution.
3.6 Migration from RabbitMQ to RocketMQ
An AMQP message gateway parses AMQP protocol and forwards traffic, while an MQ‑Meta module maps RabbitMQ metadata (Exchange → Topic, Queue → Group, RoutingKey → message header, VirtualHost → Namespace) to RocketMQ. Performance tests show a single 8C16G machine can handle >90k TPS for sending and >60k TPS for pushing 1KB messages.
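The metadata mapping above is mechanical enough to sketch directly. Field names in the returned structure are illustrative, not the MQ‑Meta module's actual schema:

```python
def map_rabbit_to_rocket(exchange, queue, routing_key, vhost):
    """Translate RabbitMQ addressing concepts to RocketMQ equivalents:
    Exchange -> Topic, Queue -> consumer Group,
    RoutingKey -> message header, VirtualHost -> Namespace."""
    return {
        "topic": exchange,
        "group": queue,
        "headers": {"routingKey": routing_key},
        "namespace": vhost,
    }
```

With this mapping held in a metadata service, the AMQP gateway can forward each produced message to the mapped topic and tag it with the routing key, so consumers migrate without application changes.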
3.7 Future Plans for Online Messaging
Exploring RocketMQ 5.0 with compute‑storage separation and POP consumption, building a gRPC‑based unified gateway, and containerizing the middleware for elastic scaling.
4. Summary
vivo has migrated online business messaging from RabbitMQ to RocketMQ and is evolving big‑data messaging from Kafka to Pulsar. The company foresees continued cloud‑native evolution of message middleware to meet rapid business growth and deliver optimal user experience.
vivo Internet Technology