Disaster Recovery Design and Practices for Vivo Push System
Vivo’s push platform achieves high‑availability disaster recovery by deploying multi‑region broker clusters, implementing dual‑active logic nodes across two data centers, adding a Kafka‑backed buffering layer for traffic spikes, and using a hybrid Redis‑plus‑disk KV storage scheme to ensure durable, real‑time message delivery.
Vivo's push platform provides developers with a stable, reliable long‑connection service for real‑time message delivery, supporting tens of billions of notifications per day with sub‑second latency.
The system consists of an access gateway, logical push nodes, and a long‑connection layer (Broker) that maintains connections with mobile terminals.
Key characteristics are high concurrency, massive message volume, and timely delivery: the current peak push speed is 1.4 million messages per second, daily volume reaches up to 20 billion messages, and the end‑to‑end online delivery rate is 99.9%.
To guarantee availability under these demands, the article examines disaster‑recovery measures in three areas: system architecture, traffic spikes, and storage.
System architecture disaster recovery
Originally all Brokers were deployed in East China, causing cross‑region latency and a single VPC bottleneck. The architecture was optimized by deploying Brokers in three regions (North, East, South China) and using nearest‑region access. A global scheduler and dispatcher enable real‑time traffic rerouting when a region’s Broker cluster or public network fails, limiting impact to the affected region only.
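The rerouting behavior described above can be sketched as a small scheduler that prefers the client's nearest region and falls back to any healthy region when a Broker cluster fails. This is a minimal illustration only; the region names, class, and health-tracking mechanism are assumptions, not Vivo's actual implementation.

```python
# Hypothetical sketch of nearest-region access with failover rerouting.
REGIONS = ["north", "east", "south"]  # assumed region identifiers

class GlobalScheduler:
    def __init__(self):
        # Track per-region Broker health; all regions start healthy.
        self.healthy = {r: True for r in REGIONS}

    def mark_down(self, region):
        # Called when monitoring detects a Broker cluster or network failure.
        self.healthy[region] = False

    def pick_broker_region(self, client_region):
        # Prefer the client's nearest region; otherwise reroute to any
        # healthy region so only the failed region's users are affected.
        if self.healthy.get(client_region):
            return client_region
        for r in REGIONS:
            if self.healthy[r]:
                return r
        raise RuntimeError("no healthy broker region available")
```

In practice the dispatcher would push updated routing decisions to clients in real time; this sketch only captures the selection rule.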
Logic layer disaster recovery
The logical push nodes were moved from a single data center to a same‑city dual‑active deployment across two Vivo IDCs, with traffic split by routing rules. Data storage remains primarily in a single data center due to cost and synchronization considerations.
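A common way to split traffic by routing rules in a dual‑active setup is a stable hash on a request key, so each device consistently lands in the same IDC. The sketch below assumes this approach; the IDC names and the choice of CRC32 are illustrative, not drawn from the source.

```python
import zlib

IDCS = ["idc-a", "idc-b"]  # hypothetical names for the two same-city IDCs

def route_idc(device_id: str) -> str:
    # A stable hash of the device id decides which IDC's logic nodes
    # handle the push, splitting traffic roughly 50/50. The same device
    # always routes to the same IDC, which keeps session state local.
    return IDCS[zlib.crc32(device_id.encode()) % len(IDCS)]
```

On an IDC failure, a routing layer like this would simply shrink `IDCS` to the surviving data center, which is the operational benefit of the dual‑active design.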
Traffic spike disaster recovery
To handle sudden traffic surges without over‑provisioning, an optimization adds a buffering channel at the access layer. Excess traffic is diverted to a Kafka‑based message queue; a bypass access layer (docker‑deployed, auto‑scaling) consumes the queue at a controllable rate. Downstream flow monitoring triggers throttling when usage reaches ~80 % of capacity, allowing the queue to absorb bursts and later drain them when load subsides.
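The throttle‑and‑buffer flow above can be sketched as follows: once downstream usage reaches ~80% of capacity, excess messages are diverted to a queue (standing in for the Kafka topic), and a bypass consumer drains them at a controlled rate once load subsides. The capacity figure, class, and in‑memory deque are assumptions for illustration.

```python
from collections import deque

CAPACITY = 1000          # hypothetical downstream capacity (msgs in flight)
THROTTLE_RATIO = 0.8     # divert once usage reaches ~80% of capacity

class SpikeBuffer:
    def __init__(self):
        self.inflight = 0
        self.queue = deque()   # stands in for the Kafka buffering topic

    def accept(self, msg):
        # Flow monitoring check: at >= 80% usage, divert to the buffer
        # channel instead of pushing downstream directly.
        if self.inflight >= CAPACITY * THROTTLE_RATIO:
            self.queue.append(msg)
            return "buffered"
        self.inflight += 1
        return "direct"

    def drain(self, batch=100):
        # Bypass access layer consumes the queue at a controllable rate,
        # but only while downstream usage stays below the throttle line.
        n = 0
        while self.queue and n < batch and self.inflight < CAPACITY * THROTTLE_RATIO:
            self.queue.popleft()
            self.inflight += 1
            n += 1
        return n
```

In the real system the bypass consumers are Docker‑deployed and auto‑scaled, so drain throughput can grow with the backlog rather than being a fixed batch size.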
Storage disaster recovery for Redis
Three schemes were evaluated:
1. Dual‑write to an equivalent standby Redis cluster (requires double write operations).
2. RDB + AOF synchronization from the primary Redis to a standby cluster (no client‑side changes, but possible sync lag).
3. Use a smaller disk‑based KV store compatible with the Redis protocol for asynchronous persistence of group‑push messages; single‑push messages stay in Redis. Group‑push messages are first written to Kafka, then consumed by bypass nodes to store in the disk KV. This approach saves resources while preserving message durability.
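The chosen hybrid scheme boils down to a routing decision at write time: single‑push payloads take the hot in‑memory path, while group‑push payloads go through Kafka to the disk KV asynchronously. The sketch below assumes injected `redis_set` and `kafka_send` callables and a made‑up topic name; it illustrates the split, not Vivo's actual interfaces.

```python
# Hypothetical sketch of the hybrid Redis/disk-KV write path.

def store_message(msg_type, key, payload, redis_set, kafka_send):
    if msg_type == "single":
        # Single-push messages stay on the low-latency in-memory path.
        redis_set(key, payload)
        return "redis"
    # Group-push messages are written to Kafka first; bypass nodes
    # consume the topic and persist them in the disk-based KV store.
    kafka_send("group-push-topic", key, payload)
    return "kafka->disk-kv"
```

Because one group push fans out to many devices, keeping its (shared) payload on disk instead of in Redis is where most of the memory savings come from.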
The article concludes that the push system now enjoys multi‑region Broker deployment, a same‑city dual‑active logic layer, elastic traffic buffering with Kafka, and a hybrid Redis/disk‑KV storage strategy, with ongoing plans to evolve toward dual‑data‑center and multi‑center architectures.
vivo Internet Technology