Disaster Recovery Design and Practices for Vivo Push System
Vivo’s push platform achieves high‑availability disaster recovery by deploying multi‑region broker clusters, implementing dual‑active logic nodes across two data centers, adding a Kafka‑backed buffering layer for traffic spikes, and using a hybrid Redis‑plus‑disk KV storage scheme to ensure durable, real‑time message delivery.
Vivo's push platform provides developers with a stable, reliable long‑connection service for real‑time message delivery, supporting tens of billions of notifications per day with sub‑second latency.
The system consists of an access gateway, logical push nodes, and a long‑connection layer (Broker) that maintains connections with mobile terminals.
Key characteristics are high concurrency, massive message volume, and timely delivery: the current peak push speed is 1.4 million messages per second, daily volume reaches up to 20 billion messages, and the end‑to‑end online delivery rate is 99.9%.
To guarantee availability under these demands, the article examines disaster‑recovery measures in three areas: system architecture, traffic spikes, and storage.
System architecture disaster recovery
Originally all Brokers were deployed in East China, causing cross‑region latency and a single VPC bottleneck. The architecture was optimized by deploying Brokers in three regions (North, East, South China) and using nearest‑region access. A global scheduler and dispatcher enable real‑time traffic rerouting when a region’s Broker cluster or public network fails, limiting impact to the affected region only.
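The rerouting behavior described above can be sketched as a small scheduler that prefers the client's nearest region and falls back to any healthy region when a Broker cluster fails. This is a minimal illustration only; the region names, class, and health-tracking mechanism are assumptions, not Vivo's actual implementation.

```python
# Hypothetical sketch of nearest-region access with failover rerouting.
REGIONS = ["north", "east", "south"]  # assumed region identifiers

class GlobalScheduler:
    def __init__(self):
        # Track per-region Broker health; all regions start healthy.
        self.healthy = {r: True for r in REGIONS}

    def mark_down(self, region):
        # Called when monitoring detects a Broker cluster or network failure.
        self.healthy[region] = False

    def pick_broker_region(self, client_region):
        # Prefer the client's nearest region; otherwise reroute to any
        # healthy region so only the failed region's users are affected.
        if self.healthy.get(client_region):
            return client_region
        for r in REGIONS:
            if self.healthy[r]:
                return r
        raise RuntimeError("no healthy broker region available")
```

In practice the dispatcher would push updated routing decisions to clients in real time; this sketch only captures the selection rule.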
Logic layer disaster recovery
The logical push nodes were moved from a single data center to a same‑city dual‑active deployment across two Vivo IDCs, with traffic split by routing rules. Data storage remains primarily in a single data center due to cost and synchronization considerations.
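A common way to split traffic by routing rules in a dual‑active setup is a stable hash on a request key, so each device consistently lands in the same IDC. The sketch below assumes this approach; the IDC names and the choice of CRC32 are illustrative, not drawn from the source.

```python
import zlib

IDCS = ["idc-a", "idc-b"]  # hypothetical names for the two same-city IDCs

def route_idc(device_id: str) -> str:
    # A stable hash of the device id decides which IDC's logic nodes
    # handle the push, splitting traffic roughly 50/50. The same device
    # always routes to the same IDC, which keeps session state local.
    return IDCS[zlib.crc32(device_id.encode()) % len(IDCS)]
```

On an IDC failure, a routing layer like this would simply shrink `IDCS` to the surviving data center, which is the operational benefit of the dual‑active design.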
Traffic spike disaster recovery
To handle sudden traffic surges without over‑provisioning, an optimization adds a buffering channel at the access layer. Excess traffic is diverted to a Kafka‑based message queue; a bypass access layer (docker‑deployed, auto‑scaling) consumes the queue at a controllable rate. Downstream flow monitoring triggers throttling when usage reaches ~80 % of capacity, allowing the queue to absorb bursts and later drain them when load subsides.
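The throttle‑and‑buffer flow above can be sketched as follows: once downstream usage reaches ~80% of capacity, excess messages are diverted to a queue (standing in for the Kafka topic), and a bypass consumer drains them at a controlled rate once load subsides. The capacity figure, class, and in‑memory deque are assumptions for illustration.

```python
from collections import deque

CAPACITY = 1000          # hypothetical downstream capacity (msgs in flight)
THROTTLE_RATIO = 0.8     # divert once usage reaches ~80% of capacity

class SpikeBuffer:
    def __init__(self):
        self.inflight = 0
        self.queue = deque()   # stands in for the Kafka buffering topic

    def accept(self, msg):
        # Flow monitoring check: at >= 80% usage, divert to the buffer
        # channel instead of pushing downstream directly.
        if self.inflight >= CAPACITY * THROTTLE_RATIO:
            self.queue.append(msg)
            return "buffered"
        self.inflight += 1
        return "direct"

    def drain(self, batch=100):
        # Bypass access layer consumes the queue at a controllable rate,
        # but only while downstream usage stays below the throttle line.
        n = 0
        while self.queue and n < batch and self.inflight < CAPACITY * THROTTLE_RATIO:
            self.queue.popleft()
            self.inflight += 1
            n += 1
        return n
```

In the real system the bypass consumers are Docker‑deployed and auto‑scaled, so drain throughput can grow with the backlog rather than being a fixed batch size.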
Storage disaster recovery for Redis
Three schemes were evaluated:
1. Dual‑write to an equivalent standby Redis cluster (requires double write operations).
2. RDB + AOF synchronization from the primary Redis to a standby cluster (no client‑side changes, but possible sync lag).
3. Use a smaller disk‑based KV store compatible with the Redis protocol for asynchronous persistence of group‑push messages; single‑push messages stay in Redis. Group‑push messages are first written to Kafka, then consumed by bypass nodes to store in the disk KV. This approach saves resources while preserving message durability.
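The chosen hybrid scheme boils down to a routing decision at write time: single‑push payloads take the hot in‑memory path, while group‑push payloads go through Kafka to the disk KV asynchronously. The sketch below assumes injected `redis_set` and `kafka_send` callables and a made‑up topic name; it illustrates the split, not Vivo's actual interfaces.

```python
# Hypothetical sketch of the hybrid Redis/disk-KV write path.

def store_message(msg_type, key, payload, redis_set, kafka_send):
    if msg_type == "single":
        # Single-push messages stay on the low-latency in-memory path.
        redis_set(key, payload)
        return "redis"
    # Group-push messages are written to Kafka first; bypass nodes
    # consume the topic and persist them in the disk-based KV store.
    kafka_send("group-push-topic", key, payload)
    return "kafka->disk-kv"
```

Because one group push fans out to many devices, keeping its (shared) payload on disk instead of in Redis is where most of the memory savings come from.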
The article concludes that the push system now enjoys multi‑region Broker deployment, a same‑city dual‑active logic layer, elastic traffic buffering with Kafka, and a hybrid Redis/disk‑KV storage strategy, with ongoing plans to evolve toward dual‑data‑center and multi‑center architectures.
vivo Internet Technology