How Vivo Built a Highly Available Push System: Multi‑Region Architecture, Real‑Time Traffic Scheduling, and Disaster‑Recovery Strategies

This article analyzes the design of Vivo's push notification platform, detailing its high‑concurrency requirements, three‑region long‑connection deployment, traffic‑scheduling bypass layer, and layered storage disaster‑recovery solutions, while explaining the trade‑offs and performance metrics behind each architectural decision.

Architect

1. Push System Overview

Vivo’s push platform delivers real‑time messages to mobile devices over a persistent long connection between cloud and client. The service handles up to 1.4 million pushes per second, processes more than 20 billion messages per day, and achieves a 99.9 % end‑to‑end delivery rate. High concurrency and unpredictable traffic spikes require layered availability mechanisms.

2. Disaster‑Recovery Architecture

2.1 Long‑Connection Layer

The original design placed all broker nodes in the East‑China region, which caused two problems:

Geographic latency: Users in North and South China had to traverse a long distance to reach the East‑China broker, degrading connection stability.

Single VPC bottleneck: One VPC link between logical IDC nodes and the East‑China broker became a bandwidth choke point and a single point of failure.

To mitigate these issues the architecture was refactored to a three‑region deployment (North, East, South China). Devices now register with a dispatcher that returns IPs for all three regions; the client initially connects to the nearest broker. If a broker cluster or its public network fails, only the affected region loses service while the other regions continue operating.
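The client‑side selection described above can be sketched as follows. This is a minimal illustration, not Vivo's client code: the `BrokerEndpoint` type, the use of measured round‑trip time as the meaning of "nearest", the `is_reachable` probe, and the sample IPs (taken from a documentation address range) are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class BrokerEndpoint:
    region: str    # "north", "east", or "south"
    ip: str        # address returned by the dispatcher (sample values only)
    rtt_ms: float  # hypothetical client-measured round-trip time


def pick_broker(
    endpoints: Iterable[BrokerEndpoint],
    is_reachable: Callable[[BrokerEndpoint], bool],
) -> Optional[BrokerEndpoint]:
    """Try brokers from nearest to farthest and return the first reachable one."""
    for endpoint in sorted(endpoints, key=lambda e: e.rtt_ms):
        if is_reachable(endpoint):
            return endpoint
    return None  # every region is unreachable


# Sample dispatcher response: one entry per region.
ENDPOINTS = [
    BrokerEndpoint("north", "203.0.113.10", rtt_ms=48.0),
    BrokerEndpoint("east", "203.0.113.20", rtt_ms=25.0),
    BrokerEndpoint("south", "203.0.113.30", rtt_ms=9.0),
]
```

With all regions healthy, a client in this example would connect to the South‑China broker; if that region's probe fails, the same call falls through to East China without another round trip to the dispatcher.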

A global scheduler was added to orchestrate failover. When a region’s broker reaches connection limits or its VPC fails, the scheduler issues a policy that forces devices in that region to fetch a new IP set from the dispatcher and reconnect to a healthy broker. The policy is withdrawn once the region recovers.
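One way to picture the scheduler's issue‑and‑withdraw cycle is the reconciliation loop below. It is a sketch under stated assumptions: the `RegionStatus` fields, the `GlobalScheduler` class, and the idea of comparing wanted policies against active ones are illustrative and not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Iterable, Set, Tuple


@dataclass
class RegionStatus:
    region: str
    connections: int       # current long connections on the region's brokers
    connection_limit: int  # hypothetical capacity limit for the region
    vpc_healthy: bool      # whether the region's VPC link is up


def regions_needing_redirect(statuses: Iterable[RegionStatus]) -> Set[str]:
    """A region needs a redirect policy if it is at its limit or its VPC is down."""
    return {
        s.region
        for s in statuses
        if s.connections >= s.connection_limit or not s.vpc_healthy
    }


@dataclass
class GlobalScheduler:
    active_policies: Set[str] = field(default_factory=set)

    def reconcile(self, statuses: Iterable[RegionStatus]) -> Tuple[Set[str], Set[str]]:
        """Return (issued, withdrawn) policy sets for this evaluation round."""
        wanted = regions_needing_redirect(statuses)
        issued = wanted - self.active_policies
        withdrawn = self.active_policies - wanted
        self.active_policies = wanted
        return issued, withdrawn
```

Devices in an `issued` region would be told to fetch a fresh IP set from the dispatcher; a region that appears in `withdrawn` has recovered and goes back to normal handling.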

2.2 Logic Layer

The logic tier originally ran in a single IDC, so a power loss at that IDC would take the whole service down. A “same‑city active‑active” deployment was introduced: two IDC sites (IDC1 and IDC2) host identical gateway instances that split traffic according to routing rules. The persistent data store remains in IDC1 to avoid cross‑IDC replication latency and cost.

This configuration gives the logic tier immediate failover while the data store stays single‑site, so an IDC1 outage still affects storage. The design trades some availability for lower data‑sync complexity.
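The source does not spell out the routing rules, so the sketch below assumes one plausible form: a deterministic, weighted split keyed on a hypothetical `app_id`, with unhealthy IDCs excluded. The function name, the weights, and the hashing scheme are illustrative assumptions.

```python
import hashlib
from typing import Dict, Set


def route(app_id: str, weights: Dict[str, int], healthy: Set[str]) -> str:
    """Pick an IDC for a request using a stable hash and per-IDC weights."""
    # Only healthy IDCs with a positive weight may receive traffic.
    candidates = {idc: w for idc, w in weights.items() if idc in healthy and w > 0}
    if not candidates:
        raise RuntimeError("no healthy IDC available")

    total = sum(candidates.values())
    # A stable hash keeps the same app on the same IDC between calls.
    bucket = int(hashlib.sha256(app_id.encode("utf-8")).hexdigest(), 16) % total

    for idc in sorted(candidates):
        if bucket < candidates[idc]:
            return idc
        bucket -= candidates[idc]
    raise AssertionError("unreachable: bucket is always below the weight total")
```

When IDC2 drops out of `healthy`, every request resolves to IDC1 with no configuration change, which is the failover behaviour the active‑active layout is meant to provide.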

2.3 Traffic‑Control Layer

Historical spikes (e.g., breaking news) caused overloads. Two contrasting designs were evaluated:

Traditional over‑provisioning: Deploy enough machines to cover peak historical traffic. This incurs high cost and still fails if a surge exceeds the provisioned capacity.

Optimized buffering bypass: Insert a Kafka‑backed buffer between the access tier and the core push pipeline. Excess requests are queued in Kafka; a Docker‑based bypass service consumes the queue at a rate limited by CPU load. When downstream push nodes reach 80 % of a safe utilization threshold, the access tier is rate‑limited, causing further traffic to accumulate in Kafka. Once downstream load drops, the limit is lifted and the bypass service scales up to drain the backlog quickly.

Kafka was chosen for its high throughput and because it already serves offline analytics, avoiding additional infrastructure.
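The buffering behaviour can be simulated in a few lines. This is a toy model, not the production service: an in‑memory deque stands in for the Kafka topic, and `normal_rate`, `throttled_rate`, and `safe_load` are hypothetical parameters. Only the 80 % trigger comes from the text.

```python
from collections import deque
from typing import Deque, Iterable, List

THROTTLE_FRACTION = 0.80  # throttle once load reaches 80 % of the safe threshold


class BufferedBypass:
    """Toy model of the bypass: requests queue up and drain at a controlled rate."""

    def __init__(self, normal_rate: int, throttled_rate: int, safe_load: float) -> None:
        self.backlog: Deque[int] = deque()    # stands in for the Kafka topic
        self.normal_rate = normal_rate        # requests forwarded per tick when healthy
        self.throttled_rate = throttled_rate  # requests forwarded per tick when limited
        self.safe_load = safe_load            # hypothetical safe downstream load

    def enqueue(self, requests: Iterable[int]) -> None:
        self.backlog.extend(requests)

    def is_throttled(self, downstream_load: float) -> bool:
        return downstream_load >= THROTTLE_FRACTION * self.safe_load

    def drain(self, downstream_load: float) -> List[int]:
        """Forward one tick's worth of requests; the rest stay in the backlog."""
        rate = self.throttled_rate if self.is_throttled(downstream_load) else self.normal_rate
        count = min(rate, len(self.backlog))
        return [self.backlog.popleft() for _ in range(count)]
```

Nothing is dropped while the limit is active: the surplus waits in the backlog and is forwarded in order once the downstream load falls back under the trigger.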

The bypass service auto‑scales based on CPU metrics, enabling proactive scaling for known peak windows and reactive scaling for unexpected spikes.
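The source gives no scaling formula, so the sketch below assumes a simple proportional rule (the same shape as the Kubernetes Horizontal Pod Autoscaler calculation) plus a `scheduled_floor` that models pre‑scaling for a known peak window. All names and default values are hypothetical.

```python
def desired_replicas(
    current: int,
    cpu_percent: int,
    target_percent: int,
    min_replicas: int = 1,
    max_replicas: int = 50,
    scheduled_floor: int = 0,
) -> int:
    """Scale the replica count in proportion to CPU use, within fixed bounds.

    scheduled_floor models proactive scaling: a minimum replica count set ahead
    of a known peak, regardless of the current CPU reading.
    """
    # Integer ceiling of current * cpu_percent / target_percent.
    proportional = (current * cpu_percent + target_percent - 1) // target_percent
    wanted = max(min_replicas, scheduled_floor, proportional)
    return min(max_replicas, wanted)
```

For example, four replicas running at 90 % CPU against a 60 % target would grow to six, while a scheduled floor of ten would hold the service at ten replicas even when CPU use is low.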

3. Storage Disaster‑Recovery

Push payloads are cached in a Redis cluster. A Redis outage would cause message loss, so three mitigation options were examined:

Option 1 – Dual Redis clusters with write‑through: The push service writes synchronously to two identical Redis clusters. Guarantees availability but doubles Redis capacity and operational complexity.

Option 2 – RDB+AOF sync to a standby Redis: Primary Redis replicates via RDB+AOF to a backup cluster. No application changes required, but replication lag can lead to missed pushes.

Option 3 – Hybrid Redis + Disk‑KV: Single‑push payloads stay in Redis (0.5 ms write latency). Group‑push payloads are asynchronously written to a small disk‑based KV store (≈5 ms latency) via Kafka. During a Redis failure, single‑push messages carry their payload downstream, while group‑push messages are retrieved from the KV store.

Option 3 was selected because it balances cost and performance: only the less latency‑sensitive group‑push payloads are duplicated to the slower KV store, and the asynchronous Kafka pipeline keeps the main push path fast.
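The read and write paths of Option 3 can be sketched with in‑memory stand‑ins. This is an illustration under assumptions, not the production design: plain dicts replace Redis and the disk‑based KV store, a list replaces the Kafka topic, and the class and method names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


class HybridPayloadStore:
    """Toy model: Redis for all payloads, disk KV as a backup for group pushes."""

    def __init__(self) -> None:
        self.redis: Dict[str, str] = {}          # stand-in for the Redis cluster
        self.disk_kv: Dict[str, str] = {}        # stand-in for the disk-based KV store
        self.kafka: List[Tuple[str, str]] = []   # stand-in for the async Kafka pipeline
        self.redis_up = True

    def save(self, msg_id: str, payload: str, is_group: bool) -> None:
        if self.redis_up:
            self.redis[msg_id] = payload
        if is_group:
            # Group payloads are copied asynchronously, off the main push path.
            self.kafka.append((msg_id, payload))

    def flush_async_writes(self) -> None:
        """Simulate the consumer that moves queued payloads into the KV store."""
        while self.kafka:
            msg_id, payload = self.kafka.pop(0)
            self.disk_kv[msg_id] = payload

    def load(self, msg_id: str, is_group: bool,
             inline_payload: Optional[str] = None) -> Optional[str]:
        if self.redis_up and msg_id in self.redis:
            return self.redis[msg_id]
        if is_group:
            return self.disk_kv.get(msg_id)  # fallback for group pushes
        return inline_payload                # single pushes carry their own payload
```

The asymmetry is the point of the design: a single push is small enough to travel with its message, so only group payloads need a second copy.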

4. Summary of Evolution

The push platform progressed through five systematic steps:

Identify latency and single‑point‑of‑failure problems in the long‑connection layer.

Deploy a three‑region broker topology and a global scheduler to achieve regional failover and dynamic IP reassignment.

Introduce same‑city active‑active logic gateways to eliminate logic‑tier outages while retaining a single data store for cost efficiency.

Implement a Kafka‑backed buffering bypass that dynamically scales with CPU load, providing cost‑effective handling of traffic spikes.

Evaluate storage‑DR options and adopt a hybrid Redis + disk‑KV design that preserves message delivery without excessive resource duplication.

Each layer was analyzed for failure modes, compared against alternatives with concrete metrics (e.g., 0.5 ms vs 5 ms write latency, 80 % utilization threshold, 1.4 M pushes/s capacity), and the final choices were justified on the basis of latency, cost, and operational complexity.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
