
How Vivo Scaled Its Push Platform with Redis: Lessons in High‑Concurrency Optimization

This article describes how Vivo's push notification platform uses Redis for massive message throughput, the bottlenecks it hit during peak traffic, and the capacity, clustering, hot‑key, and client‑side caching optimizations that cut peak Redis load by about 15% and brought average response time from 1.2 ms down to 0.5 ms.

ITPUB

Feb 18, 2022


Platform Overview

Vivo's push platform delivers messages to mobile users in real time over a stable long‑lived connection between cloud and client. It handles a peak throughput of 1.4 million messages per second and a daily peak of 15 billion messages, with a 99.9% end‑to‑end delivery rate.

Redis in the Platform

To meet high concurrency and low‑latency requirements, the platform uses two Redis clusters: a msg cluster for storing message bodies and a client cluster for token and client information. The architecture stores messages in the msg cluster, checks client status in the client cluster, and manages waiting queues for offline clients.

Online Issues Encountered

During a high‑traffic event (5.2 billion messages in 30 minutes), the msg Redis cluster suffered from excessive connections (up to 24,674) and memory spikes (23.46 GB), causing average response times of ~500 ms and reducing overall system availability to 85%.

Redis Optimization Strategies

The team addressed four main areas:

Capacity: Reduce stored data size, compress large values, and set appropriate expirations.

Hot‑key Skew: Ensure key randomness, limit hotspot concurrency, and use local caching or rate‑limiting.

Cluster Size: Avoid overly large clusters; split when node count becomes a stability risk.

Version Upgrade: Move from Redis 3.x to 4.x to gain PSYNC 2.0, LFU eviction, non‑blocking delete, memory commands, and better memory efficiency.
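As a rough illustration of the "compress large values" point, the sketch below compresses a payload only when it exceeds a size threshold and the compression actually saves space. The 1000-byte threshold, the one-byte marker, and the function names are assumptions made for this example; the article does not describe the encoding the team used.

```python
import zlib

COMPRESS_THRESHOLD = 1000      # bytes; assumed cutoff, not taken from the article
RAW, ZLIB = b"\x00", b"\x01"   # one-byte marker so a reader knows how to decode


def encode_value(payload: bytes) -> bytes:
    """Compress payloads above the threshold, but only if that makes them smaller."""
    if len(payload) > COMPRESS_THRESHOLD:
        packed = zlib.compress(payload)
        if len(packed) < len(payload):
            return ZLIB + packed
    return RAW + payload


def decode_value(stored: bytes) -> bytes:
    """Reverse encode_value by inspecting the marker byte."""
    marker, body = stored[:1], stored[1:]
    return zlib.decompress(body) if marker == ZLIB else body
```

The encoded bytes would then be written with an expiration so that stale entries free memory on their own; with the redis-py client that is typically `set(key, value, ex=ttl_seconds)`.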

Capacity Optimization of msg Cluster

Using RDR, an open‑source RDB analysis tool, the team discovered that ~80% of keys started with mi:, most of which were single‑push messages. They implemented two measures:

Immediately delete single‑push messages after receiving a PUBACK or when they are blocked.

Aggregate identical message contents into a single stored entry and reuse the same message ID for multiple pushes.

Result: total stored size dropped from 3.65 TB to 2.09 TB, a reduction of about 43% (the cluster now holds roughly 57% of its previous volume).
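The two measures can be sketched together with an in-memory stand-in for the msg cluster. Everything here is illustrative: a dict replaces Redis, the class and method names are hypothetical, and deriving the message ID from a content hash is an assumption (the article only says identical contents share one entry and one message ID).

```python
import hashlib


class MessageStore:
    """In-memory stand-in for the msg cluster (a dict instead of Redis)."""

    def __init__(self):
        self.bodies = {}    # message_id -> body
        self.pending = {}   # message_id -> set of client ids still waiting for it

    def put(self, body: bytes, client_ids) -> str:
        # Assumed ID scheme: hash the content so identical bodies share one entry.
        message_id = hashlib.sha256(body).hexdigest()[:16]
        self.bodies.setdefault(message_id, body)
        self.pending.setdefault(message_id, set()).update(client_ids)
        return message_id

    def ack(self, message_id: str, client_id: str) -> None:
        """Call on PUBACK (or when the push is blocked); drop the body once nobody needs it."""
        waiting = self.pending.get(message_id)
        if waiting is None:
            return
        waiting.discard(client_id)
        if not waiting:
            del self.pending[message_id]
            del self.bodies[message_id]
```

Pushing the same body to two clients stores it once, and the entry disappears as soon as the last client acknowledges it rather than waiting for an expiration.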

Cluster Splitting Based on Business Attributes

To alleviate pressure, the msg cluster was divided into two separate clusters: one for message bodies and another for waiting queues. Two migration plans were evaluated:

Plan 1: Extract the waiting‑queue cluster only. Node changes are simpler, but there is a risk of data loss.

Plan 2: Extract the message‑body cluster. This requires dual reads during the transition, but preserves data integrity and allows dynamic scaling.

The team chose Plan 2, implementing a four‑rule read‑strategy (read‑old‑only, read‑new‑only, read‑old‑then‑new, read‑new‑then‑old) controlled via a configuration center. The rule switches based on old‑cluster hit‑rate thresholds.
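A minimal sketch of such a dual-read switch is shown below. The four rule names come from the article; the function names, the idea of passing clusters as objects with a `get` method, and the threshold values in `next_rule` are assumptions for illustration only.

```python
from enum import Enum


class ReadRule(Enum):
    OLD_ONLY = "read-old-only"
    NEW_ONLY = "read-new-only"
    OLD_THEN_NEW = "read-old-then-new"
    NEW_THEN_OLD = "read-new-then-old"


def read_message(key, rule, old_cluster, new_cluster):
    """Look up key according to rule; a cluster is any object with get(key) -> value or None."""
    order = {
        ReadRule.OLD_ONLY: (old_cluster,),
        ReadRule.NEW_ONLY: (new_cluster,),
        ReadRule.OLD_THEN_NEW: (old_cluster, new_cluster),
        ReadRule.NEW_THEN_OLD: (new_cluster, old_cluster),
    }[rule]
    for cluster in order:
        value = cluster.get(key)
        if value is not None:
            return value
    return None


def next_rule(old_hit_rate, high=0.5, low=0.01):
    """Choose a rule from the old cluster's hit rate; both thresholds are made up."""
    if old_hit_rate >= high:
        return ReadRule.OLD_THEN_NEW
    if old_hit_rate > low:
        return ReadRule.NEW_THEN_OLD
    return ReadRule.NEW_ONLY
```

In a real deployment the active rule would be read from the configuration center instead of being computed locally, so it can be changed without redeploying push nodes.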

After splitting, peak CPU load dropped from >95% to ~70% and average response time fell from 1.2 ms to 0.5 ms.

Hot‑Key Investigation and Fix

During the April spike, the key mii:0 became a hotspot. The low 12 bits of a Snowflake‑generated messageId hold the per‑millisecond sequence number, which was usually zero, so the index derived from the ID collapsed onto slot 0 for many messages. This created a massive mii:0 hash.

Fixes applied:

Randomize the initial sequence value of the Snowflake algorithm (0‑1023) to disperse IDs across slots.

Replace frequent HEXISTS checks with type‑based logic to avoid unnecessary hot‑key reads.

Result: the mii index became evenly distributed, and connection/memory spikes disappeared.
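The effect of the first fix can be reproduced with a small simulation. It assumes the common Snowflake layout (timestamp, 10-bit worker ID, 12-bit sequence) and assumes the index key is `mii:<messageId % 1024>`; the article does not spell out either detail, so treat both as illustrative.

```python
import random
from collections import Counter

SEQUENCE_BITS = 12   # low bits of the ID hold the sequence, per the article
BUCKETS = 1024       # assumed number of mii:<n> index buckets


def make_id(timestamp_ms: int, worker_id: int, sequence: int) -> int:
    """Common Snowflake layout: timestamp | 10-bit worker | 12-bit sequence."""
    return (timestamp_ms << 22) | (worker_id << SEQUENCE_BITS) | sequence


def index_key(message_id: int) -> str:
    return f"mii:{message_id % BUCKETS}"


def simulate(randomize: bool, n: int = 10_000, seed: int = 1) -> Counter:
    """Generate one ID per millisecond, so the sequence stays at its initial value."""
    rng = random.Random(seed)
    keys = Counter()
    for ms in range(n):
        start = rng.randrange(BUCKETS) if randomize else 0
        keys[index_key(make_id(ms, worker_id=3, sequence=start))] += 1
    return keys
```

With `simulate(False)` every ID lands in `mii:0`, because the sequence restarts at zero each millisecond. With `simulate(True)` the same number of IDs spreads across the buckets.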

Client Redis Concurrency Optimization

Client information was previously cached for 7 days on push nodes, but high‑frequency queries to the client Redis cluster caused heavy load. The solution introduced three cache layers:

Cache: Infrequently changing data (7‑day TTL).

Cache1: Frequently changing data such as online status.

Cache2: Encryption parameters, refreshed only on connection.

Additional mechanisms ensure cache consistency via broker feedback and client connect/disconnect events.

After the change, the Cache1 hit rate reached 52% and the Cache2 hit rate 30%. Concurrency on the client Redis cluster decreased by ~20%, reducing overall Redis load by about 15% during peak periods.
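One way to picture the layered cache on a push node is sketched below. The 7-day TTL comes from the article; the 60-second TTL for online status, the layer names, the loader callable, and the invalidation on connect are assumptions, and the real consistency mechanism (broker feedback plus connect/disconnect events) is only hinted at by `on_connect`.

```python
import time


class TTLCache:
    """Tiny local cache; ttl_seconds=None means entries never expire on their own."""

    def __init__(self, ttl_seconds=None, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.data = {}   # key -> (value, stored_at)

    def get(self, key):
        entry = self.data.get(key)
        if entry is None:
            return None
        value, stored_at = entry
        if self.ttl is not None and self.clock() - stored_at > self.ttl:
            del self.data[key]
            return None
        return value

    def put(self, key, value):
        self.data[key] = (value, self.clock())

    def invalidate(self, key):
        self.data.pop(key, None)


class ClientInfoCache:
    def __init__(self, load_from_redis, clock=time.monotonic):
        self.load = load_from_redis   # hypothetical callable: (layer, client_id) -> value
        self.layers = {
            "static": TTLCache(7 * 24 * 3600, clock),  # "Cache": 7-day TTL from the article
            "status": TTLCache(60, clock),             # "Cache1": the 60 s TTL is an assumption
            "crypto": TTLCache(None, clock),           # "Cache2": refreshed only on connect
        }

    def get(self, layer, client_id):
        cache = self.layers[layer]
        value = cache.get(client_id)
        if value is None:
            # Local miss: fall back to the client Redis cluster and remember the answer.
            value = self.load(layer, client_id)
            cache.put(client_id, value)
        return value

    def on_connect(self, client_id):
        # A connect event drops the fast-changing layers so they are reloaded on next use.
        self.layers["status"].invalidate(client_id)
        self.layers["crypto"].invalidate(client_id)
```

Separating the layers lets each kind of data keep its own freshness rule, so a status change does not force a reload of data that rarely changes.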

Key Takeaways

1) Design Redis keys with good randomness to avoid slot and hot‑key concentration; keep keys small.

2) Monitor packet size: large payloads (>1000 bytes) can sharply degrade throughput, so run realistic network and load tests.

Overall, the case demonstrates that Redis can sustain massive push workloads when its capacity, key design, clustering strategy, and client‑side caching are carefully engineered.
