Designing a Resilient Sequence Service: From Master‑Slave to Dynamic Routing
This article explains how the seqsvr service in WeChat evolved its disaster‑recovery architecture from a simple master‑slave model to a dynamic routing solution, detailing design principles, lease mechanisms, and operational optimizations that ensure monotonic UID sequences and high availability.
Disaster Design
This section describes the disaster-recovery architecture of seqsvr. Backend systems rarely have a single perfect solution; the same requirement can lead to very different designs in different environments. We therefore focus on the design thinking and trade-offs behind seqsvr's disaster-recovery design.
Two principles guide the design:
1. Keep the architecture simple.
2. Avoid strong dependencies on external modules.
Both principles stem from reliability concerns: the more complex a system is, the harder it is to keep reliable, so a simple core design is essential.
The core requirement of seqsvr is that each UID's sequence must increase without rollback. If at any moment only one AllocSvr serves a given UID, monotonic sequences are easy to guarantee.
Consequently, a single‑point service model is adopted. When an AllocSvr becomes unavailable, its UID range is switched to another server. An arbitration service monitors AllocSvr health, writes the mapping configuration to StoreSvr, and each AllocSvr periodically reads this configuration to decide which UID ranges to load.
Figure 5 shows two AllocSvr instances serving the same UID, causing a sequence rollback (client sees 101, 201, 102).
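The rollback in Figure 5 can be reproduced with a minimal sketch. The `NaiveAllocSvr` class and its starting counters are hypothetical and chosen only to match the 101, 201, 102 example; they are not seqsvr's actual allocator.

```python
class NaiveAllocSvr:
    """Hypothetical allocator that keeps its own independent counter per UID."""

    def __init__(self, start):
        self.start = start
        self.current = {}

    def next_seq(self, uid):
        # Each instance only knows its own last value for the UID.
        self.current[uid] = self.current.get(uid, self.start) + 1
        return self.current[uid]


def is_monotonic(seqs):
    """True if every sequence number is strictly larger than the previous one."""
    return all(a < b for a, b in zip(seqs, seqs[1:]))


# Two servers both believe they own UID 42, and the client alternates between them.
svr_a = NaiveAllocSvr(start=100)
svr_b = NaiveAllocSvr(start=200)
observed = [svr_a.next_seq(42), svr_b.next_seq(42), svr_a.next_seq(42)]
print(observed, is_monotonic(observed))
```

With a single owner per UID, the same check passes trivially, which is why the design insists on one serving AllocSvr at a time.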
Because only one AllocSvr may serve a UID at a time, a multi-master model is unsuitable. We therefore keep the single-point service model and add a lease mechanism so that an AllocSvr holding a stale configuration cannot keep serving. The lease has two rules:
1. Lease expiration: if an AllocSvr cannot read the configuration from StoreSvr for N seconds, it stops serving.
2. Lease activation: when an AllocSvr reads a new configuration, it immediately unloads the UID ranges it no longer owns; newly assigned ranges become active only after N seconds.
This ensures that a new AllocSvr only starts serving after the old one has gone offline. A brief period of unavailability may occur, but the backend retry mechanisms make it invisible to users.
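The two lease rules can be modeled in a few lines. This is a sketch under assumptions: the class name, the method names, and the injected `now` timestamps are hypothetical, clocks are assumed to be perfectly synchronized, and reloading sequence state from StoreSvr is left out.

```python
class LeasedAllocSvr:
    """Hypothetical model of one AllocSvr's lease state (not the real seqsvr code)."""

    def __init__(self, lease_seconds, now=0.0):
        self.lease_seconds = lease_seconds  # the "N seconds" from the text
        self.last_read = now                # time of the last successful config read
        self.active = set()                 # UID ranges currently being served
        self.pending = {}                   # UID range -> earliest activation time

    def on_config_read(self, assigned, now):
        """Apply a configuration that was just read from StoreSvr."""
        assigned = set(assigned)
        self.last_read = now
        # Lease activation, part 1: drop ranges we no longer own right away.
        self.active &= assigned
        self.pending = {r: t for r, t in self.pending.items() if r in assigned}
        # Lease activation, part 2: new ranges must wait N seconds.
        for r in assigned - self.active:
            self.pending.setdefault(r, now + self.lease_seconds)

    def can_serve(self, uid_range, now):
        # Lease expiration: no successful read for N seconds means stop serving.
        if now - self.last_read >= self.lease_seconds:
            return False
        activate_at = self.pending.get(uid_range)
        if activate_at is not None and now >= activate_at:
            del self.pending[uid_range]
            self.active.add(uid_range)
        return uid_range in self.active


# Scenario: "old" loses contact with StoreSvr after t=18; the arbiter
# reassigns range "A" to "new", which first reads the new config at t=20.
old, new = LeasedAllocSvr(10), LeasedAllocSvr(10)
old.on_config_read({"A"}, now=0)
old.on_config_read({"A"}, now=8)
old.can_serve("A", now=12)           # activates "A" on the old server
old.on_config_read({"A"}, now=18)    # last successful read before the partition
new.on_config_read({"A"}, now=20)    # "A" is pending until t=30
new.on_config_read({"A"}, now=29)    # periodic re-read keeps the lease alive
```

In this scenario the old server stops at t=28 and the new one starts at t=30, so the two never overlap; the gap in between is the brief unavailability mentioned above.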
Disaster 1.0 Architecture: Master‑Slave
The initial version used a primary‑plus‑cold‑standby model. The full UID space is divided into N sections; consecutive sections form a set, each set having one primary and one standby AllocSvr. The primary serves under normal conditions; upon failure, the arbitration service switches roles, making the standby the new primary.
Figure 8 illustrates the master‑slave disaster‑recovery layout.
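As a rough illustration of the section and set layout, the lookup below maps a UID to its section and set with integer division. The section size and the number of sections per set are hypothetical values; the article does not state the real ones.

```python
SECTION_SIZE = 100_000   # hypothetical number of UIDs per section
SECTIONS_PER_SET = 4     # hypothetical number of consecutive sections per set


def locate(uid):
    """Return (section index, set index) for a UID under the assumed sizes."""
    section = uid // SECTION_SIZE
    set_id = section // SECTIONS_PER_SET
    return section, set_id


print(locate(250_000))  # section 2 belongs to set 0
print(locate(450_000))  # section 4 belongs to set 1
```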
Design trade‑offs for this model include:
1. Simplicity enables rapid development.
2. Few machines, so redundancy is not a primary concern.
3. Client-side routing configuration is easy to update.
Because AllocSvr is not stateless, the client cannot know which server holds a specific UID range. With a master‑slave setup, the client can simply try one server; if it fails, it retries the other, incurring at most one extra request.
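The "try one, then the other" behavior might look like the sketch below. The `call` parameter stands in for a hypothetical RPC stub, and the use of `ConnectionError` for "this server is not serving the UID" is an assumption made for illustration.

```python
def fetch_seq(uid, servers, call):
    """Try each AllocSvr of the set in order; `call` is a hypothetical RPC stub."""
    last_error = None
    for server in servers:  # in a master-slave set: at most two entries
        try:
            return call(server, uid)
        except ConnectionError as exc:
            last_error = exc  # remember the failure and try the other server
    raise RuntimeError(f"no AllocSvr in the set served uid {uid}") from last_error


# Fake RPC for demonstration: only "alloc-a" is currently the primary.
attempts = []


def fake_call(server, uid):
    attempts.append(server)
    if server != "alloc-a":
        raise ConnectionError(f"{server} is not serving uid {uid}")
    return 1001


# The client guesses wrong first, then succeeds on the second server.
result = fetch_seq(7, ["alloc-b", "alloc-a"], fake_call)
```

The cost of a wrong guess is bounded by the size of the set, which is why two servers per set keep the penalty to one extra request.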
Drawbacks of Master‑Slave
The master‑slave design suffers from two major issues:
1. Scaling (adding or removing machines) is cumbersome.
2. When both primary and standby of a set are overloaded, other sets cannot help.
Configuration changes require coordinated updates to both clients and AllocSvr, often involving manual steps and iptables redirection.
Disaster 2.0 Architecture: Embedded Routing Table
To eliminate routing inconsistency between clients and AllocSvr, we embed the current routing table in the sequence response packet, so the client stays synchronized with AllocSvr without any extra requests or services.
All modules share a unified routing table that maps UID ranges to AllocSvr instances. The arbitration service generates this table, writes it to StoreSvr, and AllocSvr reads it as a lease, then attaches it to the response.
Figure 9 shows the dynamic segment migration in the 2.0 architecture.
Embedding the routing table enables flexible disaster strategies: any machine can act as a backup, and failed UID ranges are evenly migrated to available AllocSvr instances. Load‑balancing based on AllocSvr load further improves machine utilization.
Operationally, the new design simplifies maintenance: updating the routing table is sufficient to bring machines online or offline, eliminating complex configuration synchronization.
Figure 10 illustrates segment migration after a machine failure.
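The even migration of a failed machine's segments could be sketched as follows. The routing table is modeled as a plain dict from section index to server name, and "load" is approximated by the number of sections a server owns; both are simplifying assumptions, since the article does not describe the arbitration service's actual algorithm.

```python
def migrate(routing, failed, alive):
    """Return a new routing table with the failed server's sections reassigned.

    routing: dict mapping section index -> server name (hypothetical model).
    Each orphaned section goes to the alive server that owns the fewest sections.
    """
    if not alive:
        raise ValueError("no AllocSvr left to take over the sections")
    load = {server: 0 for server in alive}
    for server in routing.values():
        if server in load:
            load[server] += 1
    new_routing = dict(routing)
    orphaned = sorted(sec for sec, srv in routing.items() if srv == failed)
    for section in orphaned:
        # Pick the least-loaded server; ties are broken by name for determinism.
        target = min(alive, key=lambda s: (load[s], s))
        new_routing[section] = target
        load[target] += 1
    return new_routing


routing_v1 = {0: "a", 1: "a", 2: "b", 3: "b", 4: "c", 5: "c"}
routing_v2 = migrate(routing_v1, failed="a", alive=["b", "c"])
print(routing_v2)
```

A production arbiter would likely weigh real request load rather than section counts, but the shape of the decision is the same: only the routing table changes.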
Routing Synchronization Optimization
Embedding the routing table raises a chicken-and-egg problem: without a table, how does the client know which AllocSvr to query? In addition, sequence requests are high-frequency, so the table must not be sent on every response or the bandwidth cost would be significant.
The solution uses a client‑side in‑memory cache of the routing table and its version number. The request flow is:
1. The client selects an AllocSvr based on the cached routing table; if no table is cached, it picks a random AllocSvr.
2. The request includes the version of the client's local routing table.
3. AllocSvr processes the sequence request and, if the client's version is outdated, attaches the latest routing table to the response.
4. The client updates its local table when a new version is received and, if the request was not served, retries from step 1.
This approach requires only a few retries when the local table is stale, ensuring correct routing with minimal overhead.
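The request flow above can be simulated end to end. Everything in this sketch is hypothetical: the class names, the section size, the convention that version 0 means "no table cached", and the assumption that every AllocSvr already holds the latest table.

```python
import random
from dataclasses import dataclass

SECTION_SIZE = 1000  # hypothetical number of UIDs per section


@dataclass(frozen=True)
class RoutingTable:
    version: int   # starts at 1; a client with no table reports version 0
    owners: dict   # section index -> AllocSvr name


class AllocSvr:
    def __init__(self, name, table):
        self.name, self.table, self.seq = name, table, {}

    def handle(self, uid, client_version):
        """Return (sequence or None, newer table or None)."""
        newer = self.table if client_version < self.table.version else None
        if self.table.owners[uid // SECTION_SIZE] != self.name:
            return None, newer          # not the owner: only send the table
        self.seq[uid] = self.seq.get(uid, 0) + 1
        return self.seq[uid], newer


class Client:
    def __init__(self, servers, rng):
        self.servers, self.rng, self.table = servers, rng, None

    def next_seq(self, uid, max_attempts=3):
        for _ in range(max_attempts):
            if self.table is None:      # step 1: no table, pick at random
                name = self.rng.choice(sorted(self.servers))
            else:
                name = self.table.owners[uid // SECTION_SIZE]
            version = self.table.version if self.table else 0   # step 2
            seq, newer = self.servers[name].handle(uid, version)  # step 3
            if newer is not None:       # step 4: refresh the local cache
                self.table = newer
            if seq is not None:
                return seq
        raise RuntimeError("routing did not converge")


table = RoutingTable(version=1, owners={0: "a", 1: "b"})
servers = {name: AllocSvr(name, table) for name in ("a", "b")}
client = Client(servers, random.Random(0))
first = client.next_seq(1500)    # may need one retry if the random pick is wrong
second = client.next_seq(1500)   # routed directly from the cached table
```

Because the table is attached only when the client's version is behind, steady-state responses carry no routing payload at all.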
Summary
We have covered the evolution of seqsvr’s architecture—from a simple master‑slave model to a dynamic, embedded‑routing solution—demonstrating how a clean, reliable design can support WeChat’s rapid growth while simplifying operations and improving resource utilization.
