
Design Considerations for a High‑Scale Messaging System: Capacity Estimation, Consistency Guarantees, and Avalanche Prevention

Designing Quanmin K‑Song's high‑scale messaging system requires careful capacity estimation of throughput, storage, and network traffic; robust consistency via unique transaction IDs and operation logs; and avalanche prevention through selective retries, scaling, and priority‑based throttling to maintain reliability under load.

Tencent Music Tech Team

The messaging feature of Quanmin K‑Song includes two types: one aggregates messages related to user works—comments, gifts, etc.—vertically along a timeline for each user. The other is horizontal communication between users, providing a conversation list and detail view similar to QQ or WeChat. This article, based on this feature, shares three key considerations when designing the backend: capacity estimation, consistency guarantees, and avalanche prevention.

1. Capacity Estimation

Plan for the load you expect and tailor the solution to it; good preparation wins the battle before it starts. Capacity estimation should be performed at design time, before the service goes live, and covers three areas:

1. Throughput Estimation

1) Response Time (RT)

The time taken to process a single request and return its response.

2) Concurrency

The number of requests the system processes simultaneously, which can be understood as the number of processes in a synchronous single‑threaded scenario.

3) Queries Per Second (QPS)

QPS = concurrency / RT. We must ensure that peak QPS stays below the system's processing capacity; this figure determines how many machines and total processes are needed.
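The formula above can be turned into a back‑of‑envelope calculation. The concrete numbers below (concurrency, RT, peak QPS, headroom factor) are hypothetical examples for illustration, not real Quanmin K‑Song figures:

```python
import math

def qps(concurrency: int, rt_seconds: float) -> float:
    """QPS = concurrency / RT."""
    return concurrency / rt_seconds

def machines_needed(peak_qps: float, per_machine_qps: float, headroom: float = 1.5) -> int:
    """Machines required so that capacity exceeds peak QPS with some headroom."""
    return math.ceil(peak_qps * headroom / per_machine_qps)

# Example: 200 concurrent requests per machine at 50 ms average RT
per_machine = qps(concurrency=200, rt_seconds=0.05)                   # 4000 QPS per machine
print(machines_needed(peak_qps=30000, per_machine_qps=per_machine))   # 12 machines
```

The headroom factor leaves slack for traffic spikes and degraded machines; 1.5 here is an assumed value, and real deployments would pick it from load‑test data.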

2. Storage Estimation

How many records can each user store at most? What is the maximum size per record? What is the average size? How many records per user on average?

3. Network Traffic Estimation

How many RPC calls does each request involve? What are the outbound and inbound traffic volumes? Multiplying by QPS yields total traffic. Network interface cards have throughput limits, which also constrain the number of machines needed.
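As a worked example of the multiplication above, with hypothetical per‑request figures (3 RPC calls, 2 KB out and 4 KB in per call, 30,000 peak QPS; none of these are measured values):

```python
# Hypothetical per-request figures for a network traffic estimate
rpc_per_request = 3                          # RPC calls triggered by one user request
out_bytes, in_bytes = 2 * 1024, 4 * 1024     # payload per RPC call

peak_qps = 30000
total_out = peak_qps * rpc_per_request * out_bytes   # outbound bytes/second
total_in = peak_qps * rpc_per_request * in_bytes     # inbound bytes/second
print(total_out / 1e6, "MB/s out,", total_in / 1e6, "MB/s in")
```

Comparing these totals against per‑NIC bandwidth gives a second lower bound on machine count, alongside the QPS‑based one.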

During design, QPS must be estimated, because different request scales call for different solutions. If the request volume is very large, consider adding an asynchronous queue to handle requests asynchronously. For example, the private message sending process has three major steps: Step 1 writes to the sender's storage, Step 2 writes to the receiver's storage, and Step 3 notifies the receiver. Steps 2 and 3 can be processed asynchronously via a queue, reducing latency on the send path.
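The three‑step flow can be sketched with an in‑process queue and a worker thread. This is a minimal illustration only; the storage and notification calls are stand‑in stubs (lists), not the real backend API, and a production system would use a durable message queue rather than an in‑memory one:

```python
import queue
import threading

task_queue = queue.Queue()
sender_store, receiver_store, notified = [], [], []   # stubs for real storage / push

def send_message(msg):
    sender_store.append(msg)   # Step 1: write sender's storage synchronously
    task_queue.put(msg)        # Steps 2 and 3 are deferred to the async worker

def worker():
    while True:
        msg = task_queue.get()
        receiver_store.append(msg)   # Step 2: write receiver's storage
        notified.append(msg)         # Step 3: notify the receiver
        task_queue.task_done()

threading.Thread(target=worker, daemon=True).start()

send_message("hello")
task_queue.join()   # wait until the async steps complete
```

The caller returns as soon as Step 1 and the enqueue succeed, so its latency no longer includes the receiver‑side work.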

Another important design aspect is storage. Different storage solutions have varying read/write performance and unit cost, which must be matched to user scenarios. For instance, in‑memory KV stores are expensive but offer excellent performance, while SSD‑based storage has lower performance but lower cost and suitable features. There is no universally best component, only the most appropriate one. In our messaging system, we use KV storage for private message lists, and SSD‑based list storage (TLIST) for work messages and private message details.

2. Consistency Guarantees

To fulfill a user request, the backend typically makes multiple remote calls, which makes transaction consistency a challenge. Take a transfer as an example: A transfers 100 to B; both accounts start with 1000, with version numbers a1 and b1. The steps are: deduct 100 from A, then add 100 to B. If the deduction succeeds but the addition times out, we cannot tell whether B received the money. To handle this, assign a unique transaction ID (Tid) before the operation. A (version a1) subtracts 100, then writes a log entry (Tid, s1, a1, -100, a2). If the log write fails, the versioned data ensures a retry will not double‑subtract. Similarly, B (version b1) adds 100 and writes a log entry (Tid, s2, b1, +100, b2); if B's step times out, it too is retried. With operation logs, we can flexibly choose rollback or retry, and decide whether to retry immediately or later. In the messaging system, each message is likewise assigned a unique ID, ensuring traceability and retryability.
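The version check that makes retries safe can be sketched as follows. The data layout, function name, and in‑memory dicts are illustrative assumptions, not the production schema; the point is that an update is only applied when the account's version still matches the one the operation was issued against:

```python
import uuid

accounts = {"A": {"balance": 1000, "version": 1},
            "B": {"balance": 1000, "version": 1}}
op_log = []   # entries: (tid, step, old_version, delta, new_version)

def apply_delta(tid, step, acct, expected_version, delta):
    a = accounts[acct]
    if a["version"] != expected_version:
        return False   # version has moved on: already applied, retry is a no-op
    a["balance"] += delta
    a["version"] += 1
    op_log.append((tid, step, expected_version, delta, a["version"]))
    return True

tid = uuid.uuid4().hex
apply_delta(tid, "s1", "A", 1, -100)   # deduct from A
apply_delta(tid, "s1", "A", 1, -100)   # a retry does not double-subtract
apply_delta(tid, "s2", "B", 1, +100)   # add to B
```

After a timeout, the caller replays the step with the same Tid and expected version; the log records exactly which steps took effect, supporting either rollback or deferred retry.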

3. Preventing Avalanches

Retries can increase success rates, but not every call is suitable for retry. Retries raise the QPS hitting the callee service; if the callee lacks capacity, they can trigger an avalanche: requests queue up, total traffic grows as normal and retry requests pile up together, and the caller times out and generates still more retries. The root cause is insufficient resources on the callee side, and the remedy is twofold: scaling and throttling. Scaling expands the callee's capacity; throttling classifies requests by priority, estimates processing capability, serves high‑priority requests first, and promptly rejects low‑priority ones that cannot be handled.
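The throttling half of that remedy can be sketched as an admission step. The priority values, capacity figure, and request labels below are hypothetical; lower numbers mean higher priority, and anything beyond capacity is rejected immediately rather than queued:

```python
def admit(requests, capacity):
    """requests: list of (priority, payload); lower number = higher priority."""
    accepted, rejected = [], []
    for prio, payload in sorted(requests, key=lambda r: r[0]):
        if len(accepted) < capacity:
            accepted.append(payload)
        else:
            rejected.append(payload)   # fail fast instead of letting queues grow
    return accepted, rejected

acc, rej = admit([(2, "retry"), (1, "send"), (1, "read"), (3, "prefetch")],
                 capacity=2)
```

Fast rejection is what breaks the avalanche loop: a rejected low‑priority request returns immediately instead of sitting in a queue, timing out, and spawning another retry.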

distributed systems · backend design · capacity planning · consistency · avalanche prevention
Written by Tencent Music Tech Team

Public account of Tencent Music's development team, focusing on technology sharing and communication.