How to Ensure Reliability, Ordering, and Security in Billion‑User IM Systems
This article explores the key challenges of building a large‑scale instant‑messaging service—including message reliability, ordering, read‑sync, data security, avalanche effects, and weak‑network handling—and presents practical architectural and algorithmic solutions for each problem.
1. Introduction
This article builds on the outlines of Deng Yunzhe’s "Large‑Scale Concurrent IM Service Architecture Design" and "IM Weak‑Network Scenario Optimization" (see references at the end). It focuses on several crucial topics of a billion‑user IM architecture such as message reliability, ordering, data security, and weak‑network issues.
2. Series Articles
The content is split into two parts. This is the second part, which looks more closely at the specific, frequently discussed problems of the IM architecture.
3. Message Reliability Issues
Reliability is a core metric for any IM system; users must trust that their messages will not be lost. From a product perspective, a lack of reliability leads to rapid user churn. The reliability solution consists of two logical parts:
Uplink message reliability
Downlink message reliability
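Uplink reliability: The client assigns a local ID to each message and waits for a server acknowledgment (PIMSendAck). If the ACK does not arrive within a timeout, the SDK resends the message under the same local ID, so the server can recognize the retry and avoid storing a duplicate.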
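The uplink flow can be sketched as follows. This is a minimal illustration, not the SDK's actual code: the names UplinkSender, ToyServer, and transport are hypothetical, and an ACK timeout is modeled by the transport returning None.

```python
import itertools
from typing import Callable, Dict, Optional


class UplinkSender:
    """Client side: tag each message with a local ID and retry until it is ACKed."""

    def __init__(self, transport: Callable[[dict], Optional[dict]], max_attempts: int = 3):
        # transport(packet) returns an ACK dict, or None to model an ACK timeout.
        self._transport = transport
        self._max_attempts = max_attempts
        self._ids = itertools.count(1)

    def send(self, body: str) -> bool:
        local_id = next(self._ids)  # the same ID is reused on every retry
        packet = {"local_id": local_id, "body": body}
        for _ in range(self._max_attempts):
            ack = self._transport(packet)
            if ack is not None and ack.get("local_id") == local_id:
                return True  # PIMSendAck received
        return False  # retries exhausted; the caller reports the failure


class ToyServer:
    """Hypothetical server that deduplicates retried messages by local ID."""

    def __init__(self) -> None:
        self.stored: Dict[int, str] = {}

    def handle(self, packet: dict) -> dict:
        # setdefault keeps the first copy, so a retry does not create a duplicate.
        self.stored.setdefault(packet["local_id"], packet["body"])
        return {"type": "PIMSendAck", "local_id": packet["local_id"]}
```

A real server would key the deduplication on the sender as well as the local ID; the single-sender version above keeps the sketch short.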
Downlink reliability: When the server pushes a message to multiple recipients, it must cache the push request. The message is written to each recipient’s offline‑message list, and an entry is removed only after that recipient’s client acknowledges receipt. The same mechanism therefore covers both real‑time and offline delivery.
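A minimal sketch of the downlink side, assuming an in-memory store (a production system would use a persistent one). DownlinkStore and its method names are hypothetical and only illustrate the write-first, remove-on-ACK rule.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


class DownlinkStore:
    """Write every pushed message to each recipient's offline list; remove it on ACK."""

    def __init__(self) -> None:
        # recipient -> {msg_id: body}; dicts preserve insertion order.
        self._offline: Dict[str, Dict[int, str]] = defaultdict(dict)

    def push(self, msg_id: int, body: str, recipients: Iterable[str], online: Set[str]) -> List[str]:
        """Store the message for everyone, then return who should get a real-time push."""
        recipients = list(recipients)
        for user in recipients:
            self._offline[user][msg_id] = body
        return [user for user in recipients if user in online]

    def ack(self, user: str, msg_id: int) -> None:
        # Only an explicit client ACK removes the entry; repeated ACKs are harmless.
        self._offline[user].pop(msg_id, None)

    def pending(self, user: str) -> List[Tuple[int, str]]:
        """Messages still owed to the user, e.g. to deliver at the next login."""
        return list(self._offline[user].items())
```

Because the entry is written before the push is attempted, a recipient who drops offline mid-push still finds the message in the pending list later.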
Further reading:
"IM Message Delivery Guarantee (Part 1): Reliable Real‑Time Delivery"
"IM Message Delivery Guarantee (Part 2): Reliable Offline Delivery"
4. Message Ordering Issues
Distributed IM systems face ordering challenges because client and server clocks may diverge, leading to out‑of‑order delivery. The proposed solution includes:
Server‑time alignment (handled by operations)
Client‑side time calibration against the server
Including both local and server timestamps in each message and applying an interpolation sort: messages from the same sender are ordered by local time, while messages from different senders are ordered by server time.
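One possible reading of the interpolation sort is sketched below; the original outline does not give an algorithm, so this is an assumption. Messages are first ordered by server time, and then each sender's messages are rearranged by local time within the positions that sender already occupies. Msg and interpolation_sort are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Msg:
    sender: str
    local_ts: int   # timestamp assigned by the (calibrated) client
    server_ts: int  # timestamp assigned when the server received the message
    body: str


def interpolation_sort(messages: List[Msg]) -> List[Msg]:
    # Pass 1: order across senders by server time (sorted() is stable).
    by_server = sorted(messages, key=lambda m: m.server_ts)

    # Pass 2: record which positions each sender occupies in that order.
    slots: Dict[str, List[int]] = {}
    for index, msg in enumerate(by_server):
        slots.setdefault(msg.sender, []).append(index)

    # Pass 3: refill each sender's positions with that sender's messages
    # in local-time order, leaving other senders' positions untouched.
    result = list(by_server)
    for indexes in slots.values():
        own = sorted((by_server[i] for i in indexes), key=lambda m: m.local_ts)
        for i, msg in zip(indexes, own):
            result[i] = msg
    return result
```

With this approach a message that reached the server late because of a retry still appears in the order its sender wrote it, while the interleaving between different senders follows server time.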
Additional resources on message‑ID ordering algorithms are also suggested.
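The client-side calibration step listed above can be sketched with the usual round-trip midpoint estimate. This assumes the network delay is roughly symmetric, which the original text does not state; estimate_offset and CalibratedClock are hypothetical names, and fetch_server_time stands in for whatever request returns the server's clock.

```python
from typing import Callable


def estimate_offset(t_send: float, server_time: float, t_recv: float) -> float:
    """Offset to add to the local clock, assuming symmetric network delay."""
    midpoint = (t_send + t_recv) / 2  # local time at which the server answered
    return server_time - midpoint


class CalibratedClock:
    """Local clock corrected by the most recently measured server offset."""

    def __init__(self, local_now: Callable[[], float]) -> None:
        self._local_now = local_now
        self._offset = 0.0

    def calibrate(self, fetch_server_time: Callable[[], float]) -> None:
        t_send = self._local_now()
        server_time = fetch_server_time()
        t_recv = self._local_now()
        self._offset = estimate_offset(t_send, server_time, t_recv)

    def now(self) -> float:
        return self._local_now() + self._offset
```

The local timestamp placed in each message would then come from now() rather than from the raw device clock.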
5. Message Read‑Sync Issues
Read‑receipt functionality becomes complex when a user is logged in on multiple devices. The synchronization logic relies on two mechanisms:
Maintain, for each session, the timestamp of the last message read.
When the user reads a session on one device, broadcast a PIMSyncRead message to that user’s other devices.
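The two mechanisms can be sketched together as follows. Device and ReadSyncHub are hypothetical names, and a direct method call stands in for delivering PIMSyncRead over the network. The forward-only update is an assumption added so that late or duplicate sync messages do no harm.

```python
from typing import Dict, List


class Device:
    def __init__(self, name: str) -> None:
        self.name = name
        self.last_read: Dict[str, int] = {}  # session_id -> timestamp of last read message

    def apply_sync(self, session_id: str, read_ts: int) -> None:
        # Only move the marker forward; stale sync messages are ignored.
        if read_ts > self.last_read.get(session_id, 0):
            self.last_read[session_id] = read_ts

    def unread(self, session_id: str, message_timestamps: List[int]) -> List[int]:
        cutoff = self.last_read.get(session_id, 0)
        return [ts for ts in message_timestamps if ts > cutoff]


class ReadSyncHub:
    """Server side: fan a read marker out to the user's other devices."""

    def __init__(self, devices: List[Device]) -> None:
        self.devices = devices

    def mark_read(self, origin: Device, session_id: str, read_ts: int) -> None:
        origin.apply_sync(session_id, read_ts)
        for device in self.devices:
            if device is not origin:
                device.apply_sync(session_id, read_ts)  # stands in for PIMSyncRead
```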
6. Data Security Issues
6.1 Basics
IM security involves both communication security (socket long‑connections and HTTP short‑connections) and content security. Balancing security, performance, traffic, and user experience is challenging.
6.2 Communication Security
Typical IM services consist of:
Socket long‑connection services (TCP/UDP)
HTTP short‑connection services (REST APIs)
Recommended reading includes articles on TLS 1.3‑based MMTLS, combination encryption algorithms, and HTTPS fundamentals.
6.3 Content Security
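Cryptography provides encryption, authentication, and identification. End‑to‑end encryption (E2EE) is essential for protecting message content, as exemplified by Telegram’s Secret Chats (Telegram’s default cloud chats are encrypted between client and server rather than end to end).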
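Of the three functions, message authentication is the easiest to illustrate with Python's standard library alone. The sketch below uses HMAC-SHA256 and assumes both parties already share a secret key; how that key is exchanged is out of scope here. It shows authentication only and is not an E2EE scheme. The function names sign and verify are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, message: bytes) -> bytes:
    """Return an HMAC-SHA256 tag that only holders of the key can produce."""
    return hmac.new(key, message, hashlib.sha256).digest()


def verify(key: bytes, message: bytes, tag: bytes) -> bool:
    """Check the tag in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(sign(key, message), tag)
```

A receiver that gets a message and its tag calls verify before trusting the content; any change to the message or the tag makes the check fail.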
Further reading:
"Mobile End‑to‑End Encryption (E2EE) Technical Details"
"Real‑Time Audio/Video Chat E2EE Working Principles"
7. Avalanche Effect Issue
In a distributed IM architecture, a failure in one data center can cause a cascade of overloads in other centers. Mitigation strategies include server‑side rate limiting and client‑side reconnection back‑off or load‑balancer‑assisted server selection.
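Both mitigations can be sketched briefly. The text names only "rate limiting" and "back-off", so the specific techniques here are assumptions: a token bucket for the server side and exponential back-off with full jitter for client reconnection. TokenBucket and reconnect_delay are hypothetical names, and time is passed in explicitly so the behavior is easy to check.

```python
import random


class TokenBucket:
    """Server-side limiter: admit a request only if a token is available."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0) -> None:
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # shed the request instead of overloading the node


def reconnect_delay(attempt: int, base: float = 1.0, cap: float = 60.0, rng=random) -> float:
    """Client-side wait before reconnect attempt number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`; the random draw spreads
    clients out so they do not all reconnect at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)
```

The jitter matters for the avalanche case in particular: after a data-center failure, many clients lose their connections at once, and a fixed retry interval would send them all back in synchronized waves.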
8. Weak‑Network Issues
8.1 Causes of Weak Networks
Mobile IM frequently encounters weak‑network scenarios (elevators, trains, cars, subways) due to signal fluctuation, interference, uneven base‑station distribution, and high mobility.
8.2 IM Handling of Weak Networks
The core handling consists of:
Automatic message retransmission
Offline message reception
Resend ordering
Offline command processing
8.3 Automatic Message Retransmission
Clients should maintain a state machine for each message (initial, sending, failed, timeout) and automatically retry a few times before notifying the user of failure.
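A minimal sketch of such a per-message state machine follows. The four states come from the text; the SENT state, the transition table, and the max_attempts parameter are assumptions added to make the example complete, and OutgoingMessage is a hypothetical name.

```python
from enum import Enum


class State(Enum):
    INITIAL = "initial"
    SENDING = "sending"
    TIMEOUT = "timeout"
    FAILED = "failed"
    SENT = "sent"  # assumption: a success state, not named in the text


# Legal transitions; anything else indicates a bug in the caller.
ALLOWED = {
    State.INITIAL: {State.SENDING},
    State.SENDING: {State.SENT, State.TIMEOUT},
    State.TIMEOUT: {State.SENDING, State.FAILED},
    State.FAILED: {State.SENDING},  # manual resend by the user
    State.SENT: set(),
}


class OutgoingMessage:
    def __init__(self, body: str, max_attempts: int = 3) -> None:
        self.body = body
        self.state = State.INITIAL
        self.attempts = 0
        self.max_attempts = max_attempts

    def _move(self, new_state: State) -> None:
        if new_state not in ALLOWED[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        self.state = new_state

    def start(self) -> None:
        if self.state is State.FAILED:
            self.attempts = 0  # a manual resend gets a fresh retry budget
        self._move(State.SENDING)
        self.attempts += 1

    def on_ack(self) -> None:
        self._move(State.SENT)

    def on_timeout(self) -> None:
        self._move(State.TIMEOUT)
        if self.attempts < self.max_attempts:
            self.start()  # automatic retry, invisible to the user
        else:
            self._move(State.FAILED)  # now show the failure in the UI
```

The UI only needs to react to FAILED and SENT; the intermediate timeouts and retries stay inside the state machine.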
8.4 Offline Message Reception
Detecting offline status can be done via long‑connection heartbeat loss, repeated request failures, or device network‑status APIs. Once connectivity is restored, the client pulls missed messages from the server’s offline queue.
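The heartbeat-based detection and the catch-up pull can be sketched as below. HeartbeatMonitor and pull_missed are hypothetical names. The sketch also assumes that message IDs increase monotonically, so "everything after the last ID I saw" is well defined; the text does not say how the offline queue is indexed.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Treat the connection as offline after several heartbeat intervals of silence."""

    def __init__(self, interval: float, max_missed: int = 3) -> None:
        self.interval = interval      # seconds between heartbeats
        self.max_missed = max_missed  # tolerated consecutive misses
        self.last_pong = 0.0

    def on_pong(self, now: float) -> None:
        self.last_pong = now

    def is_offline(self, now: float) -> bool:
        return now - self.last_pong > self.interval * self.max_missed


def pull_missed(offline_queue: List[Dict], last_seen_id: int) -> List[Dict]:
    """Return queued messages newer than the last one the client saw, in ID order."""
    missed = [m for m in offline_queue if m["id"] > last_seen_id]
    return sorted(missed, key=lambda m: m["id"])
```

On reconnect the client would call pull_missed with the highest ID it has stored locally, then acknowledge the returned messages so the server can clear them from the offline list.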
8.5 Resend Message Ordering
When a message is retried after a network glitch, the final receive order should follow the interpolation sort described in section 4 (local time for the same sender, server time for different senders).
8.6 Offline Command Processing
Operations performed while offline (e.g., deleting a contact) must be queued and synchronized with the server once the network recovers.
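One way to sketch this is a first-in, first-out command queue that is flushed on reconnect. OfflineCommandQueue is a hypothetical name, and send is an assumed callback that returns True once the server has confirmed a command.

```python
from collections import deque
from typing import Callable, Deque


class OfflineCommandQueue:
    """Queue user operations while offline and replay them in order on reconnect."""

    def __init__(self, send: Callable[[dict], bool]) -> None:
        self._send = send  # returns True when the server confirmed the command
        self._pending: Deque[dict] = deque()
        self.online = True

    def submit(self, command: dict) -> None:
        # Send directly only if nothing older is waiting, to preserve order.
        if self.online and not self._pending and self._send(command):
            return
        self._pending.append(command)

    def on_reconnect(self) -> int:
        """Flush queued commands; return how many are still pending."""
        self.online = True
        while self._pending:
            if not self._send(self._pending[0]):
                break  # still failing; keep the command for the next attempt
            self._pending.popleft()
        return len(self._pending)
```

Replaying in submission order matters when commands depend on one another, for example deleting a contact and then clearing the session with that contact.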
8.7 Summary
Weak‑network handling for IM is relatively straightforward: automatic retries combined with proper message state tracking solve most problems. More complex scenarios, such as video conferencing under high packet loss, require additional techniques.
9. Article Summary
The two‑part series on large‑scale IM architecture covers overall design, service splitting, and deep dives into reliability, ordering, read‑sync, security, avalanche effects, and weak‑network optimization. Beginners are encouraged to read the curated “from zero to IM” guide for a systematic learning path.
10. References
Large‑Scale Concurrent IM Service Architecture Design
IM Weak‑Network Scenario Optimization
Zero‑Basis IM Development Intro (3): What Is IM Reliability?
IM Message Delivery Guarantee (Part 1): Reliable Real‑Time Delivery
IM Message Delivery Guarantee (Part 2): Reliable Offline Delivery
Instant Messaging Security (Part 2): Combined Encryption Algorithms in IM
WeChat Next‑Gen Communication Security: MMTLS Based on TLS 1.3
Zero‑Basis Mobile IM Development Guide
Architecture & Thinking