Improving Xianyu Messaging Reliability: Architecture, Issues, and Solutions
The article details how Xianyu's 2020 messaging failures (lost messages, wrong avatars, and order status errors) were traced to duplicate IDs, push‑logic mismatches, and client bugs. The team resolved them by introducing global UUIDs, ACK‑based retries, a hierarchical conversation model, hybrid storage with caching, and real‑time monitoring, raising the delivery rate above 99.9%.
In early 2020 the Xianyu messaging service suffered from lost messages, incorrect avatars, and wrong order statuses. The team investigated the existing architecture and defined stability metrics such as send success rate, delivery rate, and client persistence rate.
The delivery chain consists of three steps: sender → server storage → receiver. Because mobile networks are unstable, an acknowledgment (ACK) mechanism is required to guarantee delivery. The article describes a six‑message request‑ACK model that ensures reliable transmission.
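As a rough illustration of why an ACK matters on a single hop of this chain, the sketch below resends a message until an acknowledgment arrives or a retry budget runs out. It is not Xianyu's six‑message protocol: `LossyChannel`, `send_with_ack`, and the retry limit are hypothetical names and values chosen for the example, and the loss pattern is scripted so the run is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class LossyChannel:
    """Hypothetical one-hop link that drops sends according to a script."""
    drop_pattern: list                      # True means "this attempt is lost"
    delivered: list = field(default_factory=list)
    attempts: int = 0

    def send(self, msg_id: str, body: str) -> bool:
        """Return True if an ACK came back, False if the attempt timed out."""
        lost = self.attempts < len(self.drop_pattern) and self.drop_pattern[self.attempts]
        self.attempts += 1
        if lost:
            return False                    # no ACK arrives, so the sender must retry
        self.delivered.append((msg_id, body))
        return True


def send_with_ack(channel: LossyChannel, msg_id: str, body: str, max_retries: int = 3) -> bool:
    """Resend until acknowledged; give up after max_retries extra attempts."""
    for _ in range(1 + max_retries):
        if channel.send(msg_id, body):
            return True
    return False


# The first two attempts are lost, the third one is acknowledged.
channel = LossyChannel(drop_pattern=[True, True, False])
ok = send_with_ack(channel, "m-1", "hello")
print(ok, channel.attempts, channel.delivered)
```

In a real system the ACK itself can be lost after the message has arrived, so a retry may deliver the same message twice; that is one reason a unique message ID is needed for deduplication on the receiving side.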
Key problems identified include:
Message uniqueness: the old scheme used SessionID, SeqID, and Version, which caused duplicate or missing messages when multiple clients sent concurrently.
Push logic: the server pushed only when it believed the client was online, so pushes were missed whenever the server's view of the connection state was out of sync with the client.
Client issues: multithreading bugs, inaccurate unread counters, and improper message merging.
Solutions implemented:
Engine upgrade – generate a global UUID for each message (e.g., a1a3ffa118834033ac7a8b8353b7c6d9) and use it for deduplication and ordering.
Retry and reconnection – add client‑side ACCS heartbeat detection and server‑side online checks with timeout handling.
Data synchronization – pull‑push queue isolation to avoid duplicate network requests and ensure consistent state.
Client model redesign – build a hierarchical conversation tree (virtual nodes, conversation nodes, folder nodes) to manage unread counts and message summaries.
Server storage model – adopt a hybrid of read‑scatter and write‑scatter (fan‑out on read and fan‑out on write), with a size‑limited cache (at most 256 messages) that is consulted before falling back to database reads.
Quality monitoring – full‑link tracing using Flink‑processed logs, a checksum‑based reconciliation system, and real‑time metrics (delivery rate now >99.9%).
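Two of the solutions above, deduplication by a global message ID and a size‑limited cache in front of the database, can be sketched together. This is a minimal illustration under assumptions, not the team's implementation: `MessageStore`, `new_message_id`, and the in‑memory dictionary standing in for the database are hypothetical, and only the 256‑message limit comes from the article.

```python
import uuid
from collections import OrderedDict

CACHE_LIMIT = 256  # cache size taken from the article; everything else is assumed


class MessageStore:
    """Hypothetical store: a bounded recent-message cache in front of a 'database'."""

    def __init__(self, cache_limit: int = CACHE_LIMIT):
        self.cache_limit = cache_limit
        self.cache = OrderedDict()   # message ID -> body, oldest first
        self.database = {}           # stands in for persistent storage
        self.db_reads = 0

    def save(self, msg_id: str, body: str) -> bool:
        """Store a message once; return False for a duplicate ID (e.g. a retry)."""
        if msg_id in self.database:
            return False
        self.database[msg_id] = body
        self.cache[msg_id] = body
        if len(self.cache) > self.cache_limit:
            self.cache.popitem(last=False)   # evict the oldest cached message
        return True

    def load(self, msg_id: str):
        """Read from the cache first, then fall back to the database."""
        if msg_id in self.cache:
            return self.cache[msg_id]
        self.db_reads += 1
        return self.database.get(msg_id)


def new_message_id() -> str:
    """32-character hex ID, the same shape as the example in the article."""
    return uuid.uuid4().hex


store = MessageStore(cache_limit=2)          # tiny limit to make eviction visible
ids = [new_message_id() for _ in range(3)]
for i, msg_id in enumerate(ids):
    store.save(msg_id, f"message {i}")

duplicate_accepted = store.save(ids[0], "message 0 (retried)")
oldest = store.load(ids[0])                  # evicted from the cache, read from the database
newest = store.load(ids[2])                  # still cached
print(duplicate_accepted, oldest, newest, store.db_reads)
```

The retried message is rejected because its ID is already known, and only the read for the evicted message reaches the database. A random UUID by itself carries no ordering information, so this sketch covers deduplication only.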
Future work includes enhancing message security, improving extensibility for new message types, standardizing the underlying protocol, and opening the platform for third‑party integrations.
Xianyu Technology
Official account of the Xianyu technology team