
Building a Million-User Group Chat System: Server-Side Architecture and Implementation

The article explains how to engineer a web-based group chat that supports one million members: WebSocket is chosen for real-time communication; messages are stored with a read-diffusion model; a three-layer architecture combines Redis-based routing with Kafka queues; ordered, reliable delivery relies on TCP, ACKs, and UUID-based deduplication; unread counts are computed with Redis ZSETs; and massive traffic is absorbed through rate limiting, Protobuf compression, and message chunking.

vivo Internet Technology

This article introduces the technical challenges and solutions encountered when building a Web-based million-user group chat system, covering communication protocol selection, message storage patterns, message ordering, message reliability, and unread count statistics.

Background and Requirements: The system needs to support one million group members with near real-time delivery, running entirely in the browser as an HTML5 (H5) page with no native client required. This far exceeds the group-size limits of mainstream IM products such as WeChat (500 members) and QQ (2,000 members).

Communication Technology: WebSocket was chosen for real-time bidirectional communication: short and long polling suit only scenarios with loose latency requirements, and SSE (Server-Sent Events) supports only server-to-client push.

Message Storage: The article compares read-diffusion (single group mailbox shared by all members) versus write-diffusion (individual mailbox per member). For groups with over 10,000 members, read-diffusion is preferred due to lower write overhead and storage costs, despite more complex read logic.
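The read-diffusion trade-off can be made concrete with a minimal sketch (class and method names are hypothetical, not from the article): a single mailbox per group means one write per message regardless of group size, while each member maintains only a read cursor on the reader side.

```python
from collections import defaultdict

class GroupMailbox:
    """Read-diffusion sketch: one shared mailbox per group; each member
    keeps only a read cursor instead of a private copy of every message."""

    def __init__(self):
        self.messages = defaultdict(list)   # group_id -> ordered message list
        self.cursors = defaultdict(dict)    # group_id -> {user_id: read index}

    def append(self, group_id, message):
        # One write per message regardless of member count, versus N writes
        # (one per member mailbox) under write-diffusion.
        self.messages[group_id].append(message)

    def pull(self, group_id, user_id):
        # The read path carries the complexity: each member computes its
        # own unread slice from the shared mailbox.
        start = self.cursors[group_id].get(user_id, 0)
        new = self.messages[group_id][start:]
        self.cursors[group_id][user_id] = len(self.messages[group_id])
        return new
```

Under write-diffusion, appending one message to a group of one million members would cost a million writes; here it costs one, at the price of per-reader bookkeeping.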

Architecture Design: The system consists of Connection Service (WebSocket management with hash table for group-user mapping), Group Service (group routing via Redis), and Message Queue (Kafka for decoupling). Key design patterns include: load balancing with round-robin + weight strategy, IO thread isolation from business threads for performance and fault isolation, and stateful connection management with periodic reporting to Group Service.
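The round-robin + weight strategy mentioned above can be sketched with the smooth weighted round-robin algorithm (the nginx variant); the article does not specify which weighted scheme is used, so this is one plausible implementation with hypothetical names.

```python
class WeightedRoundRobin:
    """Smooth weighted round-robin: on each pick, raise every node's
    current weight by its configured weight, choose the highest, then
    subtract the total weight from the winner. Higher-weight nodes are
    picked proportionally more often, without long bursts."""

    def __init__(self, nodes):              # nodes: {node_name: weight}
        self.weights = dict(nodes)
        self.current = {n: 0 for n in nodes}
        self.total = sum(nodes.values())

    def pick(self):
        for n, w in self.weights.items():
            self.current[n] += w
        best = max(self.current, key=self.current.get)
        self.current[best] -= self.total
        return best
```

With weights {"a": 5, "b": 1, "c": 1}, seven consecutive picks yield "a" five times, interleaved with "b" and "c" once each, so a beefier connection-service node absorbs proportionally more WebSocket sessions.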

Message Ordering: To ensure a consistent message display order across all users, the solutions include: relying on TCP for ordered delivery on each connection, hash-based thread-pool assignment per sender UID for internal processing, and a push-pull hybrid in which messages are cached in a per-group Redis sorted set and clients pull messages by ID range (startId to endId).
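Two of these ordering mechanisms can be sketched briefly (function names and the pool size are hypothetical): hashing a sender UID to a fixed worker serializes that user's messages, and an ID-range pull mirrors a range query on the per-group sorted set.

```python
import hashlib

NUM_WORKERS = 8  # hypothetical worker-pool size

def worker_for(uid: str) -> int:
    """Route every message from the same sender to the same worker queue,
    so per-sender processing order matches arrival order."""
    digest = hashlib.md5(uid.encode()).hexdigest()
    return int(digest, 16) % NUM_WORKERS

def pull_range(sorted_msgs, start_id, end_id):
    """Pull messages by ID range, analogous to a score-range query on the
    per-group Redis sorted set. sorted_msgs: (msg_id, payload) pairs
    already ordered by msg_id."""
    return [m for m in sorted_msgs if start_id <= m[0] <= end_id]
```

Because the hash is deterministic, a given user's messages never race across threads; global order across users is then reconciled by the pull side reading the sorted set.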

Message Reliability: No loss is achieved through the TCP protocol, an ACK mechanism with retry logic, and eventual consistency via the pull path. No duplication relies on a client-generated UUID with a unique index on (userId, UUID) in the database, plus a client-side deduplication map.
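The server-side half of the no-duplication scheme can be sketched as follows (class and field names are hypothetical); the set of (user_id, client_uuid) pairs plays the role of the database's unique index.

```python
class DedupStore:
    """Mimics a unique index on (user_id, client_uuid): the first insert
    wins; a duplicate caused by an ACK-timeout retry is detected and
    dropped, so at-least-once delivery becomes effectively exactly-once."""

    def __init__(self):
        self.seen = set()       # stands in for the DB unique index
        self.messages = []      # stands in for the persisted message table

    def save(self, user_id, client_uuid, payload):
        key = (user_id, client_uuid)
        if key in self.seen:    # retry of an already-persisted message
            return False
        self.seen.add(key)
        self.messages.append(payload)
        return True
```

The client performs the mirror-image check with its own deduplication map, so a message redelivered after a lost ACK is rendered only once.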

Unread Count: When a user opens a group, a per-group Redis ZSET capped at 100 entries stores the most recent message IDs; the unread count is the offset of the user's last-read message obtained via the ZREVRANK command. As a product-design compromise, the maximum displayed count is 99+.
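The ZREVRANK calculation reduces to a reverse-rank lookup in the capped window, sketched here in plain Python (the function name is hypothetical): the position of the last-read ID counted from the newest message is exactly the number of newer, i.e. unread, messages.

```python
RECENT_WINDOW = 100  # the per-group ZSET keeps only the latest 100 message IDs

def unread_count(recent_ids, last_read_id):
    """recent_ids: latest message IDs in ascending order (the ZSET contents).
    Equivalent to ZREVRANK: the reverse rank of last_read_id is the count
    of messages newer than it. An ID that has fallen out of the window
    means at least 100 unread messages, displayed as '99+'."""
    if last_read_id not in recent_ids:
        return "99+"
    return len(recent_ids) - 1 - recent_ids.index(last_read_id)
```

Capping the ZSET at 100 entries keeps both memory and the rank lookup cheap, which is what makes the 99+ display limit a natural product compromise rather than an arbitrary one.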

Super Large Group Strategy: Countermeasures against message storms include server-side rate limiting, concurrent HTTP callbacks to connection-service nodes, and a local cache at the connection service to reduce QPS against the group service. Message compression with Protobuf saves 43% of bandwidth. Chunked messaging merges outbound messages, flushing after a 1-second interval or a 10-message threshold (whichever comes first), to reduce IO system calls.
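The chunking rule above (flush on 10 messages or 1 second, whichever comes first) can be sketched as a small batcher; the class name, the `tick` timer hook, and the injectable clock are assumptions for illustration, not details from the article.

```python
import time

class MessageChunker:
    """Merge outbound messages: flush when MAX_BATCH are pending or
    MAX_WAIT seconds have passed since the first pending message,
    turning many small writes into one batched IO system call."""

    MAX_BATCH = 10
    MAX_WAIT = 1.0  # seconds

    def __init__(self, send_fn, clock=time.monotonic):
        self.send_fn = send_fn      # performs the single batched write
        self.clock = clock          # injectable for testing
        self.pending = []
        self.first_at = None

    def add(self, msg):
        if not self.pending:
            self.first_at = self.clock()
        self.pending.append(msg)
        if len(self.pending) >= self.MAX_BATCH:
            self.flush()

    def tick(self):
        # Assumed to be called periodically by a timer thread.
        if self.pending and self.clock() - self.first_at >= self.MAX_WAIT:
            self.flush()

    def flush(self):
        if self.pending:
            self.send_fn(list(self.pending))  # one write for the whole batch
            self.pending.clear()
```

Under a message storm, the size trigger dominates and each socket write carries ten messages; under light load, the time trigger bounds added latency to one second.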

Tags: distributed architecture · Redis · system design · Kafka · high concurrency · IM system · WebSocket · real-time communication · read diffusion · message ordering
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
