
Design and Implementation of a High‑Availability Scalable IM Group‑Chat Messaging System

This article presents a comprehensive design and implementation of a high‑availability, horizontally scalable instant‑messaging group‑chat system, detailing its architecture, component interactions, scaling strategies, reliability mechanisms, and extensions for offline and single‑chat messaging.

Architecture Digest

The article describes a production‑grade instant‑messaging (IM) group‑chat system designed for massive user counts and high concurrency. It explains why group chat is technically harder than one‑to‑one messaging: each inbound message must be fanned out in real time to every member of the room, while preserving ordering and delivery efficiency.

It contrasts a real‑time messaging system with a traditional message‑queue system, highlighting differences in group dynamics, ordering guarantees, and cost trade‑offs.

0. Minimal Implementation

A simple prototype built in September 2017 consisted of a single Proxy, three Brokers, and one Router. The system forwards room messages without persisting them, relies on UDP for transport, and scales only by redeploying the whole stack.

Key terminology:

1. Client: internal message publisher (not the mobile app client).
2. Proxy: external entry point that collects client messages and forwards them to a Broker.
3. Broker: forwards messages to all Gateways that have members of the target room.
4. Router: distributes login/logout notifications from Gateways to all Brokers.
5. Gateway: server‑side entry point that accepts client connections and forwards login/logout events to the Router.
6. Room Message: chat content inside a room.
7. Gateway Message: user login/logout information.

1. Scalability Design

The system adopts a partition‑replica architecture similar to database sharding. Both Room Messages and Gateway Messages are routed through Brokers, which scale horizontally. ZooKeeper serves as the registry that notifies components of new partitions or replicas.

1.1 Client

Loads the registry address from a config file.

Watches /pubsub/proxy to build an ordered ProxyArray.

Runs a thread that continuously watches the path for changes and a periodic poll as a fallback.

Assigns a Snowflake‑generated MessageID to each outgoing message and selects a Proxy via load‑balancing.
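The Snowflake MessageID generation can be sketched as follows. The article does not give the bit layout, so the common split of 41‑bit timestamp, 10‑bit worker ID, and 12‑bit sequence is assumed here; the class name and the custom epoch are illustrative.

```java
// Simplified Snowflake-style ID generator. Layout (assumed, not from the
// article): 41-bit millisecond timestamp | 10-bit worker ID | 12-bit sequence.
public final class SimpleSnowflake {
    private static final long EPOCH = 1504224000000L; // arbitrary custom epoch (2017-09-01 UTC)
    private final long workerId;                      // 0..1023
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SimpleSnowflake(long workerId) {
        if (workerId < 0 || workerId > 1023) throw new IllegalArgumentException("workerId");
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long ts = System.currentTimeMillis();
        if (ts == lastTimestamp) {
            sequence = (sequence + 1) & 0xFFF;        // 12-bit sequence within the same ms
            if (sequence == 0) {                      // sequence exhausted: spin to next ms
                while (ts <= lastTimestamp) ts = System.currentTimeMillis();
            }
        } else {
            sequence = 0;
        }
        lastTimestamp = ts;
        return ((ts - EPOCH) << 22) | (workerId << 12) | sequence;
    }
}
```

IDs from a single generator are strictly increasing, which is what lets later components order and de‑duplicate messages by ID alone.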

1.2 Proxy

Registers itself under /pubsub/proxy and obtains a replica ID.

Discovers Broker partitions via /pubsub/broker/partition(x) and the current partition count via /pubsub/broker/partition_num.

Watches the Broker path for new partitions, replica additions, or failures, and also sends periodic heartbeats to each replica.

Routes a room message to a specific Broker replica using the rule BrokerPartitionID = RoomID % BrokerPartitionNum and BrokerReplicaID = RoomID % BrokerPartitionReplicaNum.
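The routing rule reduces to two modulo operations; a minimal sketch (class and method names are illustrative):

```java
// Broker selection rule from the article:
//   BrokerPartitionID = RoomID % BrokerPartitionNum
//   BrokerReplicaID   = RoomID % BrokerPartitionReplicaNum
public final class BrokerRouting {
    public static int partitionId(long roomId, int partitionNum) {
        return (int) (roomId % partitionNum);
    }

    public static int replicaId(long roomId, int replicaNum) {
        return (int) (roomId % replicaNum);
    }
}
```

Because the mapping depends only on the RoomID and the current counts, every Proxy computes the same destination without coordination.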

The Proxy processing pipeline consists of three stages—receiving, protocol conversion, and sending—implemented with lock‑free queues, raising throughput from ~50 000 msg/s to ~200 000 msg/s.
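As an illustration of the staging idea (not the actual implementation), here is a toy three‑stage pipeline decoupled by queues. The JDK's ConcurrentLinkedQueue is a lock‑free, CAS‑based queue and stands in for whatever structure the real Proxy uses; the stage logic is deliberately trivial.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.atomic.AtomicInteger;

// Toy three-stage pipeline: receive -> protocol conversion -> send, with each
// stage on its own thread and lock-free queues between stages.
public final class ProxyPipeline {
    public static int run(int messages) {
        ConcurrentLinkedQueue<String> received = new ConcurrentLinkedQueue<>();
        ConcurrentLinkedQueue<String> converted = new ConcurrentLinkedQueue<>();
        AtomicInteger sent = new AtomicInteger();

        Thread receiver = new Thread(() -> {
            for (int i = 0; i < messages; i++) received.add("raw-" + i);
        });
        Thread converter = new Thread(() -> {
            int done = 0;
            while (done < messages) {
                String m = received.poll();               // lock-free dequeue
                if (m != null) { converted.add(m.replace("raw", "proto")); done++; }
            }
        });
        Thread sender = new Thread(() -> {
            int done = 0;
            while (done < messages) {
                if (converted.poll() != null) { sent.incrementAndGet(); done++; }
            }
        });
        receiver.start(); converter.start(); sender.start();
        try { receiver.join(); converter.join(); sender.join(); }
        catch (InterruptedException e) { Thread.currentThread().interrupt(); }
        return sent.get();
    }
}
```

The point of the design is that no stage ever blocks on a lock held by another stage, which is where the reported throughput gain comes from.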

1.3 Large‑Room Handling

Very large rooms are split into 64 virtual rooms (VRooms) using VRoomID = UserID % 64, allowing the existing pipeline to handle them without code changes.
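The split is a single modulo; with roughly uniform user IDs each VRoom receives an even share of the large room's members. A sketch (class name is illustrative):

```java
// VRoom split from the article: VRoomID = UserID % 64. Each virtual room is
// then routed through the normal Broker pipeline like any ordinary room.
public final class VRoomSplit {
    public static final int VROOM_COUNT = 64;

    public static int vroomId(long userId) {
        return (int) (userId % VROOM_COUNT);
    }
}
```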

1.4 Broker

Registers under /pubsub/broker/partitionN and loads its partition’s RoomGatewayList from the database.

Filters incoming Gateway Message and Room Message according to the partition rule, updates local routing cache, and forwards messages to the appropriate Gateways.

Supports partition expansion by doubling the partition count and synchronizing state via Zookeeper.
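The doubling strategy has a property worth spelling out: with the new count 2N, RoomID % 2N equals either RoomID % N or RoomID % N + N, so each old partition's rooms split between exactly two new partitions and no other data moves. A sketch of the check (names are illustrative):

```java
// After doubling from N to 2N partitions, a room either stays in its old
// partition p = RoomID % N or moves to p + N -- never anywhere else.
public final class PartitionDoubling {
    public static int newPartition(long roomId, int oldPartitionNum) {
        return (int) (roomId % (2L * oldPartitionNum));
    }
}
```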

1.5 Router

Registers under /pubsub/router , discovers Broker partitions, and watches for topology changes.

Forwards Gateway Message to every Broker replica in the target partition, ensuring consistent routing tables.

Maintains the maximum GatewayMsgID seen to discard stale messages.
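The staleness rule can be sketched as below. The article does not say whether the real Router keeps one counter globally or one per Gateway; a single counter is assumed here, and the class name is illustrative.

```java
// Staleness filter: remember the largest GatewayMsgID seen so far and drop
// anything at or below it (stale replays and duplicates).
public final class StalenessFilter {
    private long maxSeen = Long.MIN_VALUE;

    /** Returns true if the message is fresh and should be applied. */
    public synchronized boolean accept(long gatewayMsgId) {
        if (gatewayMsgId <= maxSeen) return false;
        maxSeen = gatewayMsgId;
        return true;
    }
}
```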

1.6 Gateway

Loads registry address, discovers Router replicas via /pubsub/router/ , and watches for changes.

Sends login/logout notifications as Gateway Message to the Router.

Receives Room Message and de‑duplicates them using a shared‑memory LRU cache of recent MessageID s before delivering to connected clients.
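A fixed‑capacity LRU keyed by MessageID is enough for this de‑duplication step. The article's cache lives in shared memory; as a stand‑in sketch, a LinkedHashMap in access order with eviction of the eldest entry gives the same behavior in‑process (class name is illustrative):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU cache of recently seen MessageIDs: the first sighting of an ID returns
// true, repeats return false, and old IDs are evicted once capacity is hit.
public final class DedupCache {
    private final int capacity;
    private final LinkedHashMap<Long, Boolean> seen;

    public DedupCache(int capacity) {
        this.capacity = capacity;
        // accessOrder=true makes iteration order least-recently-used first
        this.seen = new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
            @Override protected boolean removeEldestEntry(Map.Entry<Long, Boolean> e) {
                return size() > DedupCache.this.capacity;
            }
        };
    }

    /** Returns true the first time an ID is seen, false for duplicates. */
    public synchronized boolean firstSeen(long messageId) {
        return seen.put(messageId, Boolean.TRUE) == null;
    }
}
```

An evicted ID would be treated as new again if it reappeared, so the capacity has to cover the window in which duplicates can realistically arrive.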

2. System Stability

Latency is measured by injecting synthetic users that periodically send messages through the Proxy and aggregating timestamps at a synthetic Gateway. Heartbeat packets are exchanged between each layer (Client → Proxy → Router → Broker → Gateway) to detect failures quickly.

3. Message Reliability

Critical commands (e.g., game actions) are sent three times to different Proxies, each message carries a Snowflake ID, and the system relies on UDP retransmission and de‑duplication to achieve reliability without sacrificing throughput.

4. Router Cluster

When multiple group‑chat clusters are deployed, a dedicated Router cluster is introduced. The Router now uses partition‑replica design, non‑leader replication, and a two‑fold expansion strategy. The Router forwards Gateway Message to all Broker replicas in the target partition and writes metadata to a database asynchronously.

5. Offline Message Support

Two new services are added:

Pi – stores per‑user ordered lists of message IDs.

Xiu – stores the actual message payloads, assigns a 64‑bit MsgID (16‑bit partition, 10‑bit reserve, 38‑bit sequence), and persists data to RocksDB.

Clients retrieve offline messages by requesting a range of IDs from Pi and then fetching the corresponding payloads from Xiu. Both services use partition‑replica architecture with leader‑only writes and follower synchronization via heartbeats.
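Xiu's 64‑bit MsgID layout (16‑bit partition, 10‑bit reserve, 38‑bit sequence) can be packed and unpacked with shifts and masks. The article gives only the field widths; the high‑to‑low field order below is an assumption, and the class name is illustrative.

```java
// 64-bit MsgID layout from the article: 16-bit partition | 10-bit reserve |
// 38-bit sequence. Field order (partition in the high bits) is assumed.
public final class MsgId {
    private static final long SEQ_MASK = (1L << 38) - 1;

    public static long pack(int partition, int reserve, long sequence) {
        return ((long) (partition & 0xFFFF) << 48)
             | ((long) (reserve & 0x3FF) << 38)
             | (sequence & SEQ_MASK);
    }

    public static int partition(long msgId) { return (int) ((msgId >>> 48) & 0xFFFF); }
    public static int reserve(long msgId)   { return (int) ((msgId >>> 38) & 0x3FF); }
    public static long sequence(long msgId) { return msgId & SEQ_MASK; }
}
```

Putting the partition in the high bits keeps IDs from the same partition numerically clustered, which suits range requests like the Pi-to-Xiu lookup described above.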

6. Single‑Chat Extension

A new Relay module (partitioned by user ID) stores per‑user “Relay Data” and forwards both group and one‑to‑one messages to the appropriate Gateway. The original Broker no longer sends messages directly to Gateways; it forwards them to Relay, which then delivers them to the client.

7. Summary and Open Tasks

UDP transport is unreliable – pending fix.

Current load‑balancing uses simple round‑robin; weighted algorithms based on success rate and latency are planned.

Message deduplication based on IDs is not yet implemented.

Heartbeat mechanisms are being added to reduce reliance on the registry.

Offline‑message handling has been prototyped.

Priority handling, metrics collection, and dynamic hot‑cold load‑balancing are on the roadmap.

Tags: backend, Distributed Systems, IM, scalable architecture, Group Chat
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
