Design and Implementation of an Online Customer Service Instant Messaging System
This article details the design and implementation of an online customer service instant messaging system. It covers requirements analysis, the client‑server network model, the HTTP and WebSocket protocols, and distributed architecture choices (modulo routing rules, Redis shared memory, server‑master synchronization, and message‑queue broadcasting), and explains why Netty was selected as the development framework.
1. Overview
The company operates both telephone and online customer service. The telephone service is provided by the YTalk platform, which has been in production since 2020. Online service currently relies on a third‑party provider, increasing cost and limiting customization. To address these issues, an in‑house online customer service system was launched on June 20, 2021, and continuously enriched thereafter.
The core of the online service is an instant messaging (IM) system that enables real‑time text, file, and voice communication. This article shares the design considerations, technology choices, and pitfalls encountered while building the IM system, aiming to deepen readers' understanding and provide practical guidance.
2. System Design
2.1 Requirements Analysis
The IM system must support two user groups: customer service agents and customers (who may connect via web, app, or WeChat H5). Chat sessions are initiated by customers and assigned to an available agent. Agents can query the full chat history of any user. Voice and video are not required, and direct peer‑to‑peer chats between customers are prohibited.
2.2 Network Model
Client‑Server: clients communicate through a server that forwards messages. The server holds all connection information, which simplifies monitoring, but its maximum connection count becomes a performance bottleneck.
Peer‑to‑Peer: direct client‑to‑client connections offer privacy and avoid a central bottleneck, but they require NAT traversal and do not fit the agent‑centric, server‑mediated communication pattern of a customer service system.
Given the constraints, the Client-Server model was chosen.
2.3 Application Layer Protocol
Both HTTP long‑polling and WebSocket can deliver server‑pushed messages, but WebSocket provides a full‑duplex, persistent connection after a single handshake, making it the better fit for IM.
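For reference, the WebSocket upgrade is a single HTTP handshake; a minimal exchange looks like the following (the path and host are illustrative, and the key/accept pair is the example from RFC 6455):

```http
GET /im/chat HTTP/1.1
Host: example.com
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Key: dGhlIHNhbXBsZSBub25jZQ==
Sec-WebSocket-Version: 13

HTTP/1.1 101 Switching Protocols
Upgrade: websocket
Connection: Upgrade
Sec-WebSocket-Accept: s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

After the 101 response, the same TCP connection carries WebSocket frames in both directions with no further HTTP overhead.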
2.4 Distributed Architecture
To avoid a single‑point failure and support scaling, several distributed solutions were evaluated.
2.4.1 Modulo Routing Rule
Clients are assigned to a server based on a simple modulo of their identifier. This approach is easy to implement and has low overhead, but it suffers from load imbalance and requires all clients to reconnect when server count changes.
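The rule and its main weakness can be sketched in a few lines (class and method names are hypothetical):

```java
public class ModuloRouter {
    // Route a client to one of serverCount servers by hashing its id.
    // floorMod keeps the result non-negative even for negative hash codes.
    public static int route(String clientId, int serverCount) {
        return Math.floorMod(clientId.hashCode(), serverCount);
    }

    // Count how many of the given clients would land on a different server
    // if the cluster grows from oldCount to newCount servers.
    public static long reassigned(String[] clientIds, int oldCount, int newCount) {
        long moved = 0;
        for (String id : clientIds) {
            if (route(id, oldCount) != route(id, newCount)) moved++;
        }
        return moved;
    }
}
```

With plain modulo, most clients change servers whenever the server count changes, which is exactly why every client must reconnect after a scaling event.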
2.4.2 Redis Shared Memory
A central Redis store maintains a clientId:serverId mapping. When a client connects, its server writes the mapping; on disconnect, the server deletes it. Because a check‑then‑write sequence spans multiple Redis commands and is therefore not atomic, Lua scripts or explicit locking are needed to avoid race conditions during rapid reconnects.
Consider a client that disconnects and immediately reconnects: T1 is the original store, T2 the delete issued when the old connection closes, and T3 the store for the new connection. Both operations must be guarded:

delete:
if (delete.id == old.id) {
    del(delete.id); // ensure a late T2 delete does not remove the T3 entry
}

store:
if (newConnection.time >= old.time) {
    store(newConnection.id, newConnection); // ensure a late T1 store does not overwrite the T3 entry
}

This method introduces latency in routing‑table updates, which can cause temporary message delivery failures.
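The same guards can be shown with an in‑process map; this sketch uses a ConcurrentHashMap as a stand‑in for Redis (in production the checks would live in a Lua script so Redis executes them atomically; all names here are illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;

public class RoutingTable {
    // clientId -> connection record (owning server plus connect timestamp)
    public static final class Conn {
        public final String serverId;
        public final long time;
        public Conn(String serverId, long time) {
            this.serverId = serverId;
            this.time = time;
        }
    }

    private final ConcurrentHashMap<String, Conn> table = new ConcurrentHashMap<>();

    // Store only if the new connection is at least as recent as the old one,
    // so a delayed T1 store cannot overwrite a newer T3 store.
    public void store(String clientId, Conn conn) {
        table.merge(clientId, conn, (old, fresh) -> fresh.time >= old.time ? fresh : old);
    }

    // Delete only if the entry still belongs to the connection being closed,
    // so a stale T2 delete cannot remove a newer T3 entry.
    public void delete(String clientId, String serverId, long time) {
        table.computeIfPresent(clientId, (k, old) ->
            (old.serverId.equals(serverId) && old.time == time) ? null : old);
    }

    public Conn lookup(String clientId) {
        return table.get(clientId);
    }
}
```

merge and computeIfPresent each run their remapping function atomically per key, which is the property the Lua script provides on the Redis side.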
2.4.3 Server‑Master Synchronization
Each server stores its local connections and synchronizes them to a designated master server, which holds the global routing table. The master forwards messages to the appropriate server. A single master simplifies consistency, but introduces a single point of failure; therefore a master‑backup pair or a distributed consensus protocol is required for high availability.
2.4.4 Broadcast Strategy via Message Queue
Instead of selecting a target server, every server publishes incoming messages to a message‑queue topic. All servers consume the topic and deliver the message locally if they hold the corresponding client connection. This eliminates the need for a global routing table and simplifies failure handling, though the message queue can become a bottleneck under extreme load.
// Server receives a message
public void receive(Message message) {
mqTopic.send(message);
}
// Server listens to the queue
mqTopic.addListener(message -> sendToClient(message));
public void sendToClient(Message message) {
Channel channel = map.get(message.getUserId());
if (channel != null) {
    channel.writeAndFlush(message); // write() alone only buffers; flushing actually sends
}
}

The broadcast approach was chosen as the most suitable for the online customer service system's traffic profile.
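A self‑contained simulation makes the delivery rule concrete; an in‑memory listener list stands in for the message‑queue topic, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BroadcastDemo {
    // The "topic": every server subscribes; every message is fanned out to all.
    public static final List<Server> topic = new ArrayList<>();

    public static void publish(String userId, String text) {
        for (Server s : topic) s.onMessage(userId, text);
    }

    public static class Server {
        // Local connections: userId -> buffer standing in for a client channel.
        private final Map<String, StringBuilder> localConnections = new HashMap<>();

        public Server() {
            topic.add(this); // subscribe to the shared topic on startup
        }

        public void connect(String userId) {
            localConnections.put(userId, new StringBuilder());
        }

        // Deliver only if this server holds the target client's connection;
        // otherwise the broadcast is silently ignored.
        public void onMessage(String userId, String text) {
            StringBuilder channel = localConnections.get(userId);
            if (channel != null) channel.append(text);
        }

        public String inbox(String userId) {
            StringBuilder sb = localConnections.get(userId);
            return sb == null ? null : sb.toString();
        }
    }
}
```

No server needs to know where a client lives: each one consumes the full stream and keeps only what it can deliver, trading extra consumption work for a simpler, routing‑table‑free design.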
3. Development Framework
Java offers three I/O models: BIO (blocking), NIO (non‑blocking), and AIO (asynchronous). BIO ties one thread to each connection and scales poorly, while AIO lacks mature support on Linux (where it is layered over epoll rather than true kernel asynchronous I/O), so NIO was selected.
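To make the non‑blocking model concrete, here is a minimal, self‑contained NIO sketch: a single Selector thread accepts and reads a loopback connection, which is the event‑loop pattern that Netty wraps and industrializes (all names are illustrative):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;

public class NioDemo {
    // Accept one connection and echo its bytes back to the caller,
    // driving everything from a single selector loop.
    public static String roundTrip(String msg) {
        try (Selector selector = Selector.open();
             ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            server.configureBlocking(false);            // non-blocking accept
            server.register(selector, SelectionKey.OP_ACCEPT);

            // A plain blocking client on the other side of the loopback.
            SocketChannel client = SocketChannel.open(
                (InetSocketAddress) server.getLocalAddress());
            client.write(ByteBuffer.wrap(msg.getBytes(StandardCharsets.UTF_8)));
            client.shutdownOutput();                    // signal EOF to the server

            ByteBuffer buf = ByteBuffer.allocate(256);
            StringBuilder received = new StringBuilder();
            boolean done = false;
            while (!done) {
                selector.select();                      // block until a channel is ready
                for (SelectionKey key : selector.selectedKeys()) {
                    if (key.isAcceptable()) {
                        SocketChannel conn = server.accept();
                        conn.configureBlocking(false);  // non-blocking read
                        conn.register(selector, SelectionKey.OP_READ);
                    } else if (key.isReadable()) {
                        buf.clear();
                        int n = ((SocketChannel) key.channel()).read(buf);
                        if (n < 0) {                    // peer closed its write side
                            done = true;
                            key.channel().close();
                        } else {
                            buf.flip();
                            received.append(StandardCharsets.UTF_8.decode(buf));
                        }
                    }
                }
                selector.selectedKeys().clear();        // consume processed events
            }
            client.close();
            return received.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

One selector thread can multiplex thousands of such channels; the bookkeeping this sketch does by hand (key dispatch, buffer management, partial reads) is what Netty's event loop and codec pipeline take over.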
Netty abstracts Java NIO, handling the connection lifecycle, TCP packet splitting and sticking (half packets), idle detection, and write‑buffer backpressure. It supports WebSocket out of the box, aligns with the company's Java stack, and offers high performance through a reactor thread model, zero‑copy, and memory pooling. Consequently, Netty was adopted as the primary framework for the IM system.
4. Conclusion
System design involves evaluating multiple architectural options and balancing business requirements with technical trade‑offs. This article presented the reasoning behind the chosen network model, protocol, distributed routing strategy, and development framework for an online customer service IM system, providing insights that can guide similar projects.
Yang Money Pot Technology Team
Enhancing service efficiency with technology.