
Design and Implementation of a Real-Time Customer Service IM System

This article analyzes the challenges of real-time delivery, reliability, and message ordering in a customer service instant messaging system and presents solutions such as WebSocket adoption, ACK-based reliability, heartbeat mechanisms, and a structured message protocol to ensure efficient and stable communication.

Zhuanzhuan Tech

Introduction

In today's internet era, efficient user service is key to improving user experience. Zhuanzhuan's self-developed customer service IM system acts as a bridge between users and support agents, but along the way messages can be delayed, lost, reordered, or duplicated. This article explores these challenges and the technical solutions applied.

Compared with ordinary web systems, the IM message flow is longer and more complex, involving client‑to‑server and server‑to‑client stages; any fault can affect real‑time delivery, reliability, and completeness.

Real‑time

The primary concern is message latency: after a message is sent, the system must deliver it to the recipient as quickly as possible while minimizing resource consumption.

Solution 1: Long/Short Polling

Early PC web applications used a request‑response model with frequent short‑interval polling, which is easy to implement but wastes client bandwidth and server resources. Long polling improves this by keeping the connection open on the server until new data arrives, reducing unnecessary client requests, yet server load remains high.

Solution 2: WebSocket

With HTML5, full‑duplex WebSocket enables a persistent connection after a single handshake, allowing real‑time bidirectional data transfer.

| Criterion | Long/Short Polling | WebSocket |
|---|---|---|
| Browser support | ✅ All browsers | Most modern browsers |
| Server load | High | ✅ Depends on message volume |
| Client latency | Non-real-time; depends on poll interval | ✅ Real-time |
| Client resource consumption | Large | ✅ Low |
| Implementation complexity | ✅ Low | High |

Zhuanzhuan's IM system adopts WebSocket: when the server receives a new message, it pushes it to the recipient over the established WebSocket connection, so delivery does not wait for the next poll.
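The push path can be sketched as follows. This is a minimal illustration, assuming an in-memory registry that maps each user ID to a send callable for that user's open connection. `ConnectionRegistry`, `register`, `unregister`, and `push` are hypothetical names, not the system's actual API; in a real server the callable would wrap the send method of whichever WebSocket library is in use.

```python
from typing import Callable, Dict


class ConnectionRegistry:
    """Maps a user ID to the send function of that user's open connection."""

    def __init__(self) -> None:
        self._senders: Dict[str, Callable[[str], None]] = {}

    def register(self, user_id: str, send: Callable[[str], None]) -> None:
        # Called once the WebSocket handshake for this user has completed.
        self._senders[user_id] = send

    def unregister(self, user_id: str) -> None:
        # Called when the connection closes; unknown users are ignored.
        self._senders.pop(user_id, None)

    def push(self, user_id: str, payload: str) -> bool:
        """Push payload if the user is connected; return whether it was sent."""
        send = self._senders.get(user_id)
        if send is None:
            return False  # no open connection: caller must fall back to retry/history
        send(payload)
        return True
```

A `False` return only says that no connection was open at push time; it is the reliability mechanisms described in the next section that decide what happens to the message afterwards.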

Reliability

The message sending process consists of two phases: client‑to‑server and server‑to‑client. Any failure in these steps can cause message loss.

How to guarantee message delivery?

We borrow the ACK (acknowledgment) idea from TCP and apply it at the application layer to prevent loss. The ACK workflow is:

1. The sender attaches a unique identifier (sender ID + timestamp) and keeps a "waiting ACK" list locally.
2. The receiver stores the message and obtains a message ID.
3. The receiver sends the identifier back to the sender as an ACK.
4. Upon receiving the ACK, the sender removes the entry from the waiting list.
5. If no ACK arrives within a timeout, the sender retries or marks the message as failed.
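The sender side of this workflow might look roughly like the sketch below. It assumes a 3-second timeout and a cap of 3 attempts, neither of which comes from the article, and `AckTracker`, `make_ack_id`, and `check_timeouts` are hypothetical names. Time is passed in explicitly so the logic can be exercised without real timers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


def make_ack_id(sender_id: str, now: float) -> str:
    # Identifier described in the text: sender ID + timestamp (milliseconds here).
    return f"{sender_id}:{int(now * 1000)}"


@dataclass
class Pending:
    payload: str
    sent_at: float
    attempts: int = 1


class AckTracker:
    """Sender-side "waiting ACK" list with timeout-based retry."""

    def __init__(self, transmit: Callable[[str, str], None],
                 timeout: float = 3.0, max_attempts: int = 3) -> None:
        self._transmit = transmit          # assumed transport hook: (ack_id, payload)
        self._timeout = timeout            # assumed value, not from the article
        self._max_attempts = max_attempts  # assumed value, not from the article
        self._waiting: Dict[str, Pending] = {}
        self.failed: List[str] = []

    def send(self, sender_id: str, payload: str, now: float) -> str:
        ack_id = make_ack_id(sender_id, now)
        self._waiting[ack_id] = Pending(payload, now)
        self._transmit(ack_id, payload)
        return ack_id

    def on_ack(self, ack_id: str) -> None:
        # ACK received: the message is safe, stop tracking it.
        self._waiting.pop(ack_id, None)

    def check_timeouts(self, now: float) -> None:
        for ack_id, pending in list(self._waiting.items()):
            if now - pending.sent_at < self._timeout:
                continue
            if pending.attempts >= self._max_attempts:
                del self._waiting[ack_id]
                self.failed.append(ack_id)  # surface as "send failed" in the UI
            else:
                pending.attempts += 1
                pending.sent_at = now
                self._transmit(ack_id, pending.payload)  # retry with the same ID

    def waiting(self) -> List[str]:
        return list(self._waiting)
```

Retries reuse the original identifier, which is what later lets the other side recognize a retransmission as a duplicate.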

On the server side, maintaining ACK lists for all users is inefficient; we use delayed MQ messages to trigger retries when ACKs are missing.

When the server pushes a message, it also publishes a delayed MQ message for it. If the pushed message is still in the waiting list when the delayed MQ message is consumed, the ACK has not been received, and the system resends the message.
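The delayed-message check can be illustrated with an in-process stand-in for the message queue. The sketch below uses a heap ordered by due time instead of a real MQ, and the 5-second delay and retry cap are assumptions rather than values from the article; `DelayedRetry` and its methods are hypothetical names.

```python
import heapq
from typing import Callable, Dict, List, Tuple


class DelayedRetry:
    """Server-side retry driven by delayed messages (heap stands in for the MQ)."""

    def __init__(self, push: Callable[[str, str], None],
                 delay: float = 5.0, max_retries: int = 3) -> None:
        self._push = push                # assumed hook: push(user_id, payload)
        self._delay = delay              # assumed value, not from the article
        self._max_retries = max_retries  # assumed value, not from the article
        self._waiting: Dict[str, Tuple[str, str]] = {}  # msg_id -> (user_id, payload)
        self._queue: List[Tuple[float, int, str, int]] = []  # (due, seq, msg_id, retries)
        self._seq = 0

    def _publish_delayed(self, msg_id: str, retries: int, now: float) -> None:
        heapq.heappush(self._queue, (now + self._delay, self._seq, msg_id, retries))
        self._seq += 1

    def send(self, msg_id: str, user_id: str, payload: str, now: float) -> None:
        self._waiting[msg_id] = (user_id, payload)
        self._push(user_id, payload)
        self._publish_delayed(msg_id, 0, now)  # published alongside the push

    def on_ack(self, msg_id: str) -> None:
        self._waiting.pop(msg_id, None)

    def consume_due(self, now: float) -> None:
        # Plays the role of the MQ consumer for messages whose delay has elapsed.
        while self._queue and self._queue[0][0] <= now:
            _, _, msg_id, retries = heapq.heappop(self._queue)
            entry = self._waiting.get(msg_id)
            if entry is None:
                continue  # ACK already arrived; nothing to do
            if retries >= self._max_retries:
                del self._waiting[msg_id]  # give up; history fetch is the fallback
                continue
            self._push(*entry)
            self._publish_delayed(msg_id, retries + 1, now)
```

The appeal of this design is that the server never scans waiting lists on a timer: work happens only when a delayed message comes due, and an acknowledged message costs a single dictionary lookup.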

Additionally, whenever the client reloads the page or re-establishes its WebSocket connection, it fetches the full conversation history, so any message missed while disconnected is still recovered.

Data Duplication

Retransmission solves loss, but when an ACK is lost the sender retransmits a message the receiver already has, producing a duplicate. To avoid this, the original push and every retransmission carry the same unique ID, allowing the receiver to deduplicate.

Sender: Uses sender ID and timestamp as a unique ACK identifier and sends it to the server.
Server: Generates a message ID (e.g., via Snowflake), maps the ACK identifier to this ID, and pushes the ID to both sender and receiver.
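A small sketch of both halves of this scheme follows. The server-side mapping is idempotent, so a retransmitted message with the same ACK identifier gets the same message ID, and the receiver drops any ID it has already seen. The incrementing counter is only a stand-in for a Snowflake-style generator, and `MessageIdAllocator` and `Inbox` are hypothetical names.

```python
import itertools
from typing import Dict, List, Set


class MessageIdAllocator:
    """Server side: maps each ACK identifier to exactly one message ID."""

    def __init__(self) -> None:
        self._next = itertools.count(1)  # stand-in for a Snowflake-style generator
        self._by_ack: Dict[str, int] = {}

    def message_id_for(self, ack_id: str) -> int:
        # A retransmission carries the same ack_id, so it maps to the same ID.
        if ack_id not in self._by_ack:
            self._by_ack[ack_id] = next(self._next)
        return self._by_ack[ack_id]


class Inbox:
    """Receiver side: ignores message IDs it has already processed."""

    def __init__(self) -> None:
        self._seen: Set[int] = set()
        self.messages: List[str] = []

    def receive(self, message_id: int, payload: str) -> bool:
        """Store the message; return False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        self.messages.append(payload)
        return True
```

In a long-running client the seen set would need bounding, for example by keeping only recent IDs per conversation; that detail is omitted here.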

A complete message flow diagram is shown below.

Heartbeat Mechanism

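Both sending and receiving rely on long-lived connections; if a connection drops, communication fails. Human agents are also a limited resource, so the system needs to detect dead connections promptly and free the agents assigned to users who have left. We implement an application-level heartbeat for this purpose.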

Heartbeat design:

Client: After establishing a connection, start a timer to send a heartbeat every 30 seconds.
Server: Receive the heartbeat and update the last-heartbeat timestamp. Run a scheduled task to scan online users:
- If the last heartbeat is more than 30 seconds old, mark the user offline and close the connection.
- If it is more than 2 minutes old, consider the user fully left and release the agent resource.
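The server-side scan can be sketched as below, using the 30-second and 2-minute thresholds from the design above as defaults. `HeartbeatMonitor`, `Presence`, and the `release_agent` callback are hypothetical names, and closing the actual connection is left out. In practice the offline threshold is usually set somewhat above the client's send interval to tolerate network jitter, which is why both values are parameters here.

```python
from enum import Enum
from typing import Callable, Dict, Optional


class Presence(Enum):
    ONLINE = "online"
    OFFLINE = "offline"


class HeartbeatMonitor:
    """Tracks last-heartbeat timestamps and applies the two timeouts."""

    def __init__(self, release_agent: Callable[[str], None],
                 offline_after: float = 30.0, release_after: float = 120.0) -> None:
        self._release_agent = release_agent  # assumed hook to free the agent
        self._offline_after = offline_after
        self._release_after = release_after
        self._last_seen: Dict[str, float] = {}
        self._state: Dict[str, Presence] = {}

    def on_heartbeat(self, user_id: str, now: float) -> None:
        self._last_seen[user_id] = now
        self._state[user_id] = Presence.ONLINE

    def scan(self, now: float) -> None:
        # Intended to be run periodically by a scheduled task.
        for user_id, last in list(self._last_seen.items()):
            idle = now - last
            if idle > self._release_after:
                del self._last_seen[user_id]
                del self._state[user_id]
                self._release_agent(user_id)  # user has fully left
            elif idle > self._offline_after:
                # A real server would also close the connection here.
                self._state[user_id] = Presence.OFFLINE

    def state(self, user_id: str) -> Optional[Presence]:
        return self._state.get(user_id)
```

A user marked offline returns to online on the next heartbeat, while a user past the release threshold is forgotten entirely and must start a new session.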

The heartbeat payload is minimal, containing only status information.

Message Protocol

A well‑designed message format is essential for readability and extensibility. We categorize message data into several parts:

- Message type (e.g., heartbeat, user-to-agent, system).
- Agent ID.
- User ID.
- Message content, which varies by type (text, image, video, order card, coupon, etc.).
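One possible encoding of these parts is a small JSON envelope, sketched below. The field names (`msg_type`, `agent_id`, `user_id`, `content`) and the `kind` key inside `content` are illustrative assumptions; the article does not specify the wire format.

```python
import json
from dataclasses import asdict, dataclass
from typing import Any, Dict


@dataclass
class Envelope:
    msg_type: str            # e.g. "heartbeat", "user_to_agent", "system"
    agent_id: str
    user_id: str
    content: Dict[str, Any]  # shape depends on the content kind (text, image, ...)


def encode(envelope: Envelope) -> str:
    # Compact separators keep frames small, which matters most for heartbeats.
    return json.dumps(asdict(envelope), separators=(",", ":"))


def decode(raw: str) -> Envelope:
    # Raises TypeError if the payload has missing or unexpected fields.
    return Envelope(**json.loads(raw))
```

For example, a text message could be built as `Envelope("user_to_agent", "a42", "u1", {"kind": "text", "text": "hi"})`, and a new content kind such as a coupon card only adds a new `content` shape without changing the envelope itself.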

Conclusion

By adopting WebSocket, ACK mechanisms, message retransmission, and deduplication strategies, Zhuanzhuan's customer service IM system ensures real-time, reliable, and complete message delivery, enhancing communication efficiency and providing stable support for the platform. Future work will continue to optimize the system to meet evolving user expectations.
