
Architecture and High‑Concurrency Optimization Practices of NetEase IM Cloud Service

This article explains the layered architecture, connection management, high‑availability design, security mechanisms, and performance‑boosting techniques of NetEase's instant‑messaging cloud platform, illustrating how the system handles massive concurrent connections and ensures stable, fast, and secure message delivery.

Architecture Digest

Related Recommendations

IM push guarantee and network optimization details (1): How to achieve background keep‑alive without affecting user experience (http://netease.im/blog/im1-0604/)

IM push guarantee and network optimization details (2): How to combine long‑connection and push (http://netease.im/blog/im2-0607/)

IM push guarantee and network optimization details (3): How to optimize large data transmission in weak network environments (http://netease.im/blog/im3-0608/)

Key Points of This Article

Analysis of NetEase Cloud IM overall architecture

Client connection and access‑point management in Cloud IM

Service‑oriented design and high availability

NetEase IM Cloud Layered Architecture Diagram Analysis

1. Client SDK layer: Covers Android, iOS, Windows PC, Web, and embedded devices. Native SDKs maintain long connections over a custom protocol on TCP (OSI layer 4), while the Web SDK uses Socket.IO (layer 7) for its long connection. The layer also provides HTTP‑based API interfaces for third‑party servers and a UDP‑based A/V SDK for real‑time audio/video.

2. Gateway layer: Provides direct client access and maintains long connections. The Web SDK connects to the WebLink service (Socket.IO based), while the Android/iOS/PC SDKs connect to the Link service (TCP based). The layer also includes LBS (location‑based service) for optimal gateway selection and API services for third‑party requests.

How to ensure stability?

The NetEase Cloud IM SDK uses a long‑connection mechanism with heartbeat detection and automatic reconnection. It optimizes for weak networks, using TCP for mobile/PC clients and Socket.IO for web clients.
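The heartbeat-plus-reconnect pattern described above can be sketched in a few lines. This is a minimal illustration, not NetEase's actual protocol: the heartbeat interval, the one‑byte PING payload, and the backoff schedule are all placeholder assumptions.

```python
import socket
import time

class PersistentConnection:
    """Toy long-connection client with heartbeat and auto-reconnect.

    Hypothetical sketch: the interval, backoff steps, and the PING
    frame are illustrative assumptions, not NetEase's wire protocol.
    """

    HEARTBEAT_INTERVAL = 30          # seconds between keep-alive pings
    RECONNECT_BACKOFF = [1, 2, 4, 8] # exponential backoff (seconds)

    def __init__(self, host, port):
        self.host, self.port = host, port
        self.sock = None

    def connect(self):
        # Try to (re)establish the long connection, backing off between attempts.
        for delay in self.RECONNECT_BACKOFF:
            try:
                self.sock = socket.create_connection((self.host, self.port), timeout=5)
                return True
            except OSError:
                time.sleep(delay)
        return False

    def heartbeat_loop(self):
        # Periodically send a PING; on failure, fall back to reconnect.
        while True:
            try:
                self.sock.sendall(b"\x01")  # placeholder PING frame
            except OSError:
                if not self.connect():
                    break  # give up after exhausting the backoff schedule
            time.sleep(self.HEARTBEAT_INTERVAL)
```

In a real SDK the heartbeat interval is typically tuned for mobile radios (longer intervals conserve battery, shorter ones detect dead links faster), and the reconnect path re-runs key negotiation and authentication.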

How to achieve security?

All data transmitted over the public network is encrypted. During connection establishment, the client generates a one‑time session key, encrypts it with the server's public key (asymmetric encryption), and sends it to the server. The server decrypts it with its private key, and the session key is then used for symmetric stream‑level encryption of all subsequent traffic, preventing man‑in‑the‑middle and replay attacks.
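The handshake flow can be illustrated with a toy construction. The asymmetric wrap step is elided (a real system would use RSA or ECDH plus an authenticated cipher such as AES‑GCM); the SHA‑256‑based keystream below is a self‑contained stand‑in to show the stream‑encryption idea, not production cryptography.

```python
import hashlib
import secrets

def keystream(key: bytes, length: int) -> bytes:
    """Derive a pseudo-random keystream from the session key.

    Toy construction for illustration only -- a real implementation
    would use an authenticated cipher such as AES-GCM instead.
    """
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def stream_encrypt(key: bytes, data: bytes) -> bytes:
    # XOR with the keystream; applying the same call again decrypts.
    return bytes(a ^ b for a, b in zip(data, keystream(key, len(data))))

# Handshake flow as described in the article:
# 1. The client picks a fresh one-time session key.
session_key = secrets.token_bytes(32)
# 2. The client would encrypt session_key with the server's public key
#    and send it; the server decrypts it with its private key (elided).
# 3. Both sides then stream-encrypt all subsequent traffic:
packet = stream_encrypt(session_key, b"auth request")
assert stream_encrypt(session_key, packet) == b"auth request"
```

Because the session key is generated fresh per connection, a captured key from one session cannot decrypt another, which is what defeats replay of old handshakes.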

How to ensure speed?

Clients first use LBS to select the nearest gateway, then establish a long connection that greatly speeds up message flow. Data packets are compressed during transmission, and the SDK provides auto‑login and reconnection to reduce latency during foreground‑background switches.

Process of Establishing a Long Connection Between Client and Server

The SDK first requests the LBS service to obtain a list of gateway addresses based on appkey, client IP, SDK version, and environment tags. It then attempts connections sequentially, caching the last successful address list to accelerate future connections. If all addresses fail, it falls back to a default Link address.
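The candidate-ordering and fallback behavior above can be sketched as follows. The default address and the tuple-based address format are illustrative assumptions; the actual SDK's caching and selection logic is more involved.

```python
import socket

# Hypothetical hard-coded fallback used when LBS and cache both fail.
DEFAULT_LINK = ("link.invalid", 8080)

def connect_via_lbs(lbs_addresses, cached=None, timeout=3):
    """Try gateway addresses in order: the cached list from the last
    successful connection first, then the fresh LBS list, and finally
    the hard-coded default Link address.

    Returns (socket, address) on success, or (None, None) if every
    candidate fails. Sketch of the strategy described in the article.
    """
    candidates = list(cached or []) + list(lbs_addresses) + [DEFAULT_LINK]
    for addr in candidates:
        try:
            sock = socket.create_connection(addr, timeout=timeout)
            return sock, addr  # the caller caches `addr` for next launch
        except OSError:
            continue  # unreachable gateway: try the next one
    return None, None
```

Caching the last working address lets a returning client skip the LBS round trip entirely on a warm start, which is where much of the reconnect-latency win comes from.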

After obtaining a target address, the client establishes a TCP long connection, negotiates an encryption key, and sends an authentication packet. Successful authentication results in a secure, usable channel for RPC calls and server‑push messages; failure leads to connection termination.

Accelerated Nodes

Accelerated nodes replace unpredictable backbone links with higher‑quality ISP lines, reducing latency and improving stability, especially for cross‑region connections (e.g., a US client accessing a gateway in Hangzhou).

Message Delivery Concurrency Enhancement

Two delivery models are discussed:

1. Point‑to‑point Link : Each message requires the business layer to locate the recipient’s Link server, which becomes a bottleneck for large groups.

2. Broadcast Link : Members of the same chatroom are assigned to the same group of access points; the App only needs to broadcast to the set of Links for that room, greatly improving throughput.

Images illustrating these models are included in the original article.
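The throughput difference between the two models can be made concrete with a small sketch. The user and Link names are hypothetical; the point is that the broadcast model's cost scales with the number of Link nodes hosting the room, not with the number of members.

```python
def deliver_point_to_point(message, members, link_of_user):
    """Model 1: one routing lookup and one send per recipient,
    so the business layer does O(members) work per message."""
    return [(link_of_user[user], user, message) for user in members]

def deliver_broadcast(message, room_links):
    """Model 2: room members are pinned to a small, fixed set of
    Link nodes, so the App sends once per node, not per user."""
    return [(link, message) for link in room_links]

# Hypothetical example: a 6-member chatroom spread over 2 Link nodes.
link_of_user = {f"u{i}": f"link-{i % 2}" for i in range(6)}
room_links = sorted(set(link_of_user.values()))

assert len(deliver_point_to_point("hi", link_of_user, link_of_user)) == 6
assert len(deliver_broadcast("hi", room_links)) == 2
```

For a 100,000-member chatroom on a handful of access points, the broadcast model turns 100,000 routed sends into a few node-level broadcasts, which is where the throughput gain comes from.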

WebLink Evolution and Optimization

Initial solution: Single domain with SSL‑encrypted WebLink nodes behind LVS and Keepalived for HA. This lacked flexibility and was vulnerable to DDoS.

Second solution: Each WebLink node gets an independent domain, with LBS assigning clients to appropriate nodes, improving flexibility but still limited by per‑node SSL overhead.

Third solution (current): Front‑end Nginx performs layer‑7 proxying with SSL, while a pool of WebLink nodes handles traffic. LBS still assigns SDKs to access points, combining flexibility with improved performance.
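A minimal front-end configuration for the third solution might look like the following. This is a hedged sketch: the domain, addresses, paths, and ports are placeholders, not NetEase's actual deployment.

```nginx
# Hypothetical front end: Nginx terminates SSL at layer 7 and proxies
# Socket.IO (WebSocket) traffic to a pool of plain-HTTP WebLink nodes.
upstream weblink_pool {
    server 10.0.0.11:8100;
    server 10.0.0.12:8100;
}

server {
    listen 443 ssl;
    server_name weblink.example-im.com;   # placeholder domain

    ssl_certificate     /etc/nginx/certs/weblink.crt;
    ssl_certificate_key /etc/nginx/certs/weblink.key;

    location /socket.io/ {
        proxy_pass http://weblink_pool;
        # Headers required to upgrade the connection to WebSocket.
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
```

Centralizing SSL termination at the Nginx tier removes the per-node handshake overhead that limited the second solution, while the WebLink pool behind it can scale horizontally.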

Service‑Oriented and High‑Availability Practices for the Instant‑Messaging Platform

The gateway layer maintains client long connections but holds no business state, simply forwarding requests. A routing layer decouples the gateway from business services: business nodes register themselves in a service registry, and the routing layer selects an appropriate node for each request, enhancing elasticity.

Business nodes are deployed across multiple network environments for fault tolerance; if one environment fails, the routing layer redirects traffic to healthy clusters.

Gray‑release and dedicated‑service capabilities are supported by directing specific user traffic to upgraded or isolated clusters via routing configurations.
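The registry-plus-routing mechanism, including gray release, can be sketched as below. The service names, cluster tags, and addresses are hypothetical; a production registry would be ZooKeeper, etcd, or similar, with health checks and weighted balancing.

```python
import random

class Router:
    """Routing layer between gateways and business services.

    Hypothetical sketch: nodes register under a service name with a
    cluster tag; per-user rules steer traffic for gray releases or
    dedicated/isolated clusters.
    """

    def __init__(self):
        self.registry = {}    # service -> {cluster: [node, ...]}
        self.gray_rules = {}  # user_id -> cluster tag override

    def register(self, service, cluster, node):
        self.registry.setdefault(service, {}).setdefault(cluster, []).append(node)

    def route(self, service, user_id, default_cluster="stable"):
        # Gray-release rule wins; fall back to the stable cluster if the
        # target cluster has no healthy nodes.
        cluster = self.gray_rules.get(user_id, default_cluster)
        nodes = self.registry[service].get(cluster) or self.registry[service][default_cluster]
        return random.choice(nodes)  # simple random load balancing

router = Router()
router.register("msg-service", "stable", "10.0.1.1:9000")
router.register("msg-service", "canary", "10.0.2.1:9000")
router.gray_rules["user-42"] = "canary"   # gray-release this user

assert router.route("msg-service", "user-42") == "10.0.2.1:9000"
assert router.route("msg-service", "user-7") == "10.0.1.1:9000"
```

Failover across network environments follows the same shape: if a cluster's node list empties out, the lookup falls through to a healthy cluster without the gateway ever knowing.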

Source: https://juejin.im/post/5b1e2cc15188257d4529804b

Copyright statement: Content originates from the web, copyright belongs to the original author. We will delete any infringing material upon notice.

Tags: Cloud Services, backend architecture, scalability, high concurrency, security, Instant Messaging, Connection Management
Written by Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
