
Design and Implementation of Baidu's Unified Long‑Connection Service

Baidu’s Go‑based unified long‑connection service delivers secure, high‑concurrency, low‑latency connections for multiple Baidu apps through a four‑layer architecture (SDK, control, access, routing). It uses goroutine pooling, a two‑layer connection model, and business‑ID routing over a private binary protocol to support tens of millions of concurrent users and million‑level QPS, while simplifying integration and reducing maintenance costs.

Baidu Geek Talk

In the mobile‑Internet era, user expectations for real‑time and interactive services have driven the need for high‑performance long‑connection capabilities. This article presents Baidu’s internal unified long‑connection service, implemented in Go, and discusses its functional design, performance optimizations, and operational experience.

Abstract

The unified long‑connection service provides a secure, high‑concurrency, low‑latency, and easy‑to‑integrate solution for multiple Baidu apps (live streaming, messaging, push, cloud control, etc.). It eliminates duplicated development, reduces maintenance costs, and ensures professional, stable long‑connection capabilities across business lines.

Key Goals

Support the major Baidu app scenarios with a unified, secure long‑connection capability.

Guarantee high concurrency, high stability, and low latency.

Enable multi‑business reuse of a single connection to reduce resource consumption.

Provide a simple, clear integration process for downstream services.

Functional Overview

Connection establishment, maintenance, and management.

Upstream request forwarding.

Downstream data push (unicast, batch‑unicast, broadcast).

Challenges

The service must meet low‑latency, high‑concurrency, and high‑stability requirements while supporting many business scenarios. Maintaining separate long‑connection implementations for each business would cause duplicated effort and hinder rapid feature iteration.

Architecture

The system consists of four layers:

Unified Long‑Connection SDK (client side) – obtains a token and an endpoint from the control layer, establishes and maintains the connection, forwards requests from business SDKs, and receives server‑pushed data.

Control Layer – validates device legitimacy, issues tokens, selects appropriate access points and protocols, and performs traffic control.

Access Layer – the core long‑connection service, handling connection admission, maintenance, request forwarding, and downstream push. It maintains the connection‑ID ↔ connection‑info and group‑ID ↔ connection‑info mappings and uses separate goroutine pools for reading and writing.

Routing Layer – maintains the device‑ID ↔ connection‑info mapping to enable targeted push.
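To make the access‑layer mappings concrete, here is a minimal in‑memory registry sketch. It is an illustration only, not Baidu's code: the `Conn` type, its buffered `Send` channel, and all method names are hypothetical. A real access layer would hand each `Send` channel to that connection's writer goroutine, and the routing layer would keep an analogous device‑ID map.

```go
package main

import (
	"fmt"
	"sync"
)

// Conn is a hypothetical per-connection record. In a real server, Send would be
// drained by that connection's writer goroutine (not shown).
type Conn struct {
	ID   string
	Send chan []byte
}

// Registry holds two access-layer mappings: connection ID -> connection,
// and group ID -> set of member connection IDs.
type Registry struct {
	mu     sync.RWMutex
	conns  map[string]*Conn
	groups map[string]map[string]struct{}
}

func NewRegistry() *Registry {
	return &Registry{
		conns:  make(map[string]*Conn),
		groups: make(map[string]map[string]struct{}),
	}
}

// Add registers a connection under its ID.
func (r *Registry) Add(c *Conn) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.conns[c.ID] = c
}

// Join adds a registered connection to a group; it returns false for unknown IDs.
func (r *Registry) Join(groupID, connID string) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.conns[connID]; !ok {
		return false
	}
	if r.groups[groupID] == nil {
		r.groups[groupID] = make(map[string]struct{})
	}
	r.groups[groupID][connID] = struct{}{}
	return true
}

// enqueue is non-blocking so that one slow client cannot stall a push;
// if the buffer is full, the message is dropped in this sketch.
func enqueue(c *Conn, msg []byte) bool {
	select {
	case c.Send <- msg:
		return true
	default:
		return false
	}
}

// Unicast pushes msg to a single connection.
func (r *Registry) Unicast(connID string, msg []byte) bool {
	r.mu.RLock()
	c, ok := r.conns[connID]
	r.mu.RUnlock()
	return ok && enqueue(c, msg)
}

// Broadcast pushes msg to every member of a group and returns how many accepted it.
func (r *Registry) Broadcast(groupID string, msg []byte) int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	sent := 0
	for id := range r.groups[groupID] {
		if c, ok := r.conns[id]; ok && enqueue(c, msg) {
			sent++
		}
	}
	return sent
}

func main() {
	r := NewRegistry()
	r.Add(&Conn{ID: "c1", Send: make(chan []byte, 4)})
	r.Add(&Conn{ID: "c2", Send: make(chan []byte, 4)})
	r.Join("room-42", "c1")
	r.Join("room-42", "c2")
	fmt.Println(r.Broadcast("room-42", []byte("hello"))) // 2
	fmt.Println(r.Unicast("c1", []byte("hi")))           // true
}
```

Dropping messages when a client's buffer is full is just one possible slow‑client policy; the article does not say which policy the service actually applies.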

Core Process

Connection establishment – the SDK obtains a token and an endpoint from the control layer, then connects to the access layer.

Connection maintenance – the SDK sends periodic heartbeats to keep the connection alive.

Upstream request – a business SDK sends a request, the unified SDK packages it, and the access layer forwards it to the appropriate business server.

Downstream push – a business server sends a push request, the routing layer resolves the target connection, the access layer writes the data to that connection, and the SDK delivers it to the business SDK.

Performance Optimizations

Support for millions of concurrent connections and tens of thousands of QPS for connection establishment, upstream, and downstream traffic.

Request‑forwarding and downstream‑task groups spread work across a set of goroutines, so that no single goroutine becomes a bottleneck, while also keeping the total number of goroutines per instance down.

Two‑layer connection model (connection layer + session layer) isolates business logic from the underlying transport (TCP, TLS, WebSocket, QUIC), allowing seamless protocol upgrades.

State‑machine based connection lifecycle management ensures reliable reconnection and clear state transitions.

Multi‑Business Support

A private binary protocol is used, consisting of a header, common fields (device ID, app ID, business ID, metadata), and business‑specific payload. By parsing the business ID, the service can route data to the correct backend without interpreting business logic.
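The header + common fields + payload layout can be illustrated with Go's `encoding/binary` (Go 1.19+ for the `Append*` helpers). The field widths, big‑endian byte order, magic number, and length‑prefix scheme below are all assumptions, because the article does not publish the wire format. What the sketch does show is the key property: the gateway decodes only the common fields and dispatches on `BusinessID`, passing the payload through untouched.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// Hypothetical wire constants; the real format is not published.
const (
	magic     uint16 = 0x4C43 // "LC"
	version   byte   = 1
	headerLen        = 7 // magic(2) + version(1) + bodyLen(4)
)

var errShort = errors.New("frame too short")

// Frame carries the common fields named in the article plus an opaque payload.
type Frame struct {
	AppID, BusinessID uint32
	DeviceID          string
	Meta              []byte
	Payload           []byte // business-specific; never interpreted by the gateway
}

// appendLV appends a uint16 length prefix followed by the bytes themselves.
func appendLV(b, v []byte) []byte {
	b = binary.BigEndian.AppendUint16(b, uint16(len(v)))
	return append(b, v...)
}

// readLV reads one length-prefixed field and returns it with the remainder.
func readLV(b []byte) (v, rest []byte, err error) {
	if len(b) < 2 {
		return nil, nil, errShort
	}
	n := int(binary.BigEndian.Uint16(b))
	if len(b) < 2+n {
		return nil, nil, errShort
	}
	return b[2 : 2+n], b[2+n:], nil
}

// Encode writes the header (magic, version, body length) and then the body.
func Encode(f Frame) []byte {
	body := binary.BigEndian.AppendUint32(nil, f.AppID)
	body = binary.BigEndian.AppendUint32(body, f.BusinessID)
	body = appendLV(body, []byte(f.DeviceID))
	body = appendLV(body, f.Meta)
	body = append(body, f.Payload...)
	out := binary.BigEndian.AppendUint16(make([]byte, 0, headerLen+len(body)), magic)
	out = append(out, version)
	out = binary.BigEndian.AppendUint32(out, uint32(len(body)))
	return append(out, body...)
}

// Decode parses one complete frame, checking magic, version, and lengths.
func Decode(b []byte) (Frame, error) {
	var f Frame
	if len(b) < headerLen || binary.BigEndian.Uint16(b) != magic || b[2] != version {
		return f, errors.New("bad header")
	}
	n := int(binary.BigEndian.Uint32(b[3:headerLen]))
	if n < 8 || len(b)-headerLen < n {
		return f, errShort
	}
	body := b[headerLen : headerLen+n]
	f.AppID = binary.BigEndian.Uint32(body[0:4])
	f.BusinessID = binary.BigEndian.Uint32(body[4:8])
	dev, rest, err := readLV(body[8:])
	if err != nil {
		return f, err
	}
	meta, rest, err := readLV(rest)
	if err != nil {
		return f, err
	}
	f.DeviceID, f.Meta, f.Payload = string(dev), meta, rest
	return f, nil
}

func main() {
	// Hypothetical backends keyed by business ID; payloads pass through opaquely.
	backends := map[uint32]func([]byte) string{
		7: func(p []byte) string { return "im-backend got " + string(p) },
	}
	raw := Encode(Frame{AppID: 1, BusinessID: 7, DeviceID: "dev-123", Meta: []byte("v=2"), Payload: []byte("opaque")})
	f, err := Decode(raw)
	if err != nil {
		panic(err)
	}
	if h, ok := backends[f.BusinessID]; ok {
		fmt.Println(f.DeviceID, h(f.Payload)) // dev-123 im-backend got opaque
	}
}
```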

Deployment

Access points are deployed in East, North, and South China, plus a Hong Kong node for overseas traffic.

Clusters are sized per business importance; critical services have dedicated clusters, while secondary services share resources.

Each instance caps active connections at 100k‑200k to limit goroutine count and GC pressure.
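A per‑instance connection cap like the one above can be enforced at admission time with an atomic counter (Go 1.19+ for `atomic.Int64`). The `Admission` type is a hypothetical sketch. The limit would be set somewhere in the 100k‑200k range cited above. What happens to rejected clients is not described in the article; being steered to another access point by the control layer is only an example.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Admission caps the number of active connections on one access-layer instance.
type Admission struct {
	limit  int64        // e.g. somewhere in the 100k-200k range cited in the article
	active atomic.Int64 // current number of admitted connections
}

// TryAcquire reserves a slot for a new connection and returns false when the
// instance is already at its limit. The CAS loop keeps the check and the
// increment atomic under concurrent handshakes.
func (a *Admission) TryAcquire() bool {
	for {
		cur := a.active.Load()
		if cur >= a.limit {
			return false
		}
		if a.active.CompareAndSwap(cur, cur+1) {
			return true
		}
	}
}

// Release frees a slot when a connection closes.
func (a *Admission) Release() {
	a.active.Add(-1)
}

func main() {
	a := &Admission{limit: 2} // tiny limit just for the demo
	fmt.Println(a.TryAcquire(), a.TryAcquire(), a.TryAcquire()) // true true false
	a.Release()
	fmt.Println(a.TryAcquire()) // true
}
```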

Business Integration

Assess required capabilities (unicast, batch‑unicast, group‑cast, upstream support).

Estimate user scale to plan resources.

Integrate the client SDK.

Adapt server‑side interfaces according to the selected capabilities.

Request resources and launch the service.
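For the user‑scale estimation step, a back‑of‑the‑envelope instance count follows from the per‑instance cap in the Deployment section. This sketch uses integer ceiling division. The 10‑million peak and the 20% headroom are hypothetical planning inputs, not figures from the article.

```go
package main

import "fmt"

// instancesNeeded estimates access-layer instances for a peak number of
// concurrent connections. headroomPct (e.g. 20 = keep 20% spare per instance)
// is a hypothetical planning parameter.
func instancesNeeded(peakConns, perInstance, headroomPct int) int {
	usable := perInstance * (100 - headroomPct) / 100 // connections planned per instance
	if usable <= 0 {
		return -1 // invalid input: no usable capacity
	}
	return (peakConns + usable - 1) / usable // ceiling division
}

func main() {
	const peak = 10_000_000 // hypothetical peak concurrent connections
	for _, perInstance := range []int{100_000, 200_000} {
		fmt.Printf("cap %d: %d instances\n", perInstance, instancesNeeded(peak, perInstance, 20))
	}
}
```

With these inputs the 100k and 200k caps give 125 and 63 instances respectively, before any per‑region or per‑cluster redundancy is added.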

Summary & Future Plans

The unified long‑connection service now handles tens of millions of concurrent connections and supports million‑level upstream QPS and downstream UPS. It has proven stable during large‑scale events. Future work focuses on finer‑grained network quality metrics, intelligent adaptive connection parameters, and expanding to new business scenarios.
