How Zhihu Built a Scalable Long‑Connection Gateway for Real‑Time Messaging

Zhihu’s infrastructure team designed a high‑performance, scalable long‑connection gateway. It decouples business logic via publish‑subscribe, builds on OpenResty, Kafka, and Redis, implements fine‑grained ACLs and sliding‑window flow control, and provides message reliability and horizontal scalability for millions of concurrent devices.

Programmer DD

Why Long‑Connection Technology Matters

Real‑time experiences such as a typing indicator in WeChat, a coordinated "666" in a game, or live‑stream chat all rely on long‑connection technology.

Most internet companies run their own long‑connection systems for notifications, instant messaging, push, live comments, games, location sharing, stock quotes, and so on. When each business line builds its own long‑connection channel instead of sharing one, problems arise: duplicated infrastructure, higher client power consumption, and operational experience that is hard to reuse across teams.

Sharing a long‑connection gateway introduces requirements for authentication, authorization, data isolation, protocol extension, delivery guarantees, forward‑compatible protocol evolution, and capacity management.

How Do We Design the Communication Protocol?

Business Decoupling

The gateway connects many clients and many backend services via a single long‑connection channel. To avoid tight coupling, we adopt a classic publish‑subscribe model where only a Topic needs to be agreed upon. Messages are pure binary; the gateway does not need to understand business‑specific protocols.
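The decoupling described above can be sketched as a tiny in-process hub. This is only an illustration, not the gateway's code: the class name `TopicBus`, the handler signature, and the topic string are hypothetical. The point it shows is that publisher and subscriber share nothing except a topic name, and the hub forwards the payload as opaque bytes.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

# Hypothetical handler signature: (topic, opaque payload) -> None
Handler = Callable[[str, bytes], None]


class TopicBus:
    """Toy publish-subscribe hub keyed only by topic name."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: bytes) -> int:
        """Forward the payload untouched; return how many handlers received it."""
        handlers = list(self._subscribers.get(topic, ()))
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)


bus = TopicBus()
received: List[Tuple[str, bytes]] = []
bus.subscribe("live/42/chat", lambda topic, payload: received.append((topic, payload)))

# The hub never inspects the bytes, so any business encoding works.
delivered = bus.publish("live/42/chat", b"\x01\x02binary-body")
```

Because the hub never parses the payload, a business line can change its message encoding without any change on the gateway side.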

Permission Control

We use ACL rules with callback‑based authorization. For example, a Zhihu Live channel checks via an HTTP callback whether a user has paid before allowing subscription. To reduce integration effort, we also support Topic template variables that embed the username, allowing the gateway to enforce per‑user private topics without contacting the business service.
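A minimal sketch of how such rules might be evaluated follows. The rule format is an assumption made for illustration: the `{username}` placeholder, the `+` single-segment wildcard (borrowed from MQTT topic filters), and the `has_paid` function standing in for the HTTP callback are all hypothetical, as is the default-deny policy.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical callback type: (username, topic) -> allowed?
AuthCallback = Callable[[str, str], bool]
Rule = Tuple[str, Optional[AuthCallback]]


def topic_matches(template: str, topic: str, username: str) -> bool:
    """Compare segment by segment; '{username}' must equal the connected user."""
    template_parts = template.split("/")
    topic_parts = topic.split("/")
    if len(template_parts) != len(topic_parts):
        return False
    for expected, actual in zip(template_parts, topic_parts):
        if expected == "+":  # wildcard: any single segment
            continue
        if expected == "{username}":
            expected = username
        if expected != actual:
            return False
    return True


def can_subscribe(rules: List[Rule], username: str, topic: str) -> bool:
    """First matching rule decides; topics matching no rule are denied."""
    for template, callback in rules:
        if topic_matches(template, topic, username):
            return callback(username, topic) if callback else True
    return False


paid_users = {"alice"}


def has_paid(username: str, topic: str) -> bool:
    # Stand-in for the HTTP callback to the business service.
    return username in paid_users


rules: List[Rule] = [
    ("users/{username}/inbox", None),  # private topic, decided locally
    ("live/+/chat", has_paid),         # paid channel, decided by callback
]
```

With these rules, `alice` can subscribe to `users/alice/inbox` without any callback, while `bob` is rejected for the same topic because the template expands to his own name.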

Message Reliability

TCP guarantees ordered, reliable delivery only while a connection is alive; a dropped connection or a crashed process can still lose messages in flight. We therefore implement acknowledgments and retransmission: important messages are marked QoS 1 and stored in Redis until the client ACKs, and the broker retries delivery until it receives that ACK. For high‑throughput scenarios we also provide a queue‑based delivery path via Kafka.

Our protocol is based on MQTT, extended with authentication and authorization, and remains partially compatible with MQTT clients to lower integration cost.
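For readers unfamiliar with MQTT framing, the sketch below encodes a standard MQTT 3.1.1 PUBLISH packet with QoS 1 and the matching PUBACK. It covers only the base protocol; the gateway's authentication and authorization extensions are not described in this article and are not modeled here. The function names are illustrative.

```python
import struct


def encode_remaining_length(length: int) -> bytes:
    """MQTT variable-length integer: 7 bits per byte, high bit means 'more'."""
    if not 0 <= length <= 268_435_455:
        raise ValueError("remaining length out of range")
    out = bytearray()
    while True:
        byte = length % 128
        length //= 128
        if length > 0:
            byte |= 0x80
        out.append(byte)
        if length == 0:
            return bytes(out)


def encode_publish(topic: str, payload: bytes, packet_id: int,
                   qos: int = 1, dup: bool = False) -> bytes:
    if qos not in (0, 1):
        raise ValueError("this sketch only handles QoS 0 and QoS 1")
    topic_bytes = topic.encode("utf-8")
    body = struct.pack("!H", len(topic_bytes)) + topic_bytes
    if qos > 0:
        body += struct.pack("!H", packet_id)  # packet id is present only for QoS > 0
    body += payload
    # Fixed header: type PUBLISH (3) in the high nibble, then DUP, QoS, RETAIN flags.
    first_byte = (3 << 4) | (int(dup) << 3) | (qos << 1)
    return bytes([first_byte]) + encode_remaining_length(len(body)) + body


def encode_puback(packet_id: int) -> bytes:
    # PUBACK is packet type 4 with a fixed remaining length of 2.
    return bytes([0x40, 0x02]) + struct.pack("!H", packet_id)


packet = encode_publish("a/b", b"hi", packet_id=10)
ack = encode_puback(10)
```

The packet identifier is what ties a PUBACK back to the QoS 1 message it acknowledges.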

How Do We Design the System Architecture?

Key design goals:

Reliability

Horizontal scalability

Maturity of dependent components

We avoid a monolithic design by delegating storage, routing, and messaging to specialized components.

Core Components

Access layer built with OpenResty for load balancing and session persistence.

Long‑connection broker (containerized) handling protocol parsing, auth, session management, and pub‑sub logic.

Redis for persisting session data.

Kafka as the message queue for distributing messages to brokers or business services.

Both Kafka and Redis are industry‑standard, container‑ready, and support rapid scaling.

Access Layer Details

The access layer performs two tasks: load balancing across broker instances and session stickiness so a client always reaches the same broker.

We use Layer 7 load balancing keyed on a unique client identifier. Nginx’s preread buffer lets the access layer extract the identifier from the first packet of a connection, and consistent hashing on that identifier maps each client to a broker with minimal intrusion.
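The consistent-hashing step can be illustrated with a small hash ring. This is a generic sketch rather than the access layer's actual implementation: the `HashRing` class, the virtual-node count, the use of SHA-256, and the broker names are all assumptions.

```python
import bisect
import hashlib
from typing import List, Sequence, Tuple


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, nodes: Sequence[str], vnodes: int = 100) -> None:
        if not nodes:
            raise ValueError("at least one node is required")
        ring: List[Tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._keys = [point for point, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def lookup(self, client_id: str) -> str:
        """Return the first node clockwise from the client's hash."""
        idx = bisect.bisect_right(self._keys, self._hash(client_id)) % len(self._keys)
        return self._ring[idx][1]


ring3 = HashRing(["broker-a", "broker-b", "broker-c"])
ring2 = HashRing(["broker-a", "broker-b"])  # same ring after broker-c is removed

clients = [f"client-{i}" for i in range(500)]
moved = [c for c in clients if ring3.lookup(c) != ring2.lookup(c)]
```

Only clients that were on the removed broker end up in `moved`; every other client keeps its broker, which is the property that makes session stickiness survive scaling events.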

Publish and Subscribe Mechanics

Kafka acts as the internal hub. Four routing scenarios are supported:

Route to a Kafka topic without consumption (data reporting).

Route to a Kafka topic and consume it (instant messaging).

Consume directly from a Kafka topic for push‑only scenarios.

Route to one topic and consume from another for filtering/pre‑processing.

This flexible routing, combined with Kafka’s reliability, covers a wide range of messaging patterns.
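One way to picture the four scenarios is as a per-topic route with an optional Kafka topic to publish to and an optional Kafka topic to consume from. The `TopicRoute` structure, its field names, the scenario labels, and the Kafka topic names below are hypothetical; the article does not describe the real configuration format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TopicRoute:
    """Hypothetical routing entry for one gateway topic."""
    publish_to: Optional[str] = None    # Kafka topic that client messages go to
    consume_from: Optional[str] = None  # Kafka topic the broker pushes to clients

    def scenario(self) -> str:
        if self.publish_to and not self.consume_from:
            return "report-only"            # e.g. data reporting
        if self.consume_from and not self.publish_to:
            return "push-only"              # e.g. server-side push
        if self.publish_to and self.publish_to == self.consume_from:
            return "publish-and-consume"    # e.g. instant messaging
        if self.publish_to and self.consume_from:
            return "publish-then-filtered"  # another service filters in between
        raise ValueError("a route must publish, consume, or both")


routes = {
    "metrics/report": TopicRoute(publish_to="kafka.metrics"),
    "im/room": TopicRoute(publish_to="kafka.im", consume_from="kafka.im"),
    "push/news": TopicRoute(consume_from="kafka.push"),
    "live/chat": TopicRoute(publish_to="kafka.chat.raw", consume_from="kafka.chat.clean"),
}
scenarios = {name: route.scenario() for name, route in routes.items()}
```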

Subscription Storage Optimization

Initially we stored subscriptions in a single HashMap protected by a global lock, which caused severe contention. We sharded the map into hundreds of smaller maps, each with its own lock, dramatically reducing conflicts and improving performance.
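The sharding idea can be sketched as follows. The article does not say which language the broker is written in, so this Python version only illustrates the data structure; the class name, the shard count of 256, and the CRC32-based shard selection are assumptions.

```python
import threading
import zlib
from typing import Dict, List, Set


class ShardedSubscriptions:
    """topic -> set of session ids, split across independently locked shards."""

    def __init__(self, shard_count: int = 256) -> None:
        self._shards: List[Dict[str, Set[str]]] = [{} for _ in range(shard_count)]
        self._locks = [threading.Lock() for _ in range(shard_count)]

    def _index(self, topic: str) -> int:
        # crc32 gives a stable shard choice; a topic always maps to one shard.
        return zlib.crc32(topic.encode("utf-8")) % len(self._shards)

    def subscribe(self, topic: str, session_id: str) -> None:
        i = self._index(topic)
        with self._locks[i]:  # only this shard is blocked, not the whole table
            self._shards[i].setdefault(topic, set()).add(session_id)

    def unsubscribe(self, topic: str, session_id: str) -> None:
        i = self._index(topic)
        with self._locks[i]:
            sessions = self._shards[i].get(topic)
            if sessions is not None:
                sessions.discard(session_id)
                if not sessions:
                    del self._shards[i][topic]

    def subscribers(self, topic: str) -> Set[str]:
        i = self._index(topic)
        with self._locks[i]:
            return set(self._shards[i].get(topic, ()))  # copy, safe to iterate


store = ShardedSubscriptions(shard_count=16)


def worker(n: int) -> None:
    for i in range(100):
        store.subscribe(f"topic/{i % 10}", f"session-{n}-{i}")


threads = [threading.Thread(target=worker, args=(n,)) for n in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
total = sum(len(store.subscribers(f"topic/{t}")) for t in range(10))
```

Two operations contend only when their topics land in the same shard, so the chance of contention drops roughly in proportion to the number of shards.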

Session Management

When a message is delivered to a session, the broker checks whether its topic is high‑priority. If so, it marks the message QoS 1, stores it in the session’s pending queue in Redis, and removes it only after the client ACKs. Because the pending queue lives in Redis rather than in broker memory, unacknowledged messages survive broker restarts.
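The delivery path above can be sketched with an in-memory stand-in for the Redis pending queue. Everything here is illustrative: `PendingStore`, `deliver`, the `send` callable, and the topic names are hypothetical, and a real deployment would keep the pending entries in Redis instead of a Python dict.

```python
from collections import OrderedDict
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical transport hook: (packet_id, payload, qos) -> None
Send = Callable[[int, bytes, int], None]


class PendingStore:
    """In-memory stand-in for the per-session pending queue kept in Redis."""

    def __init__(self) -> None:
        self._pending: Dict[str, "OrderedDict[int, bytes]"] = {}

    def save(self, session_id: str, packet_id: int, payload: bytes) -> None:
        self._pending.setdefault(session_id, OrderedDict())[packet_id] = payload

    def ack(self, session_id: str, packet_id: int) -> bool:
        """Remove the entry; return False for unknown or duplicate ACKs."""
        queue = self._pending.get(session_id)
        if queue is None:
            return False
        return queue.pop(packet_id, None) is not None

    def unacked(self, session_id: str) -> List[Tuple[int, bytes]]:
        return list(self._pending.get(session_id, {}).items())


def deliver(store: PendingStore, send: Send, session_id: str, packet_id: int,
            topic: str, payload: bytes, high_priority: Set[str]) -> int:
    qos = 1 if topic in high_priority else 0
    if qos == 1:
        store.save(session_id, packet_id, payload)  # persist before sending
    send(packet_id, payload, qos)
    return qos


store = PendingStore()
wire: List[Tuple[int, bytes, int]] = []
HIGH_PRIORITY = {"im/alice"}


def send(packet_id: int, payload: bytes, qos: int) -> None:
    wire.append((packet_id, payload, qos))


qos_im = deliver(store, send, "sess-1", 1, "im/alice", b"hello", HIGH_PRIORITY)
qos_feed = deliver(store, send, "sess-1", 2, "feed/hot", b"news", HIGH_PRIORITY)
pending_before_ack = store.unacked("sess-1")
```

Saving before sending means a crash between the two steps can produce a duplicate on retry but never a silent loss, which is why the receiver must tolerate duplicates.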

Sliding‑Window Flow Control

Inspired by TCP, we cap the number of in‑flight QoS 1 messages with a configurable sliding window, which allows several messages to be sent in parallel while preserving order. Unacknowledged messages are retransmitted only after the client reconnects, and the receiver de‑duplicates them.
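A minimal sketch of this flow control, under simplifying assumptions: packet ids grow without wrapping (real MQTT ids are 16-bit), the receiver remembers every id it has seen, and the class names `SlidingWindowSender` and `DedupReceiver` are hypothetical.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Set, Tuple

# Hypothetical transport hook: (packet_id, payload, dup_flag) -> None
Send = Callable[[int, bytes, bool], None]


class SlidingWindowSender:
    def __init__(self, send: Send, window_size: int = 3) -> None:
        self._send = send
        self._window_size = window_size
        self._queue: Deque[bytes] = deque()      # not yet sent
        self._in_flight: Dict[int, bytes] = {}   # sent, waiting for ACK (insertion order)
        self._next_id = 1

    def enqueue(self, payload: bytes) -> None:
        self._queue.append(payload)
        self._pump()

    def _pump(self) -> None:
        # Send in order until the window is full.
        while self._queue and len(self._in_flight) < self._window_size:
            packet_id = self._next_id
            self._next_id += 1
            payload = self._queue.popleft()
            self._in_flight[packet_id] = payload
            self._send(packet_id, payload, False)

    def on_ack(self, packet_id: int) -> None:
        if self._in_flight.pop(packet_id, None) is not None:
            self._pump()  # a freed slot lets the next queued message go out

    def on_reconnect(self) -> None:
        # Retransmit everything still unacknowledged, oldest first, flagged as duplicate.
        for packet_id, payload in self._in_flight.items():
            self._send(packet_id, payload, True)


class DedupReceiver:
    def __init__(self) -> None:
        self._seen: Set[int] = set()
        self.delivered: List[bytes] = []

    def on_publish(self, packet_id: int, payload: bytes, dup: bool) -> int:
        if packet_id not in self._seen:
            self._seen.add(packet_id)
            self.delivered.append(payload)
        return packet_id  # always ACK, even for duplicates


sent: List[Tuple[int, bytes, bool]] = []
sender = SlidingWindowSender(lambda pid, p, dup: sent.append((pid, p, dup)), window_size=2)
for body in (b"a", b"b", b"c"):
    sender.enqueue(body)       # only "a" and "b" fit in the window
sender.on_ack(1)               # frees a slot, so "c" is sent
sender.on_reconnect()          # "b" and "c" are still unacknowledged

receiver = DedupReceiver()
for pid, payload, dup in sent:
    receiver.on_publish(pid, payload, dup)
```

In this run the receiver sees `b` and `c` twice but delivers each payload to the application once, in the original order.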

Conclusion

The infrastructure team provides a rock‑solid foundation for Zhihu’s massive traffic, delivering reliable, scalable real‑time communication for millions of users.
