
Design and Implementation of Zhihu's Long Connection Gateway

This article explains how Zhihu designed a scalable long‑connection gateway that decouples business logic via a publish‑subscribe model, implements ACL‑based authorization, ensures message reliability with acknowledgments and sliding windows, and leverages OpenResty, Redis, and Kafka for load‑balanced, fault‑tolerant backend services.

Architecture Digest

Design of the Communication Protocol

Zhihu's long‑connection gateway uses a publish‑subscribe model to decouple clients, the gateway, and business back‑ends. Topics are the only contract between them, so payloads can be arbitrary binary data and the gateway never needs to understand business protocols. ACL rules enforce fine‑grained permissions, supporting callback‑based authentication and topic‑template variables for per‑user isolation.

Business Decoupling

The many‑to‑many relationship between clients and services is kept loosely coupled by publishing and subscribing to topics, avoiding protocol entanglement and simplifying upgrades.
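
As a rough illustration of topics being the only contract, the sketch below routes opaque byte payloads by topic name. `TopicBus` and the topic string are hypothetical names for this example, and an in-process dictionary stands in for the real gateway and message bus.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

# A handler receives the topic name and the untouched binary payload.
Handler = Callable[[str, bytes], None]


class TopicBus:
    """Hypothetical in-memory stand-in for topic-based routing."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, payload: bytes) -> int:
        """Deliver the payload unchanged to every subscriber of the topic.

        Returns the number of handlers that received it. The bus never
        inspects the payload, so business protocols stay opaque to it.
        """
        handlers = list(self._subscribers.get(topic, ()))
        for handler in handlers:
            handler(topic, payload)
        return len(handlers)
```

Because publishers and subscribers only share a topic name, either side can change its payload encoding without touching the routing layer.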

Permission Control

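Authorization uses two mechanisms. For decisions that depend on business state, the gateway issues an HTTP callback to the business service (e.g., Zhihu Live). For per‑user isolation, user identifiers are embedded in topic names through template variables, so the gateway can decide access on its own without contacting the business side.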
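
The topic-template idea can be sketched as follows. The rule format, the `{username}` placeholder syntax, and the names `AclRule` and `is_allowed` are assumptions made for this example; the article does not specify the gateway's actual ACL syntax.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AclRule:
    action: str    # "subscribe" or "publish" (assumed action names)
    template: str  # e.g. "/users/{username}/messages" (assumed syntax)


def is_allowed(rules: Iterable[AclRule], action: str,
               topic: str, username: str) -> bool:
    """Return True if any rule permits `action` on `topic` for this user.

    The template variable is filled in with the authenticated username,
    so the check is local and needs no call to the business service.
    """
    for rule in rules:
        if rule.action != action:
            continue
        if rule.template.replace("{username}", username) == topic:
            return True
    return False
```

With a single rule such as `AclRule("subscribe", "/users/{username}/messages")`, each user can subscribe to their own message topic and to nobody else's.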

Message Reliability Guarantees

Messages are acknowledged by clients; the gateway stores unacknowledged messages in Redis and retries until ACK is received. For high‑throughput scenarios, a Kafka‑based pipeline is used to avoid per‑message ACK overhead.
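
A minimal sketch of the acknowledge-and-retry bookkeeping is shown below. A plain dictionary stands in for Redis, and `PendingStore` with its per-client message IDs is a hypothetical design for illustration, not the gateway's actual storage schema.

```python
from typing import Dict, List, Tuple


class PendingStore:
    """Tracks unacknowledged messages per client (dict stands in for Redis)."""

    def __init__(self) -> None:
        self._pending: Dict[str, Dict[int, bytes]] = {}
        self._next_id: Dict[str, int] = {}

    def enqueue(self, client_id: str, payload: bytes) -> int:
        """Record a message as in flight and return its message ID."""
        msg_id = self._next_id.get(client_id, 0) + 1
        self._next_id[client_id] = msg_id
        self._pending.setdefault(client_id, {})[msg_id] = payload
        return msg_id

    def ack(self, client_id: str, msg_id: int) -> bool:
        """Drop an acknowledged message; False if it was not pending."""
        removed = self._pending.get(client_id, {}).pop(msg_id, None)
        return removed is not None

    def unacked(self, client_id: str) -> List[Tuple[int, bytes]]:
        """Messages that still need retransmission, in ID order."""
        return sorted(self._pending.get(client_id, {}).items())
```

A retry timer or a reconnect handler would call `unacked()` and resend whatever it returns; because the state lives outside the broker process, the same lookup also works after a broker restart.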

System Architecture

The architecture consists of four core components: an OpenResty access layer for load balancing and session stickiness, a containerized long‑connection broker handling protocol parsing, authentication, and pub/sub logic, Redis for persisting session state, and Kafka as the internal message bus.

Access Layer

OpenResty (Nginx with embedded Lua) performs layer‑4 load balancing keyed on a client identifier that it extracts through Nginx's preread mechanism, so a client is routed to the same broker even after its network changes.
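
One way to map a client identifier to a broker is a consistent-hash ring, sketched below. The article only says that routing is based on the client identifier; consistent hashing, the `HashRing` class, and the broker names are assumptions for this example, and the production logic runs in OpenResty rather than Python.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def _hash(key: str) -> int:
    """Stable 64-bit hash, independent of process or source IP."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, brokers: Iterable[str], vnodes: int = 100) -> None:
        ring: List[Tuple[int, str]] = sorted(
            (_hash(f"{broker}#{i}"), broker)
            for broker in brokers
            for i in range(vnodes)
        )
        if not ring:
            raise ValueError("at least one broker is required")
        self._ring = ring
        self._keys = [point for point, _ in ring]

    def pick(self, client_id: str) -> str:
        """Return the broker owning the first ring point after the client's hash."""
        idx = bisect.bisect(self._keys, _hash(client_id)) % len(self._ring)
        return self._ring[idx][1]
```

Since the choice depends only on the client identifier, a client that switches from Wi-Fi to cellular still lands on the same broker, and removing one broker only remaps the clients that were assigned to it.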

Publish and Subscribe

Kafka routes messages between brokers and business services, supporting four routing patterns: fire‑and‑forget, normal IM, pure downstream, and filtered preprocessing. This design provides high scalability and reliability.

Subscription Management

The initial implementation used a single HashMap guarded by a global lock, which became a bottleneck. The fix was to shard the map into hundreds of HashMaps, each with its own lock, which reduces lock contention and improves throughput.
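
The sharding approach can be sketched like this. The class name, the CRC32 shard function, and the default of 256 shards are assumptions for illustration; the article states only that the map was split into hundreds of shards.

```python
import threading
import zlib
from typing import Dict, List, Set


class ShardedSubscriptions:
    """Topic -> subscriber set, split across independently locked shards."""

    def __init__(self, shard_count: int = 256) -> None:
        self._shards: List[Dict[str, Set[str]]] = [
            {} for _ in range(shard_count)
        ]
        self._locks = [threading.Lock() for _ in range(shard_count)]

    def _index(self, topic: str) -> int:
        # crc32 is stable across processes, unlike Python's built-in hash().
        return zlib.crc32(topic.encode("utf-8")) % len(self._shards)

    def subscribe(self, topic: str, client_id: str) -> None:
        i = self._index(topic)
        with self._locks[i]:  # only this shard is blocked
            self._shards[i].setdefault(topic, set()).add(client_id)

    def unsubscribe(self, topic: str, client_id: str) -> None:
        i = self._index(topic)
        with self._locks[i]:
            clients = self._shards[i].get(topic)
            if clients is None:
                return
            clients.discard(client_id)
            if not clients:
                del self._shards[i][topic]

    def subscribers(self, topic: str) -> Set[str]:
        i = self._index(topic)
        with self._locks[i]:
            # Return a copy so callers can iterate without holding the lock.
            return set(self._shards[i].get(topic, ()))
```

Two operations contend only when their topics hash to the same shard, so with hundreds of shards most subscribe and publish paths proceed without waiting on each other.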

Session Persistence

Sessions store QoS‑1 messages in Redis until the client ACKs them, allowing seamless failover when a broker restarts.

Sliding Window Transmission

Inspired by TCP, a sliding‑window mechanism permits multiple messages to be in flight simultaneously, increasing throughput while preserving order because the underlying transport is TCP.
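
The sender side of such a window might look like the sketch below. The class name, the sequence numbering, and the use of cumulative ACKs (an ACK for `seq` confirms everything up to `seq`) are assumptions for this example; the article does not describe the gateway's exact ACK semantics.

```python
from collections import deque
from typing import Deque, List, Tuple


class SlidingWindowSender:
    """Keeps at most `window_size` messages in flight at once."""

    def __init__(self, window_size: int) -> None:
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        self._window_size = window_size
        self._queue: Deque[bytes] = deque()                 # not yet sent
        self._in_flight: Deque[Tuple[int, bytes]] = deque()  # sent, unacked
        self._next_seq = 1

    def submit(self, payload: bytes) -> None:
        self._queue.append(payload)

    def pump(self) -> List[Tuple[int, bytes]]:
        """Return the messages that may be sent now without overfilling the window."""
        out: List[Tuple[int, bytes]] = []
        while self._queue and len(self._in_flight) < self._window_size:
            item = (self._next_seq, self._queue.popleft())
            self._next_seq += 1
            self._in_flight.append(item)
            out.append(item)
        return out

    def ack(self, seq: int) -> None:
        """Cumulative ACK: release every in-flight message up to `seq`."""
        while self._in_flight and self._in_flight[0][0] <= seq:
            self._in_flight.popleft()
```

With a window of 1 this degenerates to stop-and-wait; a larger window lets the sender keep the connection busy while earlier ACKs are still on their way back.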

Overall, the gateway balances reliability, horizontal scalability, and component maturity, leveraging widely adopted open‑source technologies to handle millions of concurrent connections and massive message volumes.
