Insights from the 58 Group Technical Salon: Cloud Platform Architecture Practices by Zhihu and 58
The article summarizes the 58 Group technical salon where Zhihu and 58 teams shared their cloud platform evolution, containerization strategies, multi‑cluster management, network architecture, service discovery, and real‑world case studies, highlighting challenges and solutions for large‑scale Kubernetes deployments.
On February 27, 2019, the 58 Group Technical Salon (Session 9 – “Cloud Platform Architecture”) was held at the Beijing headquarters, featuring presentations by Zhihu’s container platform team and 58’s TEG cloud platform team on their container‑cloud practices.
Zhihu Cloud Platform Practice
Zhihu began containerizing production workloads in 2015, using Mesos initially and migrating to Kubernetes by the end of 2017. The platform now runs business containers as well as infrastructure services such as Kafka and HBase.
The service framework relies on Consul for service registration/discovery and HAProxy for load balancing, enabling rate‑limiting and circuit‑breaking.
To overcome Kubernetes cluster size limits, Zhihu supports multi‑cluster management, horizontal scaling, cross‑cluster disaster recovery, and hybrid‑cloud integration for burst traffic.
Case Study – etcd Failure
In large clusters, the high volume of Kubernetes events stressed etcd and caused outages. The fixes were to isolate events in a separate etcd cluster, clean up stale data regularly, and move etcd storage to SSDs.
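One common way to isolate events is through kube-apiserver flags. The sketch below builds such a flag list, using the real flags --etcd-servers, --etcd-servers-overrides, and --event-ttl. The endpoint names are hypothetical, and the source does not say which exact settings Zhihu used.

```python
def apiserver_etcd_flags(main_endpoints, event_endpoints, event_ttl="1h0m0s"):
    """Build kube-apiserver flags that store Event objects in a separate etcd.

    main_endpoints / event_endpoints are lists of etcd client URLs
    (hypothetical values in the usage example below).
    """
    if not main_endpoints or not event_endpoints:
        raise ValueError("both etcd clusters need at least one endpoint")
    return [
        # Default storage backend for every resource type.
        "--etcd-servers=" + ",".join(main_endpoints),
        # Per-resource override: core-group events go to the second cluster.
        # Servers inside one override are separated by semicolons.
        "--etcd-servers-overrides=/events#" + ";".join(event_endpoints),
        # How long events are retained before the API server expires them.
        "--event-ttl=" + event_ttl,
    ]


if __name__ == "__main__":
    for flag in apiserver_etcd_flags(
        ["https://etcd-main-0:2379", "https://etcd-main-1:2379"],
        ["https://etcd-events-0:2379"],
    ):
        print(flag)
```

Keeping events on their own cluster means a burst of event writes cannot slow down reads and writes of pods, nodes, and other core objects.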
Case Study – Kubernetes Eviction
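When many nodes lose their heartbeat at once, the node controller can evict and reschedule containers en masse. The kube-controller-manager flag "--unhealthy-zone-threshold" limits the eviction scope: once the fraction of not-ready nodes in a zone reaches the threshold, eviction is slowed down or paused.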
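The effect of the threshold can be illustrated with a simplified model of the node controller's per-zone decision. This is a sketch based on the documented kube-controller-manager defaults, not the controller's actual code; the function and class names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvictionConfig:
    # Defaults mirror the documented kube-controller-manager flags.
    node_eviction_rate: float = 0.1            # nodes per second, healthy zone
    secondary_node_eviction_rate: float = 0.01 # reduced rate, unhealthy zone
    unhealthy_zone_threshold: float = 0.55     # fraction of not-ready nodes
    large_cluster_size_threshold: int = 50     # zones at or below this stop evicting


def zone_state(ready, not_ready, cfg):
    """Classify one zone as normal, partially disrupted, or fully disrupted."""
    if ready == 0 and not_ready > 0:
        return "full_disruption"
    total = ready + not_ready
    if not_ready > 2 and not_ready / total >= cfg.unhealthy_zone_threshold:
        return "partial_disruption"
    return "normal"


def eviction_rate(ready, not_ready, cfg=EvictionConfig(), all_zones_down=False):
    """Return the eviction rate (nodes per second) for one zone."""
    state = zone_state(ready, not_ready, cfg)
    if state == "partial_disruption":
        # Large zones keep evicting slowly; small zones stop entirely.
        if ready + not_ready > cfg.large_cluster_size_threshold:
            return cfg.secondary_node_eviction_rate
        return 0.0
    if state == "full_disruption" and all_zones_down:
        # Every zone unreachable: assume a control-plane connectivity
        # problem rather than real node failures, and do not evict.
        return 0.0
    return cfg.node_eviction_rate


if __name__ == "__main__":
    print(eviction_rate(ready=90, not_ready=10))
    print(eviction_rate(ready=40, not_ready=60))
```

Raising the threshold makes the controller tolerate more not-ready nodes before it slows down; lowering it makes the protective mode kick in earlier.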
Infrastructure Containerization – Kafka
Kafka was containerized using HostPath storage and a custom LocalPV resource with a disk‑aware scheduler. An API creates Kafka brokers, the scheduler selects appropriate nodes/disks, writes LocalPVPod to etcd, and monitors pod status for fault handling.
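The node and disk selection step might look roughly like the following sketch. The source does not describe Zhihu's scheduling policy, so everything here is an assumption: the Disk type, the one-broker-per-disk rule, the spread across nodes, and the best-fit choice are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class Disk:
    node: str          # host that owns the disk
    path: str          # mount point exposed to the broker via HostPath
    free_gb: int
    in_use: bool = False


def pick_disk(disks: List[Disk], required_gb: int,
              used_nodes: Iterable[str] = ()) -> Optional[Disk]:
    """Choose a disk for a new broker, or return None if nothing fits.

    Assumed policy: one broker per disk, avoid nodes that already host a
    broker of the same cluster, and prefer the smallest disk that is large
    enough (best fit) so big disks stay available for big requests.
    """
    excluded = set(used_nodes)
    candidates = [
        d for d in disks
        if not d.in_use and d.free_gb >= required_gb and d.node not in excluded
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda d: (d.free_gb, d.node, d.path))
    best.in_use = True  # reserve the disk for this broker
    return best


if __name__ == "__main__":
    pool = [
        Disk("node-a", "/data1", 2000),
        Disk("node-a", "/data2", 500),
        Disk("node-b", "/data1", 1000),
    ]
    first = pick_disk(pool, required_gb=800)
    second = pick_disk(pool, required_gb=800, used_nodes={first.node})
    print(first, second)
```

In a real scheduler the reservation would be persisted (the article mentions writing to etcd) so that a restart does not hand the same disk to two brokers.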
Future Outlook (Zhihu)
Further infrastructure containerization and server utilization optimization are planned.
58 Cloud Platform Practice
58's cloud platform project started in early 2017 to address low resource utilization, slow scaling, and inconsistent release processes. Built on containers and Kubernetes, the platform now serves over 2,000 services, runs on 430 physical machines, and hosts tens of thousands of containers.
Network Architecture
The platform adopts a "bridge+VLAN" model with a custom IP controller that assigns fixed IPs to services, and integrates with Tencent data‑center networking for full‑mesh container routing.
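The core idea of a fixed-IP controller is that the same service instance always gets the same address back, even after its container is recreated. The sketch below shows that idea in memory only; the class name, the instance key format, and the address range are hypothetical, and the source does not describe how 58's controller stores its assignments.

```python
import ipaddress


class FixedIPAllocator:
    """Hand out stable addresses from one VLAN subnet, keyed by instance name."""

    def __init__(self, cidr, reserved=()):
        reserved_set = set(reserved)
        network = ipaddress.ip_network(cidr)
        # hosts() skips the network and broadcast addresses.
        self._free = [str(ip) for ip in network.hosts()
                      if str(ip) not in reserved_set]
        self._by_instance = {}

    def allocate(self, instance):
        """Return the instance's IP, assigning a new one only on first use."""
        if instance in self._by_instance:
            return self._by_instance[instance]
        if not self._free:
            raise RuntimeError("address pool exhausted")
        ip = self._free.pop(0)
        self._by_instance[instance] = ip
        return ip

    def release(self, instance):
        """Return the IP to the pool when the instance is removed for good."""
        ip = self._by_instance.pop(instance, None)
        if ip is not None:
            self._free.append(ip)


if __name__ == "__main__":
    ipam = FixedIPAllocator("10.0.0.0/29", reserved=["10.0.0.1"])
    print(ipam.allocate("order-service-0"))
    print(ipam.allocate("order-service-0"))  # same address after a restart
```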
Network Rate Limiting
Implemented bidirectional traffic shaping by applying tc limits to both ends of the veth pair, enabling dynamic, per‑second, and elastic bandwidth control.
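Because tc shapes traffic leaving a device, limiting both directions needs one rule on each end of the veth pair: the host-side end throttles traffic flowing into the container, and the container-side end throttles traffic flowing out. The sketch below only builds the commands as strings. It assumes a token bucket filter (tbf) qdisc, which the source does not confirm, and the device name, PID, and burst and latency values are hypothetical.

```python
def shaping_commands(host_veth, container_pid, rate_mbit,
                     container_dev="eth0", burst_kb=256, latency_ms=50):
    """Return tc commands that cap a container's bandwidth in both directions.

    "qdisc replace" creates the rule or overwrites an existing one, so the
    same commands can be reissued to change the limit at runtime.
    """
    if rate_mbit <= 0:
        raise ValueError("rate_mbit must be positive")
    tbf = f"tbf rate {rate_mbit}mbit burst {burst_kb}kb latency {latency_ms}ms"
    return [
        # Host end of the pair: egress here is the container's ingress.
        f"tc qdisc replace dev {host_veth} root {tbf}",
        # Container end, reached through the container's network namespace.
        f"nsenter -t {container_pid} -n "
        f"tc qdisc replace dev {container_dev} root {tbf}",
    ]


if __name__ == "__main__":
    for cmd in shaping_commands("veth1a2b3c", container_pid=4321, rate_mbit=100):
        print(cmd)
```

Running the commands requires root on the host; the function deliberately stops at string generation so it can be reviewed or logged before execution.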
Service Discovery
The platform uses Consul for decoupled service registration and HAProxy for load balancing. A proxy layer watches Kubernetes events to keep IP mappings up to date, so a service written in any language can join without code changes.
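A proxy layer of this kind essentially translates pod watch events into Consul registrations. The sketch below shows that translation as a pure function that returns the Consul agent HTTP call to make, without sending it. The event dictionaries follow the shape of the Kubernetes watch API, and the paths are Consul's agent service endpoints; the "app" label convention and the default port are assumptions, not details from the talk.

```python
def consul_request_for(event, default_port=8080):
    """Map one Kubernetes pod watch event to a Consul agent API request.

    Returns (method, path, json_body), or None when there is nothing to do.
    """
    pod = event["object"]
    service_id = pod["metadata"]["name"]

    if event["type"] == "DELETED":
        return ("PUT", f"/v1/agent/service/deregister/{service_id}", None)

    pod_ip = pod.get("status", {}).get("podIP")
    if not pod_ip:
        return None  # pod not scheduled or not yet assigned an address

    labels = pod["metadata"].get("labels", {})
    body = {
        "ID": service_id,                          # one Consul entry per pod
        "Name": labels.get("app", service_id),     # assumed label convention
        "Address": pod_ip,
        "Port": default_port,                      # hypothetical fixed port
    }
    # Re-registering an existing ID updates it, which covers MODIFIED events.
    return ("PUT", "/v1/agent/service/register", body)


if __name__ == "__main__":
    added = {
        "type": "ADDED",
        "object": {
            "metadata": {"name": "order-7f9c", "labels": {"app": "order"}},
            "status": {"podIP": "10.0.0.2"},
        },
    }
    print(consul_request_for(added))
```

Because registration happens outside the service process, the service itself needs no Consul client library, which is what lets services in any language join unchanged.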
Case Study – Load Isolation
Introduced container‑level thread caps and host‑level overload protection to prevent excessive host load from affecting other services.
Case Study – Swap‑Induced Latency
Disabled swap partitions to eliminate random latency spikes in early cloud migration stages.
Future Outlook (58)
Plans include intelligent scheduling, stateful service support, and deeper hybrid‑cloud integration.
Conclusion
The salon facilitated deep technical exchange between Zhihu and 58, revealing common challenges in cloud‑native transformation and sharing targeted solutions that advance large‑scale container cloud adoption.
58 Tech
Official tech channel of 58, a platform for tech innovation, sharing, and communication.
