Design and Evolution of Baidu Comment Middleware: High‑Availability, High‑Performance Distributed Service Architecture
This article details the architecture, service design, performance optimizations, sorting mechanisms, and stability strategies of Baidu's comment middle‑platform, illustrating how it evolved to support billions of daily requests with high availability, low latency, and scalable distributed services.
Baidu's comment middle‑platform provides a unified, high‑throughput comment capability for Baidu products, handling daily traffic at the hundred‑billion scale while ensuring stable, performant service.
Comments serve as essential user‑generated content that drives interaction; building such a system requires supporting diverse product needs and maintaining reliable operation.
The platform defines a comment lifecycle model that covers topics, comment data, and metadata such as author, hierarchy, visibility, likes, replies, and timestamps, which enables flexible data retrieval and analysis.
The comment system began as a product‑specific service and was later refactored into a centralized middle‑platform. A gateway layer handles access control, traffic coloring, flow control, and fault detection. Interfaces are designed for low coupling and high cohesion, and the platform targets a 99.99% SLA through isolated zones and N+1 capacity planning.
To improve performance, list‑type interfaces were redesigned using a three‑step DAG‑based dependency scheduler: defining atomic service nodes, constructing a directed acyclic graph to represent dependencies, and executing nodes in parallel via topological sorting, resulting in reduced latency and better maintainability.
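The three steps can be sketched in Python as follows. This is a minimal illustration, not Baidu's implementation: the node names (`fetch_ids`, `fetch_authors`, `fetch_likes`, `assemble`), the `run_dag` helper, and the sample data are all hypothetical, and the standard library's `graphlib.TopologicalSorter` (Python 3.9+) stands in for whatever scheduler the platform actually uses.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter

# Step 1: atomic service nodes. Each takes the results of its dependencies.
# All node names and return values here are hypothetical.
def fetch_ids(inputs):
    return [101, 102]

def fetch_authors(inputs):
    return {cid: f"user{cid}" for cid in inputs["fetch_ids"]}

def fetch_likes(inputs):
    return {cid: cid % 10 for cid in inputs["fetch_ids"]}

def assemble(inputs):
    return [
        {"id": cid,
         "author": inputs["fetch_authors"][cid],
         "likes": inputs["fetch_likes"][cid]}
        for cid in inputs["fetch_ids"]
    ]

NODES = {"fetch_ids": fetch_ids, "fetch_authors": fetch_authors,
         "fetch_likes": fetch_likes, "assemble": assemble}

# Step 2: the DAG, written as node -> set of nodes it depends on.
DEPS = {
    "fetch_authors": {"fetch_ids"},
    "fetch_likes": {"fetch_ids"},
    "assemble": {"fetch_ids", "fetch_authors", "fetch_likes"},
}

# Step 3: run every node as soon as all of its dependencies have finished.
def run_dag(nodes, deps, max_workers=4):
    sorter = TopologicalSorter({name: deps.get(name, ()) for name in nodes})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results, pending = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            for name in sorter.get_ready():
                inputs = {dep: results[dep] for dep in deps.get(name, ())}
                pending[pool.submit(nodes[name], inputs)] = name
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                name = pending.pop(future)
                results[name] = future.result()
                sorter.done(name)  # unblocks nodes that depended on `name`
    return results
```

In this sketch `fetch_authors` and `fetch_likes` become ready at the same time once `fetch_ids` completes, so they run concurrently, and `assemble` waits for both. Adding a new data source means adding one node and its edges rather than editing a hand-written call sequence.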
Comment ranking combines offline coarse‑ranking and online fine‑ranking: scores are computed offline using configurable formulas and stored in Redis ZSETs, while online personalization adjusts rankings per user, supporting multiple strategy groups for A/B testing.
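The two-stage ranking might look roughly like the sketch below. Everything specific in it is an assumption made for illustration: the linear scoring formula, the weights, the strategy names `A` and `B`, the followed-author boost, and the hash-based group assignment are not taken from the source. A plain dictionary stands in for each Redis ZSET so the example runs without a server; with redis-py the equivalent calls would be `zadd` to write scores and `zrevrange(..., withscores=True)` to read the top entries.

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Weights:
    like: float
    reply: float
    age_penalty: float

# Hypothetical strategy groups for A/B testing; each has its own formula weights.
STRATEGIES = {
    "A": Weights(like=1.0, reply=2.0, age_penalty=0.1),
    "B": Weights(like=1.0, reply=0.5, age_penalty=0.5),
}

def strategy_for(user_id):
    """Stable assignment of a user to a strategy group (assumed scheme)."""
    names = sorted(STRATEGIES)
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return names[digest[0] % len(names)]

def coarse_score(comment, w):
    """Offline coarse-ranking formula (illustrative, not the real one)."""
    return (w.like * comment["likes"]
            + w.reply * comment["replies"]
            - w.age_penalty * comment["age_hours"])

def build_offline_index(comments):
    """One score table per strategy; each dict stands in for a Redis ZSET."""
    return {
        name: {c["id"]: coarse_score(c, w) for c in comments}
        for name, w in STRATEGIES.items()
    }

def top_n(zset, n):
    """Highest scores first, like ZREVRANGE key 0 n-1 WITHSCORES."""
    return sorted(zset.items(), key=lambda kv: kv[1], reverse=True)[:n]

def fine_rank(candidates, authors, followed, boost=5.0):
    """Online fine-ranking: boost comments written by authors the user follows."""
    rescored = [
        (cid, score + (boost if authors[cid] in followed else 0.0))
        for cid, score in candidates
    ]
    return [cid for cid, _ in sorted(rescored, key=lambda kv: kv[1], reverse=True)]

COMMENTS = [
    {"id": "c1", "likes": 10, "replies": 1, "age_hours": 2},
    {"id": "c2", "likes": 4, "replies": 5, "age_hours": 1},
    {"id": "c3", "likes": 8, "replies": 0, "age_hours": 10},
]
AUTHORS = {"c1": "u1", "c2": "u2", "c3": "u9"}
```

The expensive scoring over all comments happens ahead of time, so a request only reads a short candidate list and applies a cheap per-user adjustment. Keeping one score table per strategy lets different groups of users see different orderings from the same comment set.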
Stability rests on three mechanisms: multi‑level caching (a local LRU cache, Redis, and an empty cache for lookups that return no data), hotspot detection that automatically identifies hotspots and rebuilds their cache entries, and disaster‑recovery measures including active and passive degradation, instance locking, and chaos‑engineering practices. Together these maintain >99.995% availability under peak loads.
Today the platform serves over 20 Baidu products, peaks at 400k QPS, processes daily PV in the billions, and continues to evolve in service innovation, middle‑platform construction, and reliability engineering.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.
