Design and Evolution of Baidu Comment Middleware: High‑Availability, High‑Performance Distributed Service Architecture

This article details the architecture, service design, performance optimizations, sorting mechanisms, and stability strategies of Baidu's comment middle‑platform, illustrating how it evolved to support billions of daily requests with high availability, low latency, and scalable distributed services.

Top Architect

Baidu's comment middle‑platform provides a unified, high‑throughput comment capability for Baidu products, handling billions of requests per day while keeping the service stable and performant.

Comments are essential user‑generated content that drives interaction. A comment system must therefore support the differing needs of many products while operating reliably.

The platform defines a comment lifecycle model that covers topics, comment data, and metadata such as author, hierarchy, visibility, likes, replies, and timestamps, which enables flexible data retrieval and analysis.
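As a rough illustration of such a model, the sketch below represents a comment as a Python dataclass. The field names (`comment_id`, `topic_id`, `parent_id`, and so on) and the `visible_replies` helper are hypothetical and chosen only to mirror the metadata listed above; the platform's actual schema is not described in the source.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Comment:
    # All field names are illustrative assumptions, not Baidu's schema.
    comment_id: int
    topic_id: str                    # the topic (e.g. an article) being commented on
    author_id: str
    content: str
    created_at: float                # Unix timestamp
    parent_id: Optional[int] = None  # None marks a top-level comment
    visible: bool = True
    like_count: int = 0
    reply_count: int = 0


def visible_replies(comments: List[Comment], parent_id: Optional[int]) -> List[Comment]:
    """Return the visible direct children of parent_id, oldest first."""
    children = [c for c in comments if c.parent_id == parent_id and c.visible]
    return sorted(children, key=lambda c: c.created_at)
```

Calling `visible_replies(comments, None)` would yield the visible top‑level comments of a topic, while passing a comment ID yields its direct replies.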

The comment system began as a product‑specific service and was later refactored into a centralized middle‑platform. A gateway layer handles access control, traffic coloring (tagging requests so they can be identified downstream), flow control, and fault detection. Interfaces follow low‑coupling, high‑cohesion principles, and the platform targets a 99.99% SLA through isolated zones and N+1 capacity planning.
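The source does not say which algorithm the gateway uses for flow control. One common choice is a token bucket, sketched below under that assumption. The class name `TokenBucket` and its parameters are hypothetical, and the clock is passed in explicitly so the behavior is easy to reason about.

```python
class TokenBucket:
    """Token-bucket rate limiter (an assumed technique, not confirmed by the source)."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity    # start with a full bucket
        self.last = now           # time of the last refill

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` may pass."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A gateway could keep one bucket per caller or per interface, so that a burst from one product does not consume capacity reserved for others.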

To improve performance, list‑type interfaces were redesigned around a DAG‑based dependency scheduler in three steps: define atomic service nodes, construct a directed acyclic graph of their dependencies, and execute the nodes in topological order, running independent nodes in parallel. The result was lower latency and better maintainability.
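The three steps can be sketched in Python with the standard library's `graphlib.TopologicalSorter` (Python 3.9+) and a thread pool. This is a minimal illustration, not the platform's scheduler: the `run_dag` function and the node names in the usage example are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(nodes, deps, max_workers=4):
    """Run every node once all of its prerequisites have finished.

    nodes: name -> callable taking the dict of results produced so far
    deps:  name -> set of prerequisite node names
    """
    sorter = TopologicalSorter({name: deps.get(name, ()) for name in nodes})
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}  # future -> node name
        while sorter.is_active():
            # Submit every node whose prerequisites are all done.
            for name in sorter.get_ready():
                running[pool.submit(nodes[name], results)] = name
            # Block until at least one running node finishes.
            done, _ = wait(running, return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                results[name] = future.result()
                sorter.done(name)  # may unlock dependent nodes
    return results


# Usage: "authors" and "likes" both depend on "comments" and may run in parallel.
nodes = {
    "comments": lambda r: [1, 2],
    "authors": lambda r: {c: f"user{c}" for c in r["comments"]},
    "likes": lambda r: {c: c * 10 for c in r["comments"]},
    "assemble": lambda r: [(c, r["authors"][c], r["likes"][c]) for c in r["comments"]],
}
deps = {"authors": {"comments"}, "likes": {"comments"}, "assemble": {"authors", "likes"}}
page = run_dag(nodes, deps)["assemble"]
```

Because each node declares only its own prerequisites, adding a new data source means adding one node and one entry in `deps`, which is one plausible reason such a design improves maintainability.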

Comment ranking combines offline coarse‑ranking and online fine‑ranking: scores are computed offline using configurable formulas and stored in Redis ZSETs, while online personalization adjusts rankings per user, supporting multiple strategy groups for A/B testing.
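The two‑stage idea can be illustrated with a small in‑memory sketch. The strategy weights, the follow‑based boost, and all function names below are hypothetical. In a real deployment the coarse scores would live in a Redis sorted set (written with `ZADD` and read with `ZREVRANGE`) rather than a Python dict.

```python
import hashlib

# Hypothetical strategy groups: each maps a signal name to its weight.
STRATEGIES = {
    "A": {"likes": 1.0, "replies": 2.0},
    "B": {"likes": 0.5, "replies": 3.0},
}


def coarse_score(stats, weights):
    """Offline step: weighted sum of a comment's interaction counts."""
    return sum(w * stats.get(signal, 0) for signal, w in weights.items())


def strategy_for(user_id):
    """Assign a user to a strategy group deterministically (A/B bucketing)."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    names = sorted(STRATEGIES)
    return names[digest[0] % len(names)]


def fine_rank(coarse, authors, followed, boost=5.0, top_n=10):
    """Online step: boost comments by authors the user follows, then sort.

    coarse:   comment_id -> offline score
    authors:  comment_id -> author_id
    followed: set of author_ids the requesting user follows
    """
    adjusted = {
        cid: score + (boost if authors[cid] in followed else 0.0)
        for cid, score in coarse.items()
    }
    # Highest score first; ties are broken by comment id for stable output.
    return sorted(adjusted, key=lambda cid: (-adjusted[cid], cid))[:top_n]
```

Keeping the formula weights in configuration, as `STRATEGIES` does here, lets a new strategy group be introduced for an experiment without changing the ranking code.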

Stability rests on three mechanisms. Multi‑level caching combines a local LRU cache, Redis, and caching of empty results. Hotspot detection identifies hot keys automatically and rebuilds their cache entries. Disaster recovery includes active and passive degradation, instance locking, and chaos‑engineering practices. Together these maintain >99.995% availability under peak loads.
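The lookup order of a multi‑level cache with empty‑result caching can be sketched as follows. This is a simplified illustration under stated assumptions: a plain dict stands in for Redis, expiry (TTL) is omitted, and the class names `LocalLRU` and `MultiLevelCache` are hypothetical.

```python
from collections import OrderedDict

_MISS = object()   # marks "key not present in this cache level"
_EMPTY = object()  # cached for keys known to have no data in the backing store


class LocalLRU:
    """In-process LRU cache built on OrderedDict."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return _MISS
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry


class MultiLevelCache:
    def __init__(self, loader, local_capacity=1024):
        self.local = LocalLRU(local_capacity)
        self.remote = {}      # stand-in for Redis (assumption for this sketch)
        self.loader = loader  # falls through to the database; returns None if absent

    def get(self, key):
        value = self.local.get(key)
        if value is _MISS:
            value = self.remote.get(key, _MISS)
            if value is _MISS:
                loaded = self.loader(key)
                # Cache the absence too, so repeated lookups of a missing
                # key do not reach the backing store every time.
                value = _EMPTY if loaded is None else loaded
                self.remote[key] = value
            self.local.put(key, value)
        return None if value is _EMPTY else value
```

Caching the `_EMPTY` sentinel is what protects the backing store from repeated requests for keys that do not exist; a production version would give such entries a short expiry so that newly created data becomes visible.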

Today the platform serves over 20 Baidu products, peaks at 400k QPS, processes daily PV in the billions, and continues to evolve in service innovation, middle‑platform construction, and reliability engineering.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
