Design and Evolution of Baidu Comment Platform: Architecture, Performance Optimization, and Stability
This article details the architecture, design principles, performance enhancements, and reliability strategies of Baidu's comment middle‑platform, which serves over a hundred billion requests per day across more than 20 products while ensuring high availability, low latency, and continuous iterative development.
Baidu Comment Platform provides a unified, high‑throughput comment capability for Baidu’s ecosystem, handling daily traffic at the hundred‑billion level and serving more than 20 products.
Background: Comments are a core user‑generated content channel that drives interaction and product stickiness; the platform must support diverse product forms and rapid feature iteration while maintaining stable performance.
Concept Introduction: The comment lifecycle includes production, moderation, storage, and retrieval. Data is sharded across distributed databases, with vertical and horizontal partitioning to meet read‑write scalability, and later aggregated into a big‑data warehouse for analytics.
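To make the horizontal partitioning concrete, here is a minimal Go sketch of hash-based shard routing. The function name, ID format, and shard count are illustrative assumptions, not details from the article; the key idea is that all comments under one resource hash to the same shard, so a list query stays single-shard.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardFor routes a comment to one of numShards horizontal shards by
// hashing the resource ID it belongs to. Keeping one resource's
// comments on a single shard lets list reads avoid cross-shard fans.
func shardFor(resourceID string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(resourceID))
	return h.Sum32() % numShards
}

func main() {
	for _, id := range []string{"video:1001", "news:2002"} {
		fmt.Printf("%s -> shard %d\n", id, shardFor(id, 16))
	}
}
```

Vertical partitioning (splitting comment bodies, counters, and relations into separate tables or stores) would sit alongside this routing layer.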
Service Design: Originally built for Baidu App, the service was re‑positioned as a middle‑platform. A unified gateway layer offers traffic isolation, flow coloring, rate limiting, and fault detection. Interfaces are abstracted for low coupling and high cohesion, and a plug‑in framework supports access control, traffic billing, and after‑sales support.
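One building block a unified gateway needs for per-tenant rate limiting is a token bucket. The sketch below is an assumption about how such a limiter could look in Go using only the standard library; the rate and burst values are illustrative, not the platform's configuration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// TokenBucket is a minimal per-tenant rate limiter of the kind a
// gateway might consult before admitting a product's traffic.
type TokenBucket struct {
	mu       sync.Mutex
	tokens   float64
	capacity float64
	rate     float64 // tokens refilled per second
	last     time.Time
}

func NewTokenBucket(rate, capacity float64) *TokenBucket {
	return &TokenBucket{tokens: capacity, capacity: capacity, rate: rate, last: time.Now()}
}

// Allow refills the bucket by elapsed time, then admits the request
// if at least one whole token is available.
func (b *TokenBucket) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > b.capacity {
		b.tokens = b.capacity
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

func main() {
	limiter := NewTokenBucket(5, 2) // 5 req/s with a burst of 2, per tenant
	for i := 0; i < 4; i++ {
		fmt.Println(limiter.Allow())
	}
}
```

In practice a production gateway would key one bucket per product line, which is also where the article's traffic isolation and flow coloring would hook in.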
4.1 Improving Performance and Iteration Efficiency: The list‑type API was refactored by separating responsibilities, parallelizing downstream calls, and introducing a DAG‑based scheduler that executes independent nodes concurrently, dramatically reducing 99th‑percentile latency.
4.2 Building High‑Performance, Low‑Latency Ranking: An offline scoring service computes comment scores using configurable formulas and stores the results in Redis ZSET indexes. Online fine‑tuning layers personalized boosts on top, and a fan‑out model written in Go performs full re‑ranking, enabling rapid A/B testing without impacting service latency.
4.3 Ensuring Continuous Stability: Multi‑level caching (local LRU cache, Redis, and empty‑value caching) mitigates cache penetration and breakdown. Hot‑spot detection automatically promotes heavily commented resources to a dedicated pipeline, while disaster‑recovery and degradation mechanisms (active/passive triggers, circuit breakers, chaos engineering) protect the system during traffic spikes or downstream failures.
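The empty-value caching idea is worth seeing in code: cache the miss itself, so repeated lookups of a non-existent resource never reach storage again. This is a self-contained sketch (the map stands in for a real LRU, and the store function stands in for the Redis and database layers); it is not the platform's actual implementation.

```go
package main

import "fmt"

// entry distinguishes a cached value from a cached absence; storing
// the miss is what stops cache penetration on hot invalid IDs.
type entry struct {
	val   string
	empty bool
}

type Cache struct {
	local  map[string]entry            // L1 in-process cache (LRU eviction omitted for brevity)
	store  func(string) (string, bool) // stands in for the Redis + DB layers
	dbHits int
}

// Get serves from the local cache when possible, including cached
// misses, and records the absence on a backing-store miss.
func (c *Cache) Get(key string) (string, bool) {
	if e, ok := c.local[key]; ok {
		return e.val, !e.empty
	}
	c.dbHits++
	v, ok := c.store(key)
	c.local[key] = entry{val: v, empty: !ok} // cache the absence too
	return v, ok
}

func main() {
	c := &Cache{
		local: map[string]entry{},
		store: func(k string) (string, bool) {
			if k == "res:1" {
				return "comments for res:1", true
			}
			return "", false // resource does not exist
		},
	}
	for i := 0; i < 3; i++ {
		c.Get("res:missing") // only the first call reaches storage
	}
	fmt.Println("backing-store hits:", c.dbHits)
}
```

A production version would also give empty entries a short TTL so a resource created later becomes visible, which pairs naturally with the hot-spot pipeline described above.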
Conclusion: The platform now delivers peak QPS >400k, daily PV >100 billion, and SLA >99.995%. Ongoing work focuses on further innovation, middle‑platform expansion, and stability improvements to sustain a high‑quality community experience.
Architecture Digest