How Baidu Achieved 99.999% Uptime for Its Massive Feed Recommendation System
This article details Baidu's Feed recommendation system architecture, explaining how a combination of dynamic retry scheduling, real‑time stop‑loss mechanisms, multi‑recall frameworks, ranking layer fallbacks, and IDC‑level multi‑master designs collectively ensure five‑nine availability across billions of daily requests.
Background
Baidu Feed powers the information‑flow recommendation for most of its products (Handbook, Haokan, Quanmin, Tieba, etc.), handling tens of billions of requests daily. The service relies on hundreds of micro‑services and tens of thousands of machines, making high availability a core architectural goal.
Overall Design
To meet a constant 99.999% availability target, Baidu built a flexible, multi‑level fault‑handling architecture that can address everything from single‑instance timeouts to IDC‑wide outages.
Instance‑Level Fault Solutions
Dynamic Retry Scheduling
Retry mechanisms pose two main challenges: choosing the retry timeout and avoiding cascading failures. Baidu implements a dynamic retry scheduler that caps retry traffic at a configurable proportion (e.g., 3%) and uses real‑time latency percentiles to decide which requests should be retried, eliminating the need for static timeout tuning.
A retry timeout that is too short fires many unnecessary retries, which wastes resources and can trigger downstream avalanches.
A retry timeout that is too long increases overall latency and may cause timeout inversion, where the time spent downstream exceeds the timeout of the upstream caller.
The dynamic scheduler balances these trade‑offs by adapting to current latency distributions.
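The idea above can be sketched in a few lines. This is a minimal illustration, not Baidu's implementation: the class name `DynamicRetryScheduler`, its parameters, and the sliding-window percentile calculation are all assumptions made for the example. It derives the retry threshold from recent latencies, so that only roughly the slowest `retry_ratio` of requests qualify, and enforces a hard budget on top.

```python
from collections import deque


class DynamicRetryScheduler:
    """Hypothetical sketch: retry only the slowest requests, within a budget."""

    def __init__(self, retry_ratio=0.03, window=1000, default_threshold_ms=100.0):
        self.retry_ratio = retry_ratio
        self.default_threshold_ms = default_threshold_ms
        self.latencies = deque(maxlen=window)  # sliding window of recent latencies
        self.requests = 0
        self.retries = 0

    def record(self, latency_ms):
        """Record one completed request and its latency."""
        self.requests += 1
        self.latencies.append(latency_ms)

    def threshold_ms(self):
        """Latency above which a request belongs to the slowest retry_ratio share."""
        if not self.latencies:
            return self.default_threshold_ms
        ordered = sorted(self.latencies)
        idx = min(len(ordered) - 1, int(len(ordered) * (1.0 - self.retry_ratio)))
        return ordered[idx]

    def should_retry(self, elapsed_ms):
        """Return True if a request that has waited elapsed_ms may be retried."""
        if elapsed_ms < self.threshold_ms():
            return False
        # Hard cap: retries never exceed retry_ratio of the observed requests.
        if self.retries + 1 > self.retry_ratio * self.requests:
            return False
        self.retries += 1
        return True
```

Because the threshold follows the observed latency distribution, it rises when the downstream slows down as a whole. The budget check keeps retry traffic bounded even then, which is what prevents retries from amplifying an overload.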
Real‑time Stop‑Loss for Single Instances
Beyond retries, Baidu adds a real‑time stop‑loss layer that detects unhealthy instances via availability and latency feedback. Unhealthy instances have their traffic weight reduced instantly, while healthy instances receive smooth weight adjustments based on load, ensuring rapid convergence within seconds.
Weight‑based isolation reduces the impact of failing instances.
Latency‑based smoothing prevents over‑penalizing instances during transient spikes.
Integration with the internal BRPC framework enables fast collection of per‑instance metrics and centralized control.
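A rough sketch of the two weight rules described above, under stated assumptions: the class `InstanceWeights`, the success-rate cutoff, and the exponential smoothing factor are hypothetical choices for illustration, and the metric collection that BRPC provides is replaced by a plain dictionary.

```python
class InstanceWeights:
    """Hypothetical sketch: instant cut for unhealthy instances, smooth tuning otherwise."""

    def __init__(self, base=100.0, min_weight=1.0, min_success=0.9, smoothing=0.2):
        self.base = base
        self.min_weight = min_weight
        self.min_success = min_success  # below this success rate an instance is unhealthy
        self.smoothing = smoothing      # fraction of the gap closed per update
        self.weights = {}

    def update(self, stats):
        """stats maps instance name -> (success_rate, latency_ms)."""
        healthy = [lat for ok, lat in stats.values() if ok >= self.min_success]
        avg_latency = sum(healthy) / len(healthy) if healthy else 0.0
        for name, (success, latency) in stats.items():
            current = self.weights.get(name, self.base)
            if success < self.min_success:
                # Stop-loss: drop the weight immediately instead of converging slowly.
                self.weights[name] = self.min_weight
                continue
            # Faster-than-average instances get a higher target weight.
            target = self.base * avg_latency / latency if latency > 0 else self.base
            target = max(self.min_weight, min(2 * self.base, target))
            # Move only part of the way, so a transient spike has limited effect.
            self.weights[name] = current + self.smoothing * (target - current)
        return dict(self.weights)
```

The asymmetry is deliberate in this sketch: failures are penalized in a single step, while latency differences only shift the weight gradually, so a recovered instance regains traffic over several update rounds.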
Service‑Level Fault Solutions
Multi‑Recall Scheduling Framework
Recall is divided into three levels: first‑level (critical, no discard), second‑level (grouped by resource type, partial discard allowed), and third‑level (optional, can be discarded). A "drop‑layer" discards unresponsive recall paths, while a cache‑based compensation mechanism reuses previous results to minimize loss.
Recall level classification controls which calls may be dropped.
Drop‑layer stops waiting for unresponsive paths.
Cache‑backed compensation reduces the impact of discarded recalls.
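The drop‑layer and cache compensation can be illustrated as follows. This is a simplified sketch under assumptions: the function `run_recalls`, the level constants, and the soft deadline are hypothetical, and second‑level and third‑level paths are both treated as droppable without modeling the per‑resource‑type grouping.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

CRITICAL, GROUPED, OPTIONAL = 1, 2, 3  # hypothetical names for the three levels


def run_recalls(paths, cache, soft_deadline_s=0.05):
    """paths maps name -> (level, zero-argument callable returning a list).

    Returns (results, dropped): results per recall path, and the names of
    paths that were discarded and served from cache where possible.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(paths)))
    try:
        futures = {pool.submit(fn): (name, level) for name, (level, fn) in paths.items()}
        done, _pending = wait(list(futures), timeout=soft_deadline_s)
        results, dropped = {}, []
        for fut, (name, level) in futures.items():
            if level == CRITICAL:
                # Critical paths are never discarded: block until they finish.
                results[name] = fut.result()
                cache[name] = results[name]
            elif fut in done and fut.exception() is None:
                results[name] = fut.result()
                cache[name] = results[name]  # refresh the compensation cache
            else:
                # Drop-layer: stop waiting, reuse the previous result if any.
                dropped.append(name)
                if name in cache:
                    results[name] = cache[name]
        return results, dropped
    finally:
        pool.shutdown(wait=False)  # do not block on abandoned recall calls
```

Called with a fast critical path and a slow optional one, the function returns the fresh critical results plus the cached optional results, and reports the optional path in `dropped`.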
Ranking Layer Fault Handling
The ranking service sits after recall and uses a two‑stage funnel: coarse ranking followed by fine ranking. When either stage fails at large scale, a stable proxy router in front of the ranking services switches to an emergency sorter: offline‑derived point‑wise scores replace coarse ranking, and the coarse‑ranking model replaces fine ranking.
A stable intermediate proxy router provides quick failover.
Coarse‑ranking fallback uses cached point‑wise scores.
Fine‑ranking fallback directly switches to the coarse‑ranking model.
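The two fallbacks can be sketched as a single function that plays the role of the proxy router. The name `rank_with_fallback`, its parameters, and the use of exceptions to signal a failed ranking stage are assumptions for illustration only.

```python
def rank_with_fallback(candidates, coarse_rank, fine_rank, offline_scores, coarse_k=100):
    """Hypothetical sketch of a two-stage ranking funnel with emergency sorters.

    coarse_rank and fine_rank take a list of item ids and return it reordered;
    offline_scores maps item id -> precomputed point-wise score.
    """
    try:
        shortlist = coarse_rank(candidates)[:coarse_k]
    except Exception:
        # Coarse ranking is down: order by cached offline point-wise scores.
        shortlist = sorted(
            candidates, key=lambda item: offline_scores.get(item, 0.0), reverse=True
        )[:coarse_k]
    try:
        return fine_rank(shortlist)
    except Exception:
        # Fine ranking is down: the coarse order becomes the final order.
        return shortlist
```

In this sketch the fallbacks compose: if both stages fail, the result is simply the top `coarse_k` candidates by offline score, so the feed degrades in quality rather than returning nothing.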
IDC‑Level Fault Solutions
For data‑center outages, Baidu adopts an active‑active multi‑master architecture for the delivery‑history storage service. Each IDC maintains a full copy; reads are served locally, while writes are synchronously replicated across IDCs and, on failure, queued for asynchronous replication.
Each IDC holds a complete data replica.
Read requests are confined to the local IDC.
Write requests are synchronously replicated; failures fall back to a message‑queue‑driven async replication.
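The write path can be sketched with in‑memory stand‑ins. Everything here is an assumption for illustration: `Replica` and `MultiMasterStore` are hypothetical classes, a `deque` stands in for the message queue, and conflict resolution between concurrent writers in different IDCs is not modeled.

```python
from collections import deque


class Replica:
    """In-memory stand-in for the storage service in one IDC."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def put(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


class MultiMasterStore:
    """Hypothetical sketch: local reads, sync cross-IDC writes, async fallback."""

    def __init__(self, local, remotes):
        self.local = local
        self.remotes = remotes
        self.pending = deque()  # stands in for a durable message queue

    def read(self, key):
        # Reads never leave the local IDC.
        return self.local.data.get(key)

    def write(self, key, value):
        self.local.put(key, value)
        for remote in self.remotes:
            try:
                remote.put(key, value)  # synchronous replication
            except ConnectionError:
                # Remote IDC is down: queue the write for later replay.
                self.pending.append((remote, key, value))

    def drain(self):
        """Replay queued writes once; entries that still fail are re-queued."""
        for _ in range(len(self.pending)):
            remote, key, value = self.pending.popleft()
            try:
                remote.put(key, value)
            except ConnectionError:
                self.pending.append((remote, key, value))
```

The point of the fallback is that a remote IDC outage does not fail the local write: the local copy stays authoritative for local reads, and the remote replica catches up when `drain` succeeds.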
Conclusion
Through a combination of flexible retry control, real‑time stop‑loss, multi‑recall scheduling, ranking fallbacks, and IDC‑wide multi‑master replication, Baidu's Feed recommendation system consistently achieves five‑nine availability and can gracefully handle both localized and large‑scale failures.