How Baidu Achieved 99.999% Uptime for Its Massive Feed Recommendation System

This article details the architecture of Baidu's Feed recommendation system and explains how dynamic retry scheduling, real‑time stop‑loss mechanisms, a multi‑recall framework, ranking‑layer fallbacks, and an IDC‑level multi‑master design together deliver five‑nines availability across tens of billions of daily requests.

Baidu Geek Talk

Mar 22, 2021

Background

Baidu Feed powers information‑flow recommendation for most of Baidu's products (the Baidu App, also known as Shoubai, plus Haokan, Quanmin, Tieba, and others), handling tens of billions of requests daily. The service relies on hundreds of microservices and tens of thousands of machines, making high availability a core architectural goal.

Overall Design

To meet a constant 99.999% availability target, Baidu built a flexible, multi‑level fault‑handling architecture that can address everything from single‑instance timeouts to IDC‑wide outages.
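To make the target concrete, the following sketch converts an availability figure into an error budget. The function name and the figure of 10 billion requests per day are illustrative assumptions (the article only says "tens of billions"), and the article does not say whether availability is measured by time or by request, so both views are computed.

```python
def error_budget(availability, requests_per_day):
    """Translate an availability target into allowed downtime and failures.

    Both interpretations are shown because the source does not state
    which one Baidu uses (an assumption of this sketch).
    """
    unavailability = 1.0 - availability
    minutes_per_year = 365 * 24 * 60
    return {
        # Time-based view: minutes of total outage tolerated per year.
        "downtime_minutes_per_year": minutes_per_year * unavailability,
        # Request-based view: failed requests tolerated per day.
        "failed_requests_per_day": requests_per_day * unavailability,
    }


# Illustrative input: five nines at an assumed 10 billion requests per day.
budget = error_budget(0.99999, 10_000_000_000)
```

Under these assumptions the budget is about 5.3 minutes of downtime per year, or roughly 100,000 failed requests per day, which is why failures at every level need automated handling rather than manual intervention.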

Instance‑Level Fault Solutions

Dynamic Retry Scheduling

The main challenges with retries are choosing the retry timeout and avoiding cascading failures. Baidu implements a dynamic retry scheduler that caps retry traffic at a configurable proportion (e.g., 3%) and uses real‑time latency percentiles to decide which requests to retry, eliminating the need for static timeout tuning.

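A retry timeout that is too short fires retries for requests that would have completed anyway, wasting resources and potentially triggering a downstream avalanche.

A retry timeout that is too long increases overall latency and may cause timeout inversion, where the downstream time budget exceeds the upstream caller's own timeout.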

The dynamic scheduler balances these trade‑offs by adapting to current latency distributions.
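The idea can be sketched as follows. This is a minimal illustration, not Baidu's implementation: the class, its method names, the sliding‑window size, and the way the percentile is computed are all assumptions. The retry threshold is the latency exceeded by roughly `retry_ratio` of recent requests, and a separate budget check keeps retries from exceeding that ratio of total traffic.

```python
from collections import deque


class DynamicRetryScheduler:
    """Hypothetical sketch: derive the retry threshold from recent latencies."""

    def __init__(self, retry_ratio=0.03, window=1000):
        self.retry_ratio = retry_ratio         # max share of requests retried
        self.latencies = deque(maxlen=window)  # sliding window of latencies (ms)
        self.requests = 0
        self.retries = 0

    def on_request(self):
        """Count every primary request so the retry budget can be enforced."""
        self.requests += 1

    def record(self, latency_ms):
        """Feed the observed latency of a completed request."""
        self.latencies.append(latency_ms)

    def retry_threshold(self):
        """Latency reached by only about `retry_ratio` of recent requests."""
        if not self.latencies:
            return float("inf")  # no data yet: never retry
        ordered = sorted(self.latencies)
        slow_count = max(1, round(len(ordered) * self.retry_ratio))
        return ordered[len(ordered) - slow_count]

    def should_retry(self, elapsed_ms):
        """Retry only slow requests, and only while the budget allows it."""
        if elapsed_ms < self.retry_threshold():
            return False
        if self.retries + 1 > self.retry_ratio * self.requests:
            return False  # budget exhausted: protect the downstream service
        self.retries += 1
        return True
```

Because the threshold follows the observed latency distribution, it rises when the whole downstream slows down, so a general slowdown does not turn into a flood of retries.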

Real‑time Stop‑Loss for Single Instances

Beyond retries, Baidu adds a real‑time stop‑loss layer that detects unhealthy instances via availability and latency feedback. Unhealthy instances have their traffic weight reduced instantly, while healthy instances receive smooth weight adjustments based on load, ensuring rapid convergence within seconds.

Weight‑based isolation reduces the impact of failing instances.

Latency‑based smoothing prevents over‑penalizing instances during transient spikes.

Integration with Baidu's internal bRPC framework enables fast collection of per‑instance metrics and centralized control.
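A rough sketch of this weighting scheme is shown below. Everything in it is hypothetical: the class name, the thresholds, the use of latency relative to the fleet as a proxy for load, and the exponential smoothing are assumptions chosen for illustration, and the code does not use any bRPC API.

```python
import random


class InstanceWeights:
    """Hypothetical per-instance weight table; names and defaults are assumptions."""

    def __init__(self, instances, min_success=0.9, max_weight=100.0,
                 floor=1.0, smoothing=0.2):
        self.min_success = min_success  # below this success rate: unhealthy
        self.max_weight = max_weight
        self.floor = floor              # small probe weight kept for recovery
        self.smoothing = smoothing      # fraction of the gap closed per update
        self.weights = {name: max_weight for name in instances}

    def update(self, name, success_rate, latency_ms, fleet_latency_ms):
        """Apply one round of feedback and return the new weight."""
        if success_rate < self.min_success:
            # Stop-loss: cut the weight in a single step.
            self.weights[name] = self.floor
            return self.floor
        # Healthy: move gradually toward a target that shrinks as the
        # instance gets slower than the fleet (latency as a proxy for load).
        ratio = fleet_latency_ms / max(latency_ms, 1e-9)
        target = max(self.floor, self.max_weight * min(1.0, ratio))
        current = self.weights[name]
        self.weights[name] = current + self.smoothing * (target - current)
        return self.weights[name]

    def pick(self):
        """Choose an instance with probability proportional to its weight."""
        names = list(self.weights)
        return random.choices(names, weights=[self.weights[n] for n in names])[0]
```

The asymmetry is deliberate: an unhealthy instance loses traffic in one step, while a healthy or recovering instance changes weight only by a fraction of the gap per update, so a brief latency spike does not cause large traffic swings.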

Service‑Level Fault Solutions

Multi‑Recall Scheduling Framework

Recall is divided into three levels: first‑level (critical, no discard), second‑level (grouped by resource type, partial discard allowed), and third‑level (optional, can be discarded). A "drop‑layer" discards unresponsive recall paths, while a cache‑based compensation mechanism reuses previous results to minimize loss.

Recall level classification controls which calls may be dropped.

Drop‑layer stops waiting for unresponsive paths.

Cache‑backed compensation reduces the impact of discarded recalls.
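The drop layer and cache compensation can be sketched as below. This is a simplified illustration under stated assumptions: the function name, the level constants, the 50 ms deadline, and the dictionary cache are hypothetical, and second‑level paths are treated like third‑level ones (the per‑resource‑type grouping is omitted).

```python
import time
from concurrent.futures import ThreadPoolExecutor

CRITICAL, GROUPED, OPTIONAL = 1, 2, 3  # recall levels described in the text


def run_recalls(paths, cache, deadline_s=0.05):
    """Run recall paths in parallel and return {name: items}.

    paths maps a name to (level, callable). Critical paths are always
    awaited; other paths are dropped at the deadline and replaced with
    the last cached result, or an empty list if none exists.
    """
    results = {}
    pool = ThreadPoolExecutor(max_workers=max(1, len(paths)))
    start = time.monotonic()
    futures = {name: pool.submit(fn) for name, (_, fn) in paths.items()}
    for name, future in futures.items():
        level = paths[name][0]
        if level == CRITICAL:
            results[name] = future.result()  # no discard; errors propagate
            cache[name] = results[name]
            continue
        remaining = max(0.0, deadline_s - (time.monotonic() - start))
        try:
            results[name] = future.result(timeout=remaining)
            cache[name] = results[name]      # refresh the compensation cache
        except Exception:
            # Timed out or failed: stop waiting and reuse the previous result.
            results[name] = cache.get(name, [])
    pool.shutdown(wait=False)  # do not block on paths that were dropped
    return results
```

A dropped path keeps running in its worker thread in this sketch; a production system would also need to cancel or bound that work.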

Ranking Layer Fault Handling

The ranking service sits after recall and uses a two‑stage funnel: coarse ranking followed by fine ranking. When either stage fails at large scale, traffic is switched through a stable intermediate proxy router. If coarse ranking fails, offline‑derived point‑wise scores serve as the emergency sorter; if fine ranking fails, the coarse‑ranking model takes over.

A stable intermediate proxy router enables quick failover.

Coarse‑ranking fallback uses cached, offline‑derived point‑wise scores.

Fine‑ranking fallback directly switches to the coarse‑ranking model.
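The two fallbacks can be expressed as a short chain. The sketch below is illustrative only: the function and parameter names are hypothetical, the rankers are plain callables standing in for remote services, and the proxy router that performs the switch is not modeled.

```python
def rank_with_fallback(candidates, coarse_rank, fine_rank, offline_scores, top_k=10):
    """Return the top_k items, degrading gracefully when a ranker fails.

    coarse_rank and fine_rank take a list and return it reordered;
    offline_scores maps item -> precomputed point-wise score.
    """
    try:
        shortlist = coarse_rank(candidates)
    except Exception:
        # Coarse ranking is down: order by offline point-wise scores instead.
        shortlist = sorted(
            candidates,
            key=lambda item: offline_scores.get(item, 0.0),
            reverse=True,
        )
    try:
        return fine_rank(shortlist)[:top_k]
    except Exception:
        # Fine ranking is down: the coarse (or offline) order becomes final.
        return shortlist[:top_k]
```

Each fallback trades ranking quality for availability: the user still receives a feed, ordered by the best signal that remains reachable.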

IDC‑Level Fault Solutions

For data‑center outages, Baidu adopts an active‑active multi‑master architecture for the delivery‑history storage service. Each IDC maintains a full copy of the data; reads are served locally, while writes are synchronously replicated to the other IDCs and, if that fails, queued for asynchronous replication.

Each IDC holds a complete data replica.

Read requests are confined to the local IDC.

Write requests are synchronously replicated; failures fall back to a message‑queue‑driven async replication.
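The write path can be sketched as follows. This is a toy model under stated assumptions: the class names are hypothetical, in‑memory dictionaries stand in for the storage service, a `deque` stands in for the message queue, and conflict resolution between concurrent writers in different IDCs is not addressed.

```python
from collections import deque


class IdcStore:
    """Toy stand-in for the storage replica in one IDC (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.online = True

    def put(self, key, value):
        if not self.online:
            raise ConnectionError(f"{self.name} unreachable")
        self.data[key] = value


class MultiMasterWriter:
    """Writes locally, replicates synchronously, and queues on failure."""

    def __init__(self, local, peers):
        self.local = local
        self.peers = peers
        self.pending = deque()  # stand-in for a durable message queue

    def read(self, key):
        return self.local.data.get(key)  # reads never leave the local IDC

    def write(self, key, value):
        self.local.put(key, value)
        for peer in self.peers:
            try:
                peer.put(key, value)  # synchronous cross-IDC replication
            except ConnectionError:
                # Peer IDC is down: fall back to asynchronous replication.
                self.pending.append((peer, key, value))

    def drain(self):
        """Replay queued writes; keep the ones that still fail."""
        for _ in range(len(self.pending)):
            peer, key, value = self.pending.popleft()
            try:
                peer.put(key, value)
            except ConnectionError:
                self.pending.append((peer, key, value))
```

With this shape, losing a peer IDC does not block local reads or writes; the queue absorbs the replication backlog until the peer returns.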

Conclusion

Through a combination of flexible retry control, real‑time stop‑loss, multi‑recall scheduling, ranking fallbacks, and IDC‑level multi‑master replication, Baidu's Feed recommendation system consistently achieves five‑nines availability and gracefully handles both localized and large‑scale failures.
