Building High‑Availability Architecture for Baidu Feed Online Recommendation System
This article describes how Baidu engineered a flexible, multi‑level fault‑tolerant architecture—including dynamic retry scheduling, multi‑recall coordination, ranking layer degradation, and cross‑IDC multi‑master storage—to achieve five‑nines availability for its massive feed recommendation service.
Baidu Feed's information‑flow recommendation system powers most of the company's products, handling billions of requests backed by hundreds of microservices and thousands of machines; ensuring high availability is a core architectural goal.
To meet the five‑nines availability target, the team designed a flexible, multi‑level fault‑handling framework that addresses instance‑level, service‑level, and IDC‑level failures.
Instance‑level solution: A dynamic retry scheduling mechanism controls retry traffic (e.g., limiting to 3% of requests) using real‑time latency quantiles, while a real‑time stop‑loss component adjusts instance weights based on availability and latency feedback, reducing outage impact within seconds.
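The source does not publish Baidu's implementation, but the two instance‑level ideas above (a retry budget capped at a fixed fraction of traffic, and a retry timeout driven by a real‑time latency quantile) can be sketched as follows. All class and method names here are hypothetical illustrations, not Baidu's actual API:

```python
from collections import deque

class RetryScheduler:
    """Hypothetical sketch: cap hedged retries at a fixed budget (e.g. 3% of
    requests) and derive the retry-trigger timeout from a sliding-window
    latency quantile instead of a static threshold."""

    def __init__(self, retry_budget=0.03, quantile=0.97, window=1000):
        self.retry_budget = retry_budget       # retries allowed as a share of traffic
        self.quantile = quantile               # retry fires past this latency quantile
        self.latencies = deque(maxlen=window)  # sliding window of observed latencies (ms)
        self.requests = 0
        self.retries = 0

    def record(self, latency_ms):
        """Feed back one completed request's latency."""
        self.requests += 1
        self.latencies.append(latency_ms)

    def retry_timeout_ms(self):
        """Latency after which a retry is issued: the windowed quantile."""
        if not self.latencies:
            return float("inf")
        ordered = sorted(self.latencies)
        idx = min(len(ordered) - 1, int(self.quantile * len(ordered)))
        return ordered[idx]

    def may_retry(self):
        """Permit a retry only while the budget is not exhausted, so retry
        storms cannot amplify an outage."""
        if self.requests == 0:
            return False
        if self.retries / self.requests >= self.retry_budget:
            return False
        self.retries += 1
        return True
```

Because the timeout tracks the live quantile, a latency regression automatically widens the retry trigger, while the budget check keeps total retry traffic bounded even when many instances degrade at once.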
Service‑level solution: A multi‑recall scheduling framework classifies recalls into three tiers, applies a discard mechanism for unresponsive calls, and uses a cache‑based compensation strategy to mitigate loss; the ranking layer employs coarse‑ and fine‑ranking with a stable router and fallback paths to gracefully degrade when ranking services fail.
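As a minimal sketch of the service‑level scheduling idea (recall tiers with per‑tier deadlines, discard of unresponsive calls, and cache‑based compensation), the following is a hypothetical illustration; the tier names, timeouts, and function signatures are assumptions, not details from the source:

```python
import concurrent.futures as cf

# Hypothetical tiering: tier 1 recalls are considered essential and get the
# longest deadline; tier 2/3 recalls may be discarded when they miss theirs,
# and a per-recall cache of the last good response backfills the gap.
RECALL_TIERS = {"follow_feed": 1, "hot_topics": 2, "long_tail": 3}
TIER_TIMEOUT_MS = {1: 200, 2: 120, 3: 80}

def run_recalls(recall_fns, cache):
    """Run all recalls in parallel; discard ones that blow their tier's
    deadline and compensate from cache so the candidate set stays full."""
    results = {}
    with cf.ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn) for name, fn in recall_fns.items()}
        for name, fut in futures.items():
            tier = RECALL_TIERS[name]
            try:
                results[name] = fut.result(timeout=TIER_TIMEOUT_MS[tier] / 1000)
                cache[name] = results[name]   # refresh the compensation cache
            except cf.TimeoutError:
                if name in cache:             # cache-based loss mitigation
                    results[name] = cache[name]
    return results
```

The same discard‑then‑compensate pattern underpins graceful ranking degradation: when a downstream service misses its budget, the caller serves a slightly stale or coarser result rather than failing the request.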
IDC‑level solution: An active‑active multi‑master storage architecture keeps a full copy of data in each region, enforces local reads, and performs cross‑IDC asynchronous writes (with a fallback message queue), enabling rapid traffic shifting during regional outages.
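The IDC‑level write path described above (synchronous local write, asynchronous cross‑IDC replication, fallback message queue on replication failure, strictly local reads) can be sketched roughly as below. This is an assumed toy model, not Baidu's storage API:

```python
import queue
import threading

class MultiMasterStore:
    """Hypothetical sketch of an active-active multi-master write path:
    writes commit locally first, replicate to the peer IDC asynchronously,
    and fall back to a queue for redelivery when the peer is unreachable."""

    def __init__(self, local, remote):
        self.local = local             # this IDC's full data copy (dict here)
        self.remote = remote           # peer IDC client; put() may raise
        self.fallback = queue.Queue()  # parked writes awaiting redelivery

    def write(self, key, value):
        self.local[key] = value                       # 1. local write on the critical path
        threading.Thread(target=self._replicate,
                         args=(key, value)).start()   # 2. async cross-IDC replication

    def _replicate(self, key, value):
        try:
            self.remote.put(key, value)
        except ConnectionError:
            self.fallback.put((key, value))           # 3. fallback message queue

    def read(self, key):
        return self.local.get(key)                    # reads never cross the IDC boundary
```

Because each region holds a full copy and reads stay local, shifting traffic to the surviving IDC during a regional outage requires no data movement; the fallback queue drains once the peer recovers.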
The combined architecture has reduced availability loss during normal operation by over 90%, consistently meeting the five‑nines target and providing robust resilience against large‑scale faults.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.