Building High‑Availability Architecture for Baidu Feed Online Recommendation System

This article describes how Baidu engineered a flexible, multi‑level fault‑tolerant architecture to achieve five‑nines availability for its massive feed recommendation service. The architecture combines dynamic retry scheduling, multi‑recall coordination, ranking‑layer degradation, and cross‑IDC multi‑master storage.

Baidu Intelligent Testing

Baidu Feed's information‑flow recommendation system powers most of the company's products. It handles billions of requests and is backed by hundreds of microservices running on thousands of machines, so ensuring high availability is a core architectural goal.

To meet the five‑nines availability target, the team designed a flexible, multi‑level fault‑handling framework that addresses instance‑level, service‑level, and IDC‑level failures.

Instance‑level solution: A dynamic retry scheduling mechanism uses real‑time latency quantiles to cap retry traffic (for example, at 3% of requests). A real‑time stop‑loss component adjusts instance weights based on availability and latency feedback, limiting the impact of an outage within seconds.
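The two instance-level mechanisms can be illustrated with a minimal Python sketch. It is not Baidu's implementation: the class `RetryScheduler`, the function `adjust_weight`, the sliding-window size, and every threshold below are hypothetical choices made for illustration. The idea shown is that a retry fires only once a request has been outstanding longer than the (1 - budget) latency quantile, so roughly the slowest 3% of requests are retried, while an instance's routing weight is cut sharply on bad feedback and restored gradually.

```python
from collections import deque


class RetryScheduler:
    """Caps retry traffic by deriving the retry timeout from a latency quantile."""

    def __init__(self, budget=0.03, window=1000, default_timeout_ms=100.0):
        self.budget = budget                    # target fraction of retried requests
        self.latencies = deque(maxlen=window)   # sliding window of recent latencies
        self.default_timeout_ms = default_timeout_ms

    def observe(self, latency_ms):
        """Record the latency of a completed request."""
        self.latencies.append(latency_ms)

    def retry_threshold_ms(self):
        """Return the (1 - budget) quantile of the recent latencies."""
        if not self.latencies:
            return self.default_timeout_ms
        ordered = sorted(self.latencies)
        idx = min(len(ordered) - 1, int(len(ordered) * (1.0 - self.budget)))
        return ordered[idx]

    def should_retry(self, elapsed_ms):
        """Retry only requests slower than the current quantile threshold."""
        return elapsed_ms > self.retry_threshold_ms()


def adjust_weight(weight, success_rate, p99_ms,
                  min_success=0.99, max_p99_ms=200.0, floor=0.01):
    """Stop-loss sketch: halve the weight of an unhealthy instance, recover slowly.

    All thresholds are assumed values, not figures from the article.
    """
    healthy = success_rate >= min_success and p99_ms <= max_p99_ms
    if healthy:
        return min(1.0, weight + 0.1)
    return max(floor, weight * 0.5)
```

Tying the threshold to a quantile rather than a fixed timeout means the retry rate stays near the budget even when overall latency drifts, which keeps retries from amplifying load during an incident.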

Service‑level solution: A multi‑recall scheduling framework classifies recalls into three tiers, discards calls that do not respond in time, and uses a cache‑based compensation strategy to mitigate the resulting loss. The ranking layer combines coarse and fine ranking with a stable router and fallback paths, so it degrades gracefully when ranking services fail.
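A minimal Python sketch of the service-level ideas follows. The article does not define the three recall tiers, so the semantics here are assumptions: tier 1 and tier 2 recalls are compensated from a cache when they miss the deadline, tier 3 recalls are simply dropped, and a tier 1 recall with no cached result marks the request as degraded. The function names `run_recalls` and `rank_with_fallback`, the timeout value, and the fallback order (fine ranking, then coarse ranking, then recall order) are likewise hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, wait


def run_recalls(recalls, cache, timeout_s=0.05):
    """Run all recalls in parallel and discard any that miss the deadline.

    recalls: dict mapping name -> (tier, zero-argument callable)
    cache:   dict mapping name -> last good result, used for compensation
    Returns (results, degraded).
    """
    results, degraded = {}, False
    pool = ThreadPoolExecutor(max_workers=max(1, len(recalls)))
    futures = {pool.submit(fn): (name, tier)
               for name, (tier, fn) in recalls.items()}
    done, _ = wait(futures, timeout=timeout_s)
    for fut, (name, tier) in futures.items():
        if fut in done and fut.exception() is None:
            results[name] = fut.result()
            cache[name] = results[name]      # refresh the compensation cache
        elif tier <= 2 and name in cache:
            results[name] = cache[name]      # compensate from the cache
        elif tier == 1:
            degraded = True                  # critical recall lost, nothing cached
        # tier 3 (and uncached tier 2) recalls are dropped silently
    pool.shutdown(wait=False)                # do not block on slow recalls
    return results, degraded


def rank_with_fallback(candidates, fine_rank, coarse_rank):
    """Try fine ranking, then coarse ranking, then fall back to recall order."""
    for ranker in (fine_rank, coarse_rank):
        try:
            return ranker(candidates)
        except Exception:
            continue                         # this ranking path failed, try the next
    return list(candidates)
```

In this sketch a late recall never blocks the response; the request is answered with whatever arrived in time plus cached results, which trades some freshness for bounded latency.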

IDC‑level solution: An active‑active multi‑master storage architecture keeps a full copy of data in each region, enforces local reads, and performs cross‑IDC asynchronous writes (with a fallback message queue), enabling rapid traffic shifting during regional outages.
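The storage pattern can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the article's system: `RegionStore` is a hypothetical class, direct method calls stand in for asynchronous cross-IDC replication, a plain deque stands in for the fallback message queue, and conflicts are resolved by a last-writer-wins version number, which the article does not specify.

```python
from collections import deque


class RegionStore:
    """One region's full copy of the data in a multi-master setup."""

    def __init__(self, name):
        self.name = name
        self.data = {}            # key -> (version, value)
        self.peers = []           # other regions that hold a full copy
        self.reachable = True     # simulates cross-IDC connectivity
        self.pending = deque()    # fallback queue for writes that failed to replicate

    def read(self, key):
        """Reads are always served from the local copy."""
        entry = self.data.get(key)
        return entry[1] if entry else None

    def write(self, key, value, version):
        """Apply locally, then replicate to every peer region."""
        self._apply(key, value, version)
        for peer in self.peers:
            self._replicate(peer, key, value, version)

    def _apply(self, key, value, version):
        # Assumed conflict rule: the higher version wins.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def _replicate(self, peer, key, value, version):
        if peer.reachable:
            peer._apply(key, value, version)
        else:
            # Peer region is unreachable: park the write for later replay.
            self.pending.append((peer, key, value, version))

    def drain(self):
        """Replay queued writes; any that still fail are queued again."""
        for _ in range(len(self.pending)):
            peer, key, value, version = self.pending.popleft()
            self._replicate(peer, key, value, version)
```

Because every region holds a full copy and serves reads locally, traffic can be shifted to a healthy region without waiting on the failed one; the queue only has to catch that region up once it recovers.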

The combined architecture has improved normal‑operation availability by over 90%, consistently meets the five‑nines target, and provides robust resilience against large‑scale faults.
