
Building High‑Availability Architecture for Baidu Feed Online Recommendation System

This article describes how Baidu engineered a flexible, multi‑level fault‑tolerant architecture for its massive feed recommendation service. The design combines dynamic retry scheduling, multi‑recall coordination, ranking‑layer degradation, and cross‑IDC multi‑master storage to reach five‑nines availability.

Baidu Intelligent Testing

Jul 29, 2021


Baidu Feed's information‑flow recommendation system powers most of the company's products. It handles billions of requests across hundreds of microservices and thousands of machines, so high availability is a core architectural goal.

To meet the five‑nines (99.999%) availability target, the team designed a flexible, multi‑level fault‑handling framework that addresses instance‑level, service‑level, and IDC‑level failures.

Instance‑level solution: A dynamic retry scheduling mechanism controls retry traffic (e.g., limiting to 3% of requests) using real‑time latency quantiles, while a real‑time stop‑loss component adjusts instance weights based on availability and latency feedback, reducing outage impact within seconds.
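The two instance‑level ideas can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Baidu's implementation: the class `RetryBudget`, the function `adjust_weight`, the 95th‑percentile trigger, and every threshold other than the 3% retry cap are hypothetical names and values chosen for the example.

```python
from collections import deque


class RetryBudget:
    """Caps retries at a fixed share of requests (hypothetical helper)."""

    def __init__(self, max_retry_ratio=0.03, quantile=0.95, window=1000):
        self.max_retry_ratio = max_retry_ratio  # 3% cap, as in the article
        self.quantile = quantile                # assumed trigger quantile
        self.latencies = deque(maxlen=window)   # recent latencies in ms
        self.requests = 0
        self.retries = 0

    def record(self, latency_ms):
        """Record one completed primary request and its latency."""
        self.requests += 1
        self.latencies.append(latency_ms)

    def latency_threshold(self):
        """Return the configured latency quantile over the window, or None."""
        if not self.latencies:
            return None
        ordered = sorted(self.latencies)
        index = min(len(ordered) - 1, int(self.quantile * len(ordered)))
        return ordered[index]

    def should_retry(self, elapsed_ms):
        """Allow a retry only for slow requests and only within the budget."""
        threshold = self.latency_threshold()
        if threshold is None or elapsed_ms < threshold:
            return False
        if (self.retries + 1) / self.requests > self.max_retry_ratio:
            return False  # budget exhausted: retry traffic stays bounded
        self.retries += 1
        return True


def adjust_weight(weight, availability, latency_ms,
                  min_availability=0.99, max_latency_ms=200.0):
    """Halve an unhealthy instance's weight; restore a healthy one gradually.

    The thresholds and step sizes are assumptions for illustration only.
    """
    if availability < min_availability or latency_ms > max_latency_ms:
        return weight * 0.5
    return min(1.0, weight + 0.1)
```

Tying the retry trigger to a live latency quantile means retries fire only for requests that are slow relative to current conditions, while the ratio cap keeps retry traffic from amplifying load during an incident.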

Service‑level solution: A multi‑recall scheduling framework classifies recalls into three tiers, discards calls that do not respond in time, and uses a cache‑based compensation strategy to mitigate the resulting loss. The ranking layer combines coarse and fine ranking with a stable router and fallback paths, so results degrade gracefully when ranking services fail.
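The recall side of this can be sketched as a fan‑out with a deadline. This is only an illustration: `run_recalls`, `slow_source`, the 100 ms deadline, and the rule that only tiers 1 and 2 receive cache compensation are assumptions, since the summary does not say what each tier means.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

# Assumption: only the two higher-priority tiers fall back to cached results.
COMPENSATED_TIERS = {1, 2}


def run_recalls(recalls, cache, timeout_s=0.1):
    """Run recall sources in parallel and discard any that miss the deadline.

    recalls maps a source name to (tier, callable); cache maps a source name
    to its last good result and is updated in place.
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(recalls)))
    futures = {pool.submit(fn): (name, tier)
               for name, (tier, fn) in recalls.items()}
    done, _ = wait(list(futures), timeout=timeout_s)

    results = {}
    for future, (name, tier) in futures.items():
        if future in done and future.exception() is None:
            results[name] = future.result()
            cache[name] = results[name]      # refresh the compensation cache
        elif tier in COMPENSATED_TIERS and name in cache:
            results[name] = cache[name]      # compensate with the cached result
        # otherwise the recall is simply dropped from this response
    pool.shutdown(wait=False)                # do not block on stragglers
    return results


def slow_source():
    """Simulates an unresponsive recall service."""
    time.sleep(0.5)
    return ["late"]
```

For example, calling `run_recalls({"hot": (1, lambda: ["a"]), "social": (2, slow_source)}, {"social": ["cached"]})` returns the fresh result for `hot` and the cached result for `social`, because `social` misses the deadline.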

IDC‑level solution: An active‑active multi‑master storage architecture keeps a full copy of data in each region, enforces local reads, and performs cross‑IDC asynchronous writes (with a fallback message queue), enabling rapid traffic shifting during regional outages.
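A toy in‑memory model makes the read and write paths concrete. It is a sketch under assumptions rather than the production design: `RegionStore`, its `outbox` and `fallback` queues, and the `reachable` flag are hypothetical, replication is driven by explicit calls instead of a background worker, and conflict resolution between masters is not modeled.

```python
from collections import deque


class RegionStore:
    """One IDC's full copy of the data (hypothetical in-memory model)."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.peers = []          # other RegionStore instances
        self.reachable = True    # flipped to False to simulate an outage
        self.outbox = deque()    # writes awaiting asynchronous replication
        self.fallback = deque()  # writes parked after a failed delivery

    def read(self, key):
        """Reads are always served from the local copy."""
        return self.data.get(key)

    def write(self, key, value):
        """Apply locally and return; replication happens later."""
        self.data[key] = value
        for peer in self.peers:
            self.outbox.append((peer, key, value))

    def _deliver(self, queue):
        # Process only the entries present when the call started.
        for _ in range(len(queue)):
            peer, key, value = queue.popleft()
            if peer.reachable:
                peer.data[key] = value
            else:
                self.fallback.append((peer, key, value))

    def replicate_once(self):
        """What a background replication worker would run periodically."""
        self._deliver(self.outbox)

    def replay_fallback(self):
        """Retry the parked writes once the peer region has recovered."""
        self._deliver(self.fallback)
```

Because each region holds a full copy and serves reads locally, traffic can be shifted to a healthy region without waiting on the failed one; the fallback queue preserves the writes that region missed until it recovers.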

The combined architecture has improved normal‑operation availability by over 90%, consistently meets the five‑nines target, and provides robust resilience against large‑scale faults.


