Stability Assurance for Baidu Search Aladdin during Large-Scale Events
Baidu's Aladdin search service stays stable during massive traffic spikes, such as the Gaokao, the Tokyo Olympics, and the Beijing Winter Olympics, by mapping dependencies, deploying multi-dimensional monitoring, adding capacity layers such as a multi-region Redis cache, and establishing rapid-response on-call teams. These measures delivered over 99.99% stability and near-real-time data updates.
In the fast-changing internet industry, long-running services accumulate historical baggage. Baidu's vertical search product Aladdin has handled high-traffic events such as the Chinese college entrance exam (Gaokao), the Tokyo Olympics, and the Beijing Winter Olympics.
Baidu has served Gaokao queries since 2013, and that traffic now reaches billions of page views. The system complexity accumulated over those years creates stability risks, especially at the traffic peaks of large events.
Stability assurance approach
1. Fault discovery – Build a complete business model, map upstream/downstream dependencies, assess risk, and set up comprehensive logging and monitoring. For each hotspot event, identify unique dependency chains.
Key checklist (excerpt):
Notify upstream/downstream owners of event timing and impact.
Identify functional points used by the event and monitor traffic trends.
Estimate peak QPS for the event and prepare scaling or degradation plans.
Verify code robustness of core functions and prepare risk/rollback plans.
Instrument logs for rapid fault localization.
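The first and third checklist items can be sketched in a few lines: walk the dependency map to find every service whose owner should be notified, then turn a peak-QPS estimate into an instance count. This is a minimal illustration, not Baidu's tooling. The service names, the `DEPENDENCIES` map, the per-instance capacity, and the 30% headroom are all assumptions made up for the example.

```python
import math
from collections import deque

# Hypothetical dependency map: service -> services it calls directly.
DEPENDENCIES = {
    "aladdin-frontend": ["card-renderer", "result-cache"],
    "card-renderer": ["medal-data-api"],
    "result-cache": [],
    "medal-data-api": ["official-feed"],
    "official-feed": [],
}


def dependency_chain(entry, graph):
    """Return every service reachable from `entry`, in breadth-first order."""
    seen = {entry}
    order = []
    queue = deque([entry])
    while queue:
        node = queue.popleft()
        order.append(node)
        for dep in graph.get(node, []):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return order


def instances_needed(peak_qps, qps_per_instance, headroom=0.3):
    """Instances required to serve peak_qps while keeping `headroom` spare.

    headroom=0.3 means each instance is planned at 70% of its measured
    capacity (an assumed safety margin, not a figure from the article).
    """
    usable = qps_per_instance * (1.0 - headroom)
    return math.ceil(peak_qps / usable)


if __name__ == "__main__":
    print(dependency_chain("aladdin-frontend", DEPENDENCIES))
    print(instances_needed(peak_qps=30000, qps_per_instance=500))
```

Comparing the computed instance count with what is currently deployed shows whether the event needs a scale-out or a degradation plan.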
2. Fault control – Construct multi‑dimensional monitoring (function, business, latency, data freshness) across regions and data centers. Reduce data latency, set real‑time alerts, and ensure end‑to‑end consistency with official data.
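As a rough sketch of multi-dimensional monitoring, the check below evaluates one per-region sample against a threshold for each dimension, including data freshness. The field names and threshold values are assumptions for illustration. The article does not give the actual limits.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    region: str
    error_rate: float      # fraction of failed requests
    p99_latency_ms: float  # 99th-percentile latency in milliseconds
    data_age_s: float      # seconds since the last update from the official source


# Hypothetical alert thresholds, one per monitored dimension.
THRESHOLDS = {"error_rate": 0.001, "p99_latency_ms": 300.0, "data_age_s": 60.0}


def check(sample, thresholds=THRESHOLDS):
    """Return one alert message per dimension that exceeds its threshold."""
    alerts = []
    for dim, limit in thresholds.items():
        value = getattr(sample, dim)
        if value > limit:
            alerts.append(f"{sample.region}: {dim}={value} exceeds {limit}")
    return alerts


if __name__ == "__main__":
    stale = Sample(region="bj", error_rate=0.0005, p99_latency_ms=120.0, data_age_s=95.0)
    for alert in check(stale):
        print(alert)
```

Running the same check per region and per data center keeps a stale feed in one location from being averaged away by healthy ones.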
During events, isolate faults at the business, service, and storage layers, and reinforce core modules. For example, Olympic traffic was expected to reach tens of thousands of QPS, well above typical peaks, so a multi-region Redis cache layer was added to absorb the load.
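One way to picture a multi-region cache layer is a read-through lookup that tries the local region first, then other regions, and only then the backend. The sketch below uses an in-memory `TTLCache` class as a stand-in for a regional cache. It is not a Redis client, and the lookup order, key format, and 5-second TTL are assumptions rather than details from the article.

```python
import time


class TTLCache:
    """In-memory stand-in for one regional cache (not a Redis client)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # drop expired entries lazily
            return None
        return value

    def set(self, key, value, ttl_s):
        self._data[key] = (value, self._clock() + ttl_s)


def read_through(key, local, remotes, load_from_backend, ttl_s=5.0):
    """Try the local region, then other regions, then the backend.

    Returns (value, source) so callers can track hit ratios per layer.
    """
    value = local.get(key)
    if value is not None:
        return value, "local"
    for remote in remotes:
        value = remote.get(key)
        if value is not None:
            local.set(key, value, ttl_s)  # warm the local region
            return value, "remote"
    value = load_from_backend(key)
    local.set(key, value, ttl_s)
    return value, "backend"


if __name__ == "__main__":
    local, remote = TTLCache(), TTLCache()
    print(read_through("medals:CN", local, [remote], lambda k: "gold:3"))
    print(read_through("medals:CN", local, [remote], lambda k: "gold:3"))
```

A short TTL trades some cache hit ratio for fresher data, which matters when results must track an official feed closely.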
3. Fault handling – Establish a rapid‑response on‑call group, coordinate with operations, and enable automatic instance shutdown, traffic shaping, and selective feature degradation. Prepare pre‑defined intervention plans for data timeliness, and conduct fault‑injection drills to validate response procedures.
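Traffic shaping combined with selective degradation might look like the following: a token bucket admits requests up to a configured rate, and requests over the limit receive a cheaper, degraded response instead of an error. The token bucket is a standard technique chosen here for illustration. The article does not say which algorithm Aladdin uses, and the rate, burst size, and render functions are hypothetical.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at `rate_per_s`, holds at most `burst` tokens."""

    def __init__(self, rate_per_s, burst, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = float(burst)
        self._clock = clock
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill in proportion to elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self._last) * self.rate)
        self._last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def handle(request_id, bucket, full_render, degraded_render):
    """Serve the full feature while under the limit, else a degraded version."""
    if bucket.allow():
        return full_render(request_id)
    return degraded_render(request_id)


if __name__ == "__main__":
    bucket = TokenBucket(rate_per_s=2, burst=2)
    for i in range(4):
        print(handle(i, bucket, lambda r: f"{r}: full card", lambda r: f"{r}: lite card"))
```

Injecting the clock makes the limiter easy to exercise in a fault-injection drill without waiting on real time.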
Through these measures, Aladdin maintained >99.99 % stability and near‑real‑time data updates during the major events.
Baidu Geek Talk