How Baidu Zhidao Migrated 18 Years of Legacy to a Cloud‑Native Architecture
This article details Baidu Zhidao’s migration from an aging, monolithic PaaS platform to a cloud‑native environment, explaining the business drivers, the selection of Pandora and Zhiyun platforms, the step‑by‑step traffic‑shifting and gateway redesign, and the measurable gains in stability, scalability, and cost after achieving 100% cloud traffic.
Background and Challenges
Baidu Zhidao, an 18‑year‑old knowledge‑sharing service, faced inconsistent code styles, heavy legacy debt, and a PaaS‑based ORP layer that could no longer meet the stability and rapid‑iteration requirements of a product serving more than 100 million page views per day. The service needed four‑nines (99.99 %) availability while keeping operational costs low.
Cloud Solution Selection
After retiring ORP at the end of 2022, the team evaluated Kubernetes, cost, timeline, and manpower, and aligned with the company‑wide migration path. The final stack consisted of:
Pandora – the underlying container orchestration platform.
Zhiyun Platform – resource management, deployment, and service integration.
Why Pandora
Supports Baidu’s major C‑end services (search, feed, Baidu App, Baijiahao, video), making it a natural fit for Zhidao’s workload.
Can deploy up to 2 000 modules simultaneously without extensive code refactoring, matching the size of the ODP monolith.
Pandora's usability is slightly lower than ORP's, but Zhiyun supplies the missing features (static resources, proxy, data delivery), so the gap did not change the decision.
Why Zhiyun
Enables multi‑APP co‑deployment, allowing ODP projects to migrate without merging or splitting codebases.
Provides out‑of‑the‑box services (log splitting, scheduled tasks, access layer, static resources, traffic routing) that mirror ORP capabilities while embracing cloud‑native principles.
Offers a customizable runtime environment for ODP, simplifying container deployment and operation.
Integrates container entry‑point management, reducing operational overhead.
Preparation for Traffic Shifting
Create a Zhiyun product line and application, request ECI resources, configure the ODP runtime, and set up deployment templates.
Refactor the access layer: add new BNS variables, grant DB/Redis access for new regions, and upgrade MySQL/Redis configurations.
Upgrade backend language from HHVM to PHP 7, fixing compatibility issues and gaining performance and security benefits.
Add monitoring (Noah, SIA) and adjust log collection paths.
Traffic‑Splitting Implementation
Lua scripts in the access layer define probability tables. Each entry lists the percentage of traffic sent to each of three outcomes, in the order ("opera", "abtest", "orp"):
["strategy_1_1_98"] = {1, 1, 98},
["strategy_5_5_90"] = {5, 5, 90},
["strategy_10_10_80"] = {10, 10, 80},
["strategy_20_20_60"] = {20, 20, 60},
...,
["strategy_100_0_0"] = {100, 0, 0}
The final proxy target is set with a new variable $upstream_target that encodes terminal and cluster (e.g., pc_pandora, wap_orp). Business code then routes ads based on this marker:
if ($_SERVER['HTTP_X_BD_TARGET'] == 'pandora') {
$adsEids = array('asp' => array(50001));
} else if ($_SERVER['HTTP_X_BD_TARGET'] == 'abtest') {
$adsEids = array('asp' => array(50002));
}
Gateway Migration
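Before the gateway details, a brief look back at the traffic‑splitting step: the percentage tables there can be read as weights for a random draw per request. The sketch below is an illustration in Python, not the team's Lua script. The names OUTCOMES, STRATEGIES, pick_target, and route_request are hypothetical, and the cumulative‑weight lookup is an assumed way to turn a table row into a routing decision. Rows the article elides with "..." are left out here as well.

```python
import random

# Outcome order as listed in the article: ("opera", "abtest", "orp").
OUTCOMES = ("opera", "abtest", "orp")

# Percentage weights copied from the strategy rows shown in the article.
STRATEGIES = {
    "strategy_1_1_98": (1, 1, 98),
    "strategy_5_5_90": (5, 5, 90),
    "strategy_10_10_80": (10, 10, 80),
    "strategy_20_20_60": (20, 20, 60),
    "strategy_100_0_0": (100, 0, 0),
}


def pick_target(strategy_name, roll):
    """Map a roll in [0, 100) to an outcome using cumulative weights."""
    weights = STRATEGIES[strategy_name]
    if sum(weights) != 100:
        raise ValueError("weights must sum to 100")
    if not 0 <= roll < 100:
        raise ValueError("roll must be in [0, 100)")
    upper = 0
    for outcome, weight in zip(OUTCOMES, weights):
        upper += weight
        if roll < upper:
            return outcome
    raise AssertionError("unreachable when weights sum to 100")


def route_request(strategy_name, rng=random):
    """Draw a fresh roll for one request (assumed per-request sampling)."""
    return pick_target(strategy_name, rng.randrange(100))
```

With strategy_1_1_98, a roll of 0 maps to "opera", a roll of 1 to "abtest", and rolls 2 through 99 to "orp". Moving to the next strategy only widens the first two bands, so the code path stays the same throughout the ramp‑up.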
The original inrouter gateway contained >2 600 lines of configuration and could not react quickly to downstream changes. The team switched to Janus, which is already used by other Baidu products. Benefits include:
Fine‑grained routing control with per‑service, per‑rule, per‑instance visibility.
Consolidation of 2 700+ Nginx rules into 18 clear rules, dramatically reducing maintenance risk.
Enhanced safety through staged releases, checkers, and automated test cases.
Architecture Evolution
After migration, Zhidao built a three‑region, four‑datacenter topology for its core Q&A page (QB). Traffic distribution became North China : Central China : South China = 4 : 3 : 3, providing N+1 cross‑region redundancy. Non‑core traffic was moved to two North China datacenters with intra‑region redundancy.
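To make the N+1 claim concrete, the following sketch works out how much traffic each region would have to absorb if any single region went offline. It assumes that traffic from a failed region is redistributed in proportion to the surviving regions' weights, which the article does not state. The region keys and function names are hypothetical.

```python
from fractions import Fraction

# Baseline weights from the article: North : Central : South = 4 : 3 : 3.
BASELINE = {"north": 4, "central": 3, "south": 3}


def shares(weights):
    """Convert integer weights into exact traffic shares."""
    total = sum(weights.values())
    return {region: Fraction(w, total) for region, w in weights.items()}


def shares_after_failure(weights, failed):
    """Shares when one region is removed (assumed proportional failover)."""
    survivors = {r: w for r, w in weights.items() if r != failed}
    return shares(survivors)


def required_capacity(weights):
    """Largest share each region sees across all single-region failures."""
    peak = shares(weights)
    for failed in weights:
        for region, share in shares_after_failure(weights, failed).items():
            peak[region] = max(peak[region], share)
    return peak


peak = required_capacity(BASELINE)
# north -> 4/7 (about 57 %), central -> 1/2, south -> 1/2
```

Under this assumption, North China normally carries 40 % of traffic but needs headroom for about 57 %, while Central and South China each move from 30 % to at most 50 %.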
Results and Benefits
By 31 Mar 2023, 100 % of Zhidao traffic ran on the cloud, eliminating the legacy ORP layer.
From Q3 2022 onward, four‑nine SLA was achieved for three consecutive quarters with zero cloud‑related incidents.
Core page latency dropped 12 % (first meaningful paint, FMP, at or under 1 s for 80 % of page loads) without additional optimization.
Resource cost fell sharply as OXP machines were decommissioned and ECI IP costs were amortized.
The gateway rewrite consolidated 2 700+ Nginx rules into 18, improving maintainability and safety.
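For a sense of scale, the four‑nines target in the results above can be converted into a downtime budget. The sketch assumes availability is measured over wall‑clock time and that a quarter is 90 days; the article does not say how the SLA is actually calculated. The function name is hypothetical.

```python
from fractions import Fraction


def downtime_budget_minutes(nines, days):
    """Allowed downtime in minutes for an availability of `nines` nines."""
    unavailability = Fraction(1, 10 ** nines)  # four nines -> 1/10000
    return unavailability * days * 24 * 60


quarter = downtime_budget_minutes(4, 90)
year = downtime_budget_minutes(4, 365)
print(float(quarter))  # 12.96
print(float(year))     # 52.56
```

Under these assumptions, a 99.99 % target leaves about 13 minutes of downtime per quarter, which is why a single prolonged incident is enough to miss it.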
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
