How Baidu Feed Scaled to Serverless with Multi‑Dimensional Service Profiles
This article explains how Baidu Feed’s backend services were transformed to a serverless model by building elastic, traffic, and capacity profiles for each service, enabling predictive, load‑feedback, and timed scaling strategies that automatically adjust resources with traffic fluctuations, reduce costs, and maintain stability.
Background
In Baidu’s cloud‑native environment, the Feed recommendation system runs thousands of compute‑intensive micro‑services 24/7 with statically provisioned capacity. Traffic exhibits clear tidal patterns, causing resource waste during troughs and insufficient capacity during peaks.
Goal
Provide serverless‑style elasticity for heavy backend services by constructing multi‑dimensional service profiles (elastic, traffic, capacity) that drive dynamic capacity adjustments.
Service Profiling
Elastic Profile: Classifies services as high, medium, or low elasticity based on instance deployment time, resource quota, statefulness, and external dependencies.
Traffic Profile: Uses historical CPU usage (as a proxy for QPS) aggregated in configurable time‑slices (e.g., hourly). The data are smoothed with median‑absolute‑deviation filtering, and the maximum values of recent windows are kept as the traffic estimate for each slice.
Capacity Profile: Derives the required CPU buffer from observed peak CPU utilization and maps it to acceptable latency thresholds for core and non‑core services.
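The traffic profile's smoothing step can be sketched roughly as follows. This is a minimal illustration rather than Baidu's implementation: the function names, the cutoff of 3 scaled MADs, and the reading of "maximum values of recent windows" as "largest filtered reading across recent days for the same slice" are all assumptions.

```python
from statistics import median

def mad_filter(samples, k=3.0):
    """Drop samples further than k scaled MADs from the median.

    k=3.0 is an assumed cutoff; 1.4826 is the usual constant that makes
    the MAD comparable to a standard deviation for normal data.
    """
    med = median(samples)
    mad = median([abs(x - med) for x in samples])
    if mad == 0:
        # No measurable spread, so nothing can be flagged as an outlier.
        return list(samples)
    limit = k * 1.4826 * mad
    return [x for x in samples if abs(x - med) <= limit]

def slice_estimate(daily_samples, k=3.0):
    """Traffic estimate for one time-slice.

    daily_samples holds one list of CPU readings per recent day, all
    taken from the same slice (hypothetical input layout).
    """
    per_day_peaks = [max(mad_filter(day, k)) for day in daily_samples]
    return max(per_day_peaks)

# Two days of readings for the same hourly slice; 95 is a one-off spike.
history = [[40, 42, 41, 43, 95], [44, 45, 46, 45, 44]]
estimate = slice_estimate(history)
```

Filtering before taking the maximum matters here: without it, a single spike would permanently inflate the estimate for that slice.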
Elasticity Strategies
Predictive Elasticity: Forecasts traffic for the next time‑slice and pre‑emptively scales up or down. Four traffic‑trend cases are defined (rising, turning‑up, peak, falling) and scaling actions are derived from the case.
Load‑Feedback Elasticity: Continuously monitors real‑time CPU and custom metrics (e.g., latency) and adjusts instance counts to keep load within target ranges. Scaling‑up is performed immediately; scaling‑down is delegated to the other strategies.
Timed Elasticity: Executes fixed scaling actions before known peak periods and after off‑peak periods based on the maximum traffic observed in each phase.
Priority order: timed > predictive > load‑feedback.
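One way to read the priority order is that, when several strategies propose a target at the same moment, the highest-priority proposal wins. The sketch below assumes exactly that; the `resolve` function and the dictionary layout are hypothetical, and the source does not say how conflicts are actually arbitrated.

```python
from typing import Dict, Optional, Tuple

# Highest priority first, as stated in the article.
PRIORITY = ("timed", "predictive", "load_feedback")

def resolve(proposals: Dict[str, Optional[int]]) -> Optional[Tuple[str, int]]:
    """Return (strategy, target_instances) for the winning proposal.

    A value of None (or a missing key) means the strategy proposes
    no action in this cycle.
    """
    for name in PRIORITY:
        target = proposals.get(name)
        if target is not None:
            return name, target
    return None

# Timed elasticity is idle, so the predictive proposal is applied.
winner = resolve({"timed": None, "predictive": 12, "load_feedback": 15})
```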
Stability Guarantees
Elastic inspections periodically trigger instance migrations to validate scaling capability.
Capacity inspections monitor resource usage and raise alerts when limits are approached.
Status inspections verify service state consistency across scaling cycles.
One‑click interventions provide rapid rollback or emergency actions.
Implementation Highlights
Standardized container migration and compute‑storage separation reduce dictionary download and extraction time, improving instance startup latency.
Shared cloud disks enable on‑demand loading of large dictionary files, further cutting deployment time.
Target instance counts are bounded by configured upper/lower limits and step‑size constraints to avoid over‑scaling or abrupt shrinkage.
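The bounding rule for target instance counts can be sketched as a step limit followed by a clamp. The parameter names are illustrative, and applying the step limit before the bounds is an assumption; the source does not specify the order.

```python
def bounded_target(current, desired, lower, upper, max_step):
    """Move from current toward desired without exceeding max_step
    instances per action, then clamp to the configured bounds."""
    step = max(-max_step, min(max_step, desired - current))
    return max(lower, min(upper, current + step))

# Desired jump of +20 is cut to the step size of 8.
scaled_up = bounded_target(current=10, desired=30, lower=5, upper=50, max_step=8)
```

Because each action moves at most `max_step` instances, a large change in the desired count is reached over several cycles rather than in one abrupt shrink or expansion.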
Architecture Overview
The overall elasticity architecture consists of service profiling, elastic strategy engines, cloud‑native components (PaaS for scaling actions, ALM for data and policy management), and resource pools (private and public clouds). The diagram below illustrates the data flow.
Predictive Elasticity Details
For each service, the previous, current, and next time‑slice traffic values (prev, cur, next) are obtained from the maximum traffic of the past N days. The four cases are:
prev < cur < next – continuous rise → pre‑scale to next.
prev > cur < next – valley turning up → pre‑scale to next.
prev < cur > next – peak → no action.
prev > cur > next – falling trend → scale down to cur.
Target capacity is the larger of the case‑based target traffic and a growth‑rate‑based estimate (cur × maxGrowthRate). The target instance count is computed from the capacity profile and applied via PaaS.
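The four cases and the growth-rate floor can be sketched as below. This is an illustration under assumptions: the function name is hypothetical, the source does not say how maxGrowthRate is derived, and it does not define what happens when adjacent slices are equal, so ties are treated here as "no action".

```python
def predictive_target(prev, cur, nxt, max_growth_rate):
    """Return the target traffic for the next slice, or None for no action."""
    if cur < nxt:
        # Continuous rise (prev < cur) or valley turning up (prev > cur):
        # pre-scale to the next slice's traffic.
        case_target = nxt
    elif prev < cur and cur > nxt:
        # Peak: capacity already covers the current slice, so hold.
        return None
    elif prev > cur and cur > nxt:
        # Falling trend: shrink only to the current slice's traffic.
        case_target = cur
    else:
        # Ties between slices are not covered by the source (assumption).
        return None
    # The target is never below the growth-rate-based estimate.
    return max(case_target, cur * max_growth_rate)

rising = predictive_target(prev=100, cur=120, nxt=150, max_growth_rate=1.5)
```

In the rising example the growth-rate estimate (120 × 1.5 = 180) exceeds the next slice's 150, so 180 becomes the target.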
Load‑Feedback Elasticity Details
Metrics collected every 10 s (CPU usage, custom Prometheus metrics) are aggregated in a sliding window (e.g., 1 min) and filtered with median‑absolute‑deviation to remove outliers. The current load is compared against the capacity profile’s CPU buffer; if the load exceeds the upper threshold, instances are added, respecting step‑size limits. Scaling‑down is omitted to avoid conflict with predictive actions.
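A scale-up-only feedback loop along these lines might look like the sketch below. The class, its parameters, and the proportional sizing rule (grow the instance count by the ratio of observed load to the upper threshold) are assumptions for illustration; the source states only that instances are added when the threshold is exceeded, subject to step-size limits. Six samples correspond to a 1 min window at one sample every 10 s.

```python
import math
from collections import deque
from statistics import median

class LoadFeedback:
    def __init__(self, upper_threshold, max_step, window_size=6):
        self.upper = upper_threshold      # e.g. 0.6 means 60% CPU
        self.max_step = max_step          # max instances added per action
        self.window = deque(maxlen=window_size)

    def observe(self, cpu_utilization):
        self.window.append(cpu_utilization)

    def filtered_load(self):
        """Mean of the window after dropping MAD outliers (3 scaled MADs)."""
        samples = list(self.window)
        med = median(samples)
        mad = median([abs(x - med) for x in samples])
        if mad > 0:
            limit = 3 * 1.4826 * mad
            samples = [x for x in samples if abs(x - med) <= limit]
        return sum(samples) / len(samples)

    def instances_to_add(self, current_instances):
        """Never negative: scale-down is left to the other strategies."""
        if len(self.window) < self.window.maxlen:
            return 0  # wait for a full window before acting
        load = self.filtered_load()
        if load <= self.upper:
            return 0
        needed = math.ceil(current_instances * load / self.upper)
        return min(self.max_step, needed - current_instances)

fb = LoadFeedback(upper_threshold=0.6, max_step=5)
for u in [0.78, 0.80, 0.82, 0.80, 0.79, 0.81]:
    fb.observe(u)
to_add = fb.instances_to_add(current_instances=20)
```

Here the filtered load is about 0.80 against a 0.60 threshold, so 27 instances would be needed; the step limit caps the addition at 5.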
Timed Elasticity Details
Peak and off‑peak periods are defined per service based on historical traffic slices. The maximum traffic within each period determines the target capacity. Scaling actions are scheduled to expand capacity shortly before a peak starts and shrink it after the peak ends.
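The schedule described above can be sketched as a pair of actions per peak period. The `ScalingAction` type, the function name, and the 10-minute lead and lag are hypothetical; the source says only that expansion happens "shortly before" the peak and shrinking after it ends.

```python
from dataclasses import dataclass
from datetime import date, datetime, time, timedelta

@dataclass(frozen=True)
class ScalingAction:
    at: time               # time of day at which the action fires
    target_traffic: float  # traffic level the capacity must cover

def timed_actions(peak_start, peak_end, peak_max, off_peak_max,
                  lead=timedelta(minutes=10), lag=timedelta(minutes=10)):
    """Expand before the peak to its maximum traffic, then shrink
    after the peak to the off-peak maximum."""
    day = date(2000, 1, 1)  # arbitrary anchor date for time arithmetic
    up_at = (datetime.combine(day, peak_start) - lead).time()
    down_at = (datetime.combine(day, peak_end) + lag).time()
    return [ScalingAction(up_at, peak_max), ScalingAction(down_at, off_peak_max)]

actions = timed_actions(time(19, 0), time(23, 0), peak_max=900.0, off_peak_max=300.0)
```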
Capacity Modeling
Peak CPU utilization is used as a proxy for required capacity. For core services, a larger CPU buffer is kept to guarantee latency; non‑core services tolerate smaller buffers. Machine‑learning models map QPS and resource usage to latency (f(qps, X) = latency) to compute the optimal CPU buffer for each service.
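To make the relationship between latency, CPU buffer, and instance count concrete, the sketch below replaces the learned model with a toy latency curve that depends only on CPU utilization. Everything here is an assumption for illustration: the function names, the toy curve, the 1% search grid, and the premise that latency rises monotonically with utilization.

```python
import math

def max_safe_utilization(latency_model, latency_limit_ms):
    """Largest utilization (in 1% steps) whose predicted latency stays
    within the limit, assuming latency grows with utilization."""
    best = 0.0
    for percent in range(1, 101):
        u = percent / 100
        if latency_model(u) <= latency_limit_ms:
            best = u
        else:
            break
    return best

def required_instances(target_cpu_cores, cores_per_instance, cpu_buffer):
    """Instances needed so that target load fits in the non-buffer share."""
    usable = cores_per_instance * (1.0 - cpu_buffer)
    return math.ceil(target_cpu_cores / usable)

def toy_latency(u):
    # Stand-in for a learned f(qps, X): latency climbs steeply near saturation.
    return 20.0 / (1.0 - u) if u < 1.0 else float("inf")

safe_u = max_safe_utilization(toy_latency, latency_limit_ms=60.0)
cpu_buffer = 1.0 - safe_u
instances = required_instances(target_cpu_cores=500, cores_per_instance=16,
                               cpu_buffer=0.25)
```

A tighter latency limit lowers the safe utilization and therefore enlarges the buffer, which matches the article's point that core services keep larger buffers than non‑core ones.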
Results
Deploying the serverless elasticity framework across Baidu Feed scaled the system to over 100 000 service instances while significantly reducing operational costs.
Future Work
Future work will focus on assuring capacity during hotspot events and on applying machine‑learning techniques to improve traffic‑profile prediction accuracy.
