How Baidu Feed Scaled to Serverless with Multi‑Dimensional Service Profiles

This article explains how Baidu Feed’s backend services were moved to a serverless model by building elastic, traffic, and capacity profiles for each service. These profiles drive predictive, load‑feedback, and timed scaling strategies that adjust resources automatically as traffic fluctuates, reducing cost while maintaining stability.

Baidu Geek Talk

Mar 15, 2023

Background

In Baidu’s cloud‑native environment, the Feed recommendation system runs thousands of compute‑intensive micro‑services 24/7 with statically provisioned capacity. Traffic exhibits clear tidal patterns, causing resource waste during troughs and insufficient capacity during peaks.

Goal

Provide serverless‑style elasticity for heavy backend services by constructing multi‑dimensional service profiles (elastic, traffic, capacity) that drive dynamic capacity adjustments.

Service Profiling

Elastic Profile: Classifies services as high, medium, or low elasticity based on instance deployment time, resource quota, statefulness, and external dependencies.

Traffic Profile: Uses historical CPU usage (as a proxy for QPS) aggregated in configurable time‑slices (e.g., hourly). Outliers are removed with median‑absolute‑deviation (MAD) filtering, and the maximum value across recent windows is kept as the traffic estimate for each slice.

Capacity Profile: Derives the CPU buffer each service must keep from its observed peak CPU utilization, sized against the latency thresholds acceptable for core and non‑core services.
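A minimal sketch of how a traffic profile of this kind could be built, assuming the history is available as per-day, per-slice lists of CPU samples. The function names, the data layout, and the cutoff of 3 scaled MADs are illustrative assumptions, not details taken from Baidu's implementation.

```python
from statistics import median

def mad_filter(values, k=3.0):
    """Drop points farther than k scaled MADs from the median."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return list(values)  # no spread to judge by, so keep everything
    # 1.4826 makes the MAD comparable to a standard deviation for normal data.
    limit = k * 1.4826 * mad
    return [v for v in values if abs(v - med) <= limit]

def traffic_profile(history, k=3.0):
    """history[day][slice] is a list of CPU samples (assumed layout).

    Returns one traffic estimate per slice: the largest filtered value
    seen in that slice across all days in the history.
    """
    n_slices = len(history[0])
    profile = []
    for s in range(n_slices):
        day_peaks = [max(mad_filter(day[s], k)) for day in history]
        profile.append(max(day_peaks))
    return profile

# Two days, two slices; the 90 in day 0 / slice 0 is a spike to be filtered.
history = [
    [[10, 11, 12, 11, 90], [20, 21, 22, 21, 20]],
    [[12, 13, 12, 11, 12], [25, 24, 26, 25, 24]],
]
profile = traffic_profile(history)
```

Taking the maximum rather than the mean keeps the estimate conservative, which suits a profile whose job is to size capacity for the worst recent day.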

Elasticity Strategies

Predictive Elasticity: Forecasts traffic for the next time‑slice and pre‑emptively scales up or down. Four traffic‑trend cases are defined (rising, turning‑up, peak, falling) and scaling actions are derived from the case.

Load‑Feedback Elasticity: Continuously monitors real‑time CPU and custom metrics (e.g., latency) and adjusts instance counts to keep load within target ranges. Scaling‑up is performed immediately; scaling‑down is delegated to the other strategies.

Timed Elasticity: Executes fixed scaling actions before known peak periods and after off‑peak periods based on the maximum traffic observed in each phase.

Priority order: timed > predictive > load‑feedback.
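The article does not describe how conflicts between strategies are resolved beyond this ordering. One simple reading, sketched below under that assumption, is that the highest-priority strategy with a proposal for the current cycle wins. The strategy keys and the `arbitrate` function are hypothetical names.

```python
# Highest priority first, following the order stated in the text.
PRIORITY = ("timed", "predictive", "load_feedback")

def arbitrate(proposals):
    """Pick the proposal of the highest-priority strategy that has one.

    proposals maps a strategy name to a target instance count,
    or to None when that strategy has nothing to propose this cycle.
    """
    for name in PRIORITY:
        target = proposals.get(name)
        if target is not None:
            return name, target
    return None, None  # no strategy wants to act

winner = arbitrate({"predictive": 120, "load_feedback": 140})
```

Here `winner` names the predictive strategy, because no timed action is pending and predictive outranks load feedback.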

Stability Guarantees

Elastic inspections periodically trigger instance migrations to validate scaling capability.

Capacity inspections monitor resource usage and raise alerts when limits are approached.

Status inspections verify service state consistency across scaling cycles.

One‑click interventions provide rapid rollback or emergency actions.

Implementation Highlights

Standardized container migration and compute‑storage separation reduce dictionary download and extraction time, improving instance startup latency.

Shared cloud disks enable on‑demand loading of large dictionary files, further cutting deployment time.

Target instance counts are bounded by configured upper/lower limits and step‑size constraints to avoid over‑scaling or abrupt shrinkage.
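The bounding rule in the last point can be sketched as a two-stage clamp: first limit how far a single action may move away from the current count, then enforce the configured floor and ceiling. The parameter names are assumptions chosen for illustration.

```python
def bound_target(current, desired, lower, upper, max_step_up, max_step_down):
    """Clamp a desired instance count by step size, then by hard limits."""
    # Limit the change made by one scaling action.
    stepped = max(current - max_step_down, min(desired, current + max_step_up))
    # Enforce the configured lower and upper bounds.
    return max(lower, min(stepped, upper))

# A request to jump from 100 to 180 instances is cut to one step of 20.
bounded = bound_target(100, 180, lower=50, upper=150, max_step_up=20, max_step_down=10)
```

Using a smaller step for shrinking than for growing, as in this example, is one way to make scale-down more cautious than scale-up.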

Architecture Overview

The overall elasticity architecture consists of service profiling, elastic strategy engines, cloud‑native components (PaaS for scaling actions, ALM for data and policy management), and resource pools (private and public clouds).

[Figure: data flow through the elasticity architecture]

Predictive Elasticity Details

For each service, the previous, current, and next time‑slice traffic values (prev, cur, next) are obtained from the maximum traffic of the past N days. The four cases are:

prev < cur < next – continuous rise → pre‑scale to next.

prev > cur < next – valley turning up → pre‑scale to next.

prev < cur > next – peak → no action.

prev > cur > next – falling trend → scale down to cur.

Target capacity is the larger of the case‑based target traffic and a growth‑rate‑based estimate (cur × maxGrowthRate). The target instance count is then computed from the capacity profile and applied via PaaS.
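The four cases and the target rule can be sketched as follows. Ties such as prev == cur are not covered by the cases above, so treating them as "no action" is an assumption of this sketch, as are the function names.

```python
def classify(prev, cur, nxt):
    """Map three consecutive slice estimates to a traffic-trend case."""
    if prev < cur < nxt:
        return "rising"
    if prev > cur < nxt:
        return "turning_up"
    if prev < cur > nxt:
        return "peak"
    if prev > cur > nxt:
        return "falling"
    return "flat"  # ties: not defined in the text, treated as no action

def predictive_target(prev, cur, nxt, max_growth_rate):
    """Return the target traffic to provision for, or None for no action."""
    case = classify(prev, cur, nxt)
    if case in ("rising", "turning_up"):
        case_target = nxt   # pre-scale to the next slice
    elif case == "falling":
        case_target = cur   # scale down only as far as the current slice
    else:
        return None         # peak or flat: leave capacity alone
    # Keep the larger of the case target and the growth-rate estimate.
    return max(case_target, cur * max_growth_rate)

# Valley turning up: the growth-rate estimate (20 * 1.2) exceeds next (22).
target = predictive_target(prev=30, cur=20, nxt=22, max_growth_rate=1.2)
```

The growth-rate term acts as a floor: when the forecast for the next slice is only slightly above the current one, the target still leaves room for the fastest growth seen historically.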

Load‑Feedback Elasticity Details

Metrics collected every 10 s (CPU usage, custom Prometheus metrics) are aggregated in a sliding window (e.g., 1 min) and filtered with median‑absolute‑deviation to remove outliers. The current load is compared against the capacity profile’s CPU buffer; if the load exceeds the upper threshold, instances are added, respecting step‑size limits. Scaling‑down is omitted to avoid conflict with predictive actions.
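A hedged sketch of the scale-up decision follows. Two simplifications are assumptions of this example rather than details from the article: the window median stands in for the MAD-filtered aggregate as a simpler robust statistic, and the instance count is sized with a proportional rule (current × load / target utilization), similar in spirit to the one used by the Kubernetes Horizontal Pod Autoscaler.

```python
import math
from statistics import median

def feedback_target(samples, current, upper_threshold, target_util, max_step):
    """Decide the instance count from one sliding window of CPU utilization.

    samples: utilization readings in [0, 1], e.g. six 10 s samples for 1 min.
    Only scale-up is handled; scale-down is left to the other strategies.
    """
    load = median(samples)  # robust to a single bad reading in the window
    if load <= upper_threshold:
        return current
    # Size the fleet so the same work lands at the target utilization.
    desired = math.ceil(current * load / target_util)
    # Respect the step-size limit, and never shrink from this path.
    return max(current, min(desired, current + max_step))

# One stray 0.20 reading does not hide the sustained overload around 0.82.
window = [0.82, 0.80, 0.84, 0.81, 0.83, 0.20]
new_count = feedback_target(window, current=100, upper_threshold=0.75,
                            target_util=0.60, max_step=20)
```

With these numbers the proportional rule asks for 136 instances, and the step limit caps the action at 120.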

Timed Elasticity Details

Peak and off‑peak periods are defined per service based on historical traffic slices. The maximum traffic within each period determines the target capacity. Scaling actions are scheduled to expand capacity shortly before a peak starts and shrink it after the peak ends.
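As an illustration, a timed plan for a single daily peak could be derived from a per-slice traffic profile like this. The sketch assumes hourly slices, a peak window that does not wrap around midnight, and a lead time measured in slices; all names are hypothetical.

```python
def timed_plan(profile, peak_start, peak_end, lead=1):
    """Build (slice_index, target_traffic) actions for one daily peak.

    profile: per-slice traffic estimates for one day.
    The peak covers slices [peak_start, peak_end).
    """
    n = len(profile)
    peak = profile[peak_start:peak_end]
    off_peak = profile[:peak_start] + profile[peak_end:]
    return [
        ((peak_start - lead) % n, max(peak)),  # expand shortly before the peak
        (peak_end % n, max(off_peak)),         # shrink once the peak has ended
    ]

# 24 hourly slices with an evening peak from 18:00 to 22:00.
profile = [10] * 24
profile[8] = 30
profile[18:22] = [50, 80, 70, 40]
plan = timed_plan(profile, peak_start=18, peak_end=22)
```

The plan expands to the peak maximum one slice early and returns to the off-peak maximum, not the off-peak minimum, so the morning bump at slice 8 stays covered.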

Capacity Modeling

Peak CPU utilization is used as a proxy for required capacity. Core services keep a larger CPU buffer to guarantee latency, while non‑core services tolerate smaller buffers. Machine‑learning models map QPS and resource usage X to latency, f(qps, X) = latency, and are used to compute the optimal CPU buffer for each service.
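To make the idea concrete, the sketch below searches for the highest CPU utilization whose predicted latency still meets a threshold, then converts a traffic target into an instance count. The latency curve is a toy stand-in for the learned model, assumed only to increase with utilization; the thresholds, core counts, and function names are illustrative assumptions.

```python
import math

def toy_latency_ms(util, base_ms=20.0):
    """Stand-in for the learned model: latency rises sharply near full load."""
    return base_ms / (1.0 - util)

def max_safe_utilization(latency_fn, threshold_ms, tol=1e-4):
    """Binary-search the highest utilization that meets the latency threshold.

    Assumes latency_fn is monotonically increasing in utilization.
    """
    lo, hi = 0.0, 0.999
    if latency_fn(lo) > threshold_ms:
        raise ValueError("threshold cannot be met even when idle")
    if latency_fn(hi) <= threshold_ms:
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if latency_fn(mid) <= threshold_ms:
            lo = mid
        else:
            hi = mid
    return lo  # always on the safe side of the threshold

def instances_needed(target_cpu_cores, cores_per_instance, safe_util):
    """Instances required to serve the target load at the safe utilization."""
    return math.ceil(target_cpu_cores / (cores_per_instance * safe_util))

core_util = max_safe_utilization(toy_latency_ms, threshold_ms=50.0)       # stricter
non_core_util = max_safe_utilization(toy_latency_ms, threshold_ms=100.0)  # looser
core_buffer = 1.0 - core_util
non_core_buffer = 1.0 - non_core_util
count = instances_needed(500, cores_per_instance=8, safe_util=core_util)
```

Under this toy curve the stricter threshold yields a safe utilization near 0.6 (a buffer of about 40%), and the looser one near 0.8, which matches the pattern of core services keeping the larger buffer.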

Results

Deploying the serverless elasticity framework across Baidu Feed scaled the system to over 100 000 service instances while significantly reducing operational costs.

Future Work

Future work will focus on capacity assurance for hotspot events and on applying machine‑learning techniques to improve the prediction accuracy of traffic profiles.
