How Alibaba’s Dynamic Compute Transforms Ad Engine Efficiency
This article describes Alimama's dynamic compute system: its architecture, offline and online tidal-compute mechanisms, city-level mutual backup, RT control, large-scale promotion handling, metric integration, and recent infrastructure upgrades. It reports concrete performance gains and outlines the remaining challenges in green, intelligent ad-engine resource management.
Overview
Dynamic compute is a runtime system that continuously optimizes CPU allocation for Alimama's ad engine (≈1 M cores). It adjusts the tier levels of containers based on real-time metrics, which reduces idle resources and improves latency.
Application Layer
Daily Tidal Compute – Offline
In the recall stage, offline (full-label) and online (incremental) pipelines share the same pool of machines. Roughly every 10 s, dynamic compute adjusts the proportion of offline machines that are repurposed for online tasks. This saves ~20 % of cores (tens of thousands) and reduces end-to-end latency by ~2 ms. The spike at 06:00 is caused by a scheduled data-cut operation.
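The periodic adjustment described above can be sketched as a small step controller. This is a minimal illustration, not Alimama's implementation: the function name, the CPU-utilisation signal, the 50 % target, the step size, and the cap are all assumptions. Only the roughly 10 s cadence comes from the text.

```python
def next_lent_machines(lent, online_util_pct, target_util_pct=50,
                       step=10, max_lent=400):
    """Return how many offline machines to lend to online work next cycle.

    All parameters are hypothetical; a caller would invoke this roughly
    every 10 s with a fresh online utilisation reading.
    """
    if online_util_pct > target_util_pct:
        lent += step      # online is busy: borrow more offline machines
    elif online_util_pct < target_util_pct:
        lent -= step      # online is quiet: hand machines back to offline
    # Never lend a negative number of machines or more than the cap.
    return max(0, min(max_lent, lent))
```

Using whole machine counts and integer percentages keeps the arithmetic exact. A production controller would presumably add smoothing and a dead band to avoid oscillation.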
Daily Tidal Compute – Online
Four core online links (recall, ranking, policy, creative) are integrated with dynamic compute. When system load is low (at night), idle CPU is used to raise tier levels, increasing capacity without adding hardware. In an experiment on a small traffic slice, raising tiers by 20 % changed cost by +0.8 %, page views (PV) by +0.6 %, and revenue per mille (RPM) by +0.2 %.
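One way to picture "using idle CPU to raise tiers" is to pick the highest tier whose projected utilisation still fits under a target. The sketch below assumes that CPU utilisation scales linearly with a per-tier cost factor. The cost table, the 60 % target, and the function name are hypothetical and do not come from the source.

```python
# Hypothetical relative CPU cost of each tier (tier 1 is the baseline, 100).
TIER_COST = [80, 100, 120, 150]

def highest_affordable_tier(util_pct, current_tier, tier_cost=TIER_COST,
                            target_util_pct=60):
    """Pick the highest tier whose projected CPU utilisation stays on target.

    Projected utilisation is assumed to be
    util_pct * tier_cost[t] / tier_cost[current_tier].
    """
    best = 0  # fall back to the lowest tier if nothing fits
    for tier, cost in enumerate(tier_cost):
        # Integer form of the projection, avoiding floating-point division.
        if util_pct * cost <= target_util_pct * tier_cost[current_tier]:
            best = max(best, tier)
    return best
```

With a night-time reading of 20 % utilisation at tier 1, every tier in this table fits under the target, so the top tier is chosen. At 55 % the function keeps the current tier.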
City‑Level Mutual Backup
To survive a data-center failure, traffic is shifted to another city while keeping the increase in front-end timeouts below 1 %. The procedure includes:
Catalog all traffic scenarios and ensure each core service can be switched.
Run single‑center cut‑over drills and tune control policies for fast convergence.
Execute city‑level mutual‑backup drills and generate loss‑impact reports.
During a failover, tidal compute is disabled, offline machines are reassigned to online, and a global RT controller lowers tiers of low‑margin modules. Tier convergence completes within 3 minutes; cost drop is < 7 % for the worst‑case center and < 5 % for others.
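The first two failover actions (disabling tidal compute and moving offline machines to online) amount to a simple state transition. The sketch below is illustrative only: the `PoolState` type and its fields are assumptions, and the tier lowering performed by the global RT controller is deliberately left out.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class PoolState:
    """Hypothetical snapshot of one data center's shared machine pool."""
    tidal_enabled: bool
    online_machines: int
    offline_machines: int

def enter_failover(state):
    """Disable tidal compute and reassign every offline machine to online."""
    return replace(
        state,
        tidal_enabled=False,
        online_machines=state.online_machines + state.offline_machines,
        offline_machines=0,
    )
```

Returning a new immutable state rather than mutating in place makes it straightforward to keep the pre-failover snapshot for the later recovery step.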
RT Control
Different front-end response-time (RT) budgets (e.g., 200 ms vs 300 ms) require dynamic tier reduction when traffic spikes raise the system load (its "water level"). The global RT controller reduces the tiers of low-margin modules first and continues until the target RT is met. In the Double-11 promotion, the timeout rate dropped from > 20 % to ~1 % after RT control was enabled.
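The "low-margin modules first" rule can be read as a greedy loop. The following is a sketch under stated assumptions rather than the actual controller: each module is assumed to save a fixed number of milliseconds per tier step, and the `Module` fields and `margin` score are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Module:
    name: str
    tier: int
    margin: float          # hypothetical business value of keeping the tier
    rt_per_tier_ms: int    # hypothetical RT saved by lowering one tier
    min_tier: int = 0

def reduce_tiers(modules, observed_rt_ms, target_rt_ms):
    """Lower tiers in place, lowest margin first, until RT meets the target.

    Returns the estimated RT after the reductions.
    """
    rt = observed_rt_ms
    for module in sorted(modules, key=lambda m: m.margin):
        while rt > target_rt_ms and module.tier > module.min_tier:
            module.tier -= 1
            rt -= module.rt_per_tier_ms
    return rt
```

For example, with an observed RT of 250 ms against a 200 ms budget, a low-margin module that saves 20 ms per step is lowered three times before any higher-margin module is touched.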
Large‑Scale Promotion – Peak Pre‑Adjustment & Fast Recovery
At 19:59 a predefined high tier (10× capacity) is activated; at 20:01 automatic regulation starts. All tiers converge within 3 minutes as traffic falls, keeping CPU usage stable and latency unchanged.
Common Layer
Control Strategies
The Dynamic Container Group Regulator (driven through the OOPS HTTP API) supports multi-objective adjustments with configurable convergence intervals. The Multi-Goal Negative-Feedback Regulator allows goals to be set per data center.
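A multi-goal negative-feedback step with per-data-center goals might look like the sketch below. The metric names, goal values, dead band, and data-center labels are assumptions, and applying the resulting tier (for instance through the OOPS HTTP API) is out of scope here.

```python
# Hypothetical per-data-center goals: metric name -> upper bound.
GOALS = {
    "dc_a": {"cpu_util_pct": 60, "rt_ms": 200},
    "dc_b": {"cpu_util_pct": 50, "rt_ms": 300},
}

def regulate(tier, metrics, goals, deadband_pct=10, min_tier=0, max_tier=10):
    """One negative-feedback step over several goals at once.

    Any violated goal pushes the tier down; the tier rises only when
    every metric sits at least deadband_pct below its goal.
    """
    if any(metrics[name] > goal for name, goal in goals.items()):
        return max(min_tier, tier - 1)
    if all(metrics[name] * 100 < goal * (100 - deadband_pct)
           for name, goal in goals.items()):
        return min(max_tier, tier + 1)
    return tier  # inside the dead band: hold steady
```

The asymmetry (one violation lowers, unanimity raises) is a common way to keep such a loop conservative. The convergence interval would then be the period at which this step is called.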
Metric Input & Processing
Configurable ingestion supports GoldEye, Blink, TPP, EADS KMON, and One‑Engine Khronos indicators. Collection frequencies and aggregation strategies (mean, max‑mean, etc.) are configurable per metric.
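Per-metric aggregation strategies can be modelled as a lookup table of functions. The source does not define "max-mean", so this sketch assumes it means the maximum of per-group means (for example, the hottest machine group). The registry and function names are hypothetical.

```python
from statistics import mean

def agg_mean(groups):
    """Mean over every sample in every group."""
    return mean(value for group in groups for value in group)

def agg_max_mean(groups):
    """Assumed reading of 'max-mean': the largest per-group mean."""
    return max(mean(group) for group in groups)

# Hypothetical registry keyed by the strategy name used in configuration.
AGGREGATORS = {"mean": agg_mean, "max_mean": agg_max_mean}

def aggregate(strategy, groups):
    """Aggregate grouped samples with the strategy configured for a metric."""
    return AGGREGATORS[strategy](groups)
```

Given two groups, `[10, 20]` and `[40, 60]`, the plain mean is 32.5 while this reading of max-mean yields 50, which shows why a hot-spot-sensitive strategy can trigger regulation earlier than a global average.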
Flexible Scheduling
Scheduling at one-second granularity allows different max tiers, fixed tiers, control targets, and templates to take effect at any given second, handling scenarios such as midnight search-depth surges or hourly traffic spikes.
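A schedule with one-second resolution can be kept as a sorted list of start times and looked up with a binary search. In this sketch the template contents and tier values are made up; the 19:59 and 20:01 boundaries are borrowed from the promotion example earlier in the article.

```python
import bisect

def second_of_day(hour, minute, second=0):
    """Convert a wall-clock time to seconds since midnight."""
    return hour * 3600 + minute * 60 + second

# Hypothetical schedule: (start second, template), sorted by start and
# beginning at 0 so that every second of the day is covered.
SCHEDULE = [
    (0, {"mode": "auto", "max_tier": 8}),
    (second_of_day(19, 59), {"mode": "fixed", "tier": 9}),
    (second_of_day(20, 1), {"mode": "auto", "max_tier": 8}),
]

def template_at(schedule, second):
    """Return the template in force at the given second of the day."""
    starts = [start for start, _ in schedule]
    index = bisect.bisect_right(starts, second) - 1
    return schedule[index][1]
```

Because each entry stays in force until the next one starts, only the transition points need to be stored, not one entry per second.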
Infrastructure Layer
Management Upgrades
Added diff‑based release comparison and templated control policies.
View Upgrades
Tier‑detail view, historical tier curves, and snapshot capability for recording tier data.
Integration Model
Architecture
Dynamic compute consists of a client SDK (C++/Java) and a server (controller, management UI). The client contacts the controller to obtain the optimal tier for each control point and can apply custom policies (e.g., DC‑AF, Q‑Score, user‑defined).
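The client/controller split can be illustrated with an in-process stand-in. This is not the real SDK (which is C++/Java) and does not reproduce its API: the class names, the `optimal_tier` method, the default-tier fallback, and the policy hook signature are all assumptions.

```python
class Controller:
    """Stand-in for the server side: maps control points to tiers."""

    def __init__(self, tiers):
        self._tiers = dict(tiers)

    def optimal_tier(self, control_point):
        return self._tiers[control_point]

class Client:
    """Stand-in for the SDK: fetches a tier, then applies a custom policy."""

    def __init__(self, controller, policy=None, default_tier=0):
        self._controller = controller
        self._policy = policy
        self._default_tier = default_tier

    def tier_for(self, control_point):
        try:
            tier = self._controller.optimal_tier(control_point)
        except KeyError:
            # Unknown control point: fall back to a safe default tier.
            tier = self._default_tier
        if self._policy is not None:
            tier = self._policy(control_point, tier)
        return tier
```

A user-defined policy is then just a callable, for example `Client(controller, policy=lambda point, tier: min(tier, 5))` to cap every control point at tier 5.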
EADS Integration
The agent is merged into the EADS framework. Configuration files are eliminated, and settings are rendered automatically from environment variables. The APPID is shared across services, and inter-service parameter passing is supported. Parameter-acquisition code was reduced from eight lines to one, and flow control and RT control are now configuration-driven.
Old version required separate APPID per application; new version shares APPID.
Old version lacked inter‑service parameter passing; new version supports it.
Old flow‑control needed six lines of code; new version is configuration‑driven.
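Rendering settings from environment variables instead of configuration files can be sketched as follows. The variable names (`DC_APPID`, `DC_FLOW_CONTROL`, `DC_RT_TARGET_MS`) and the defaults are hypothetical; the source does not list the real ones.

```python
import os

def load_settings(env=None):
    """Build dynamic-compute settings from environment variables.

    Pass a dict for testing; by default the process environment is used.
    """
    env = os.environ if env is None else env
    return {
        # Shared across services, so it is required rather than defaulted.
        "app_id": env["DC_APPID"],
        # Flow control and RT control are switched on by configuration only.
        "flow_control": env.get("DC_FLOW_CONTROL", "off") == "on",
        "rt_target_ms": int(env.get("DC_RT_TARGET_MS", "200")),
    }
```

With this shape, enabling flow control is a deployment-time change to one variable rather than an application code change, which matches the configuration-driven behaviour described above.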
Conclusion
Dynamic compute abstracts common compute‑allocation functions and provides a lightweight integration experience. Achieving truly global optimal allocation across all scenarios remains an open challenge. Future work will explore tighter coupling between compute, capacity, and business effect, more advanced decision algorithms, and continued usability and reliability improvements.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.
