
How Alibaba’s Dynamic Compute Transforms Ad Engine Efficiency

This article details Alimama’s dynamic compute system: its architecture, offline and online tidal‑compute mechanisms, city‑level mutual backup, RT control, large‑scale promotion handling, metric integration, and recent infrastructure upgrades. It reports concrete performance gains and outlines the remaining challenges for green, intelligent ad‑engine resource management.

Alimama Tech

Overview

Dynamic compute is a runtime system that continuously optimizes CPU allocation for Alimama’s ad engine (roughly 1 million cores). It adjusts the tier levels of containers based on real‑time metrics, which reduces idle resources and improves latency.

Application Layer

Daily Tidal Compute – Offline

In the recall stage, the offline (full‑label) and online (incremental) pipelines share one pool of machines. Roughly every 10 s, dynamic compute adjusts the proportion of offline machines that are repurposed for online tasks. This saved ~20 % of cores (tens of thousands) and reduced end‑to‑end latency by ~2 ms. The spike at 06:00 is caused by a scheduled data‑cut operation.
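The periodic adjustment described above can be pictured as a small stepping loop. The sketch below is only an illustration under stated assumptions: the function name, the utilisation target, the dead band, the step size, and the cap are all hypothetical and do not come from the article.

```python
def next_lend_share(online_util, current_share,
                    target_util=0.5, dead_band=0.05,
                    step=0.05, max_share=0.8):
    """Return the share of offline machines to lend to online tasks
    for the next control period (the article cites roughly 10 s).

    All thresholds here are illustrative assumptions.
    """
    if online_util > target_util + dead_band:
        # Online side is busy: borrow a little more offline capacity.
        candidate = current_share + step
    elif online_util < target_util - dead_band:
        # Online side is quiet: hand capacity back to offline jobs.
        candidate = current_share - step
    else:
        # Inside the dead band: hold steady to avoid oscillation.
        candidate = current_share
    # Never lend a negative share or more than the configured cap.
    return min(max_share, max(0.0, candidate))


if __name__ == "__main__":
    share = 0.2
    for util in (0.7, 0.7, 0.5, 0.3):
        share = next_lend_share(util, share)
        print(f"online_util={util:.2f} -> lend_share={share:.2f}")
```

A step controller with a dead band is one simple way to get the tidal behaviour; the production system may well use a different rule.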

Daily Tidal Compute – Online

Four core online links (recall, ranking, policy, creative) are integrated with dynamic compute. When system load is low (at night), idle CPU is used to raise tier levels, increasing capacity without adding hardware. An experiment on a small traffic slice showed that a 20 % tier increase came with cost +0.8 %, PV +0.6 %, and RPM +0.2 %.

City‑Level Mutual Backup

To survive a data‑center failure, traffic is shifted to another city while keeping the increase in front‑end timeouts below 1 %. The procedure includes:

Catalog all traffic scenarios and ensure each core service can be switched.

Run single‑center cut‑over drills and tune control policies for fast convergence.

Execute city‑level mutual‑backup drills and generate loss‑impact reports.

During a failover, tidal compute is disabled, offline machines are reassigned to online serving, and a global RT controller lowers the tiers of low‑margin modules. Tier convergence completes within 3 minutes; the drop in cost is < 7 % for the worst‑case center and < 5 % for the others.

RT Control

Front‑end callers have different response‑time (RT) budgets (e.g., 200 ms vs 300 ms), so tiers must be reduced dynamically when traffic spikes push up system load (the "water level"). The global RT controller lowers the tiers of low‑margin modules first and continues until the target RT is met. During the Double‑11 promotion, the timeout rate dropped from > 20 % to ~1 % after RT control was enabled.
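A greedy plan is one way to picture "lowest margin first, until the target RT is met". The sketch below is a simplified model, not the controller's actual logic: the `Module` fields, the linear RT saving per tier step, and all numbers are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Module:
    name: str
    margin: float        # hypothetical business value per tier
    tier: int            # current tier
    min_tier: int        # lowest tier the module may run at
    rt_saving_ms: float  # assumed RT saved per one-tier reduction


def plan_tier_reduction(modules, current_rt_ms, target_rt_ms):
    """Lower tiers of the lowest-margin modules first until the
    estimated RT meets the target or no module can go lower.

    Returns (new_tiers_by_name, estimated_rt_ms).
    """
    plan = {}
    rt = current_rt_ms
    for m in sorted(modules, key=lambda mod: mod.margin):
        tier = m.tier
        while rt > target_rt_ms and tier > m.min_tier:
            tier -= 1
            rt -= m.rt_saving_ms
        if tier != m.tier:
            plan[m.name] = tier
        if rt <= target_rt_ms:
            break
    return plan, rt


if __name__ == "__main__":
    modules = [
        Module("ranking", margin=5.0, tier=4, min_tier=2, rt_saving_ms=10.0),
        Module("creative", margin=1.0, tier=3, min_tier=1, rt_saving_ms=15.0),
    ]
    print(plan_tier_reduction(modules, current_rt_ms=240.0, target_rt_ms=200.0))
```

In this example the low-margin `creative` module is driven to its floor before the higher-margin `ranking` module gives up a single tier, which is the prioritisation the section describes.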

Large‑Scale Promotion – Peak Pre‑Adjustment & Fast Recovery

At 19:59 a predefined high tier (10× capacity) is activated; at 20:01 automatic regulation starts. All tiers converge within 3 minutes as traffic falls, keeping CPU usage stable and latency unchanged.
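The two timestamps above amount to a short window in which the tier is pinned, followed by a hand-back to automatic regulation. A minimal sketch of that switch follows; the function name, the mode labels, and the `pinned_tier` parameter are hypothetical, and only the 19:59 and 20:01 boundaries come from the text.

```python
from datetime import time

PIN_START = time(19, 59)   # predefined tier is activated
AUTO_RESUME = time(20, 1)  # automatic regulation takes over again


def control_mode(now, pinned_tier):
    """Return ("fixed", tier) inside the pre-adjustment window,
    otherwise ("auto", None) so the regulator decides the tier."""
    if PIN_START <= now < AUTO_RESUME:
        return ("fixed", pinned_tier)
    return ("auto", None)


if __name__ == "__main__":
    for t in (time(19, 58, 59), time(19, 59, 30), time(20, 1)):
        print(t, control_mode(t, pinned_tier=10))
```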

Common Layer

Control Strategies

The Dynamic Container Group Regulator (via the OOPS HTTP API) supports multi‑objective adjustments with configurable convergence intervals. The Multi‑Goal Negative‑Feedback Regulator allows goals to be set per data center.
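Negative feedback with per-data-center goals can be illustrated with one control step per center: if the measured value is above its goal, the tier goes down; if it is below, the tier goes up. The sketch below assumes the controlled metric is CPU utilisation and uses hypothetical names, tolerances, and tier bounds; it does not reflect the regulators' real interfaces.

```python
def feedback_step(tier, measured, goal,
                  tolerance=0.05, min_tier=1, max_tier=10):
    """One negative-feedback step for a single data center."""
    if measured > goal + tolerance:
        tier -= 1   # overloaded relative to goal: shed work
    elif measured < goal - tolerance:
        tier += 1   # headroom relative to goal: do more work
    return max(min_tier, min(max_tier, tier))


def regulate(tiers, measured, goals):
    """Apply one step to every data center, each against its own goal."""
    return {dc: feedback_step(tiers[dc], measured[dc], goals[dc])
            for dc in tiers}


if __name__ == "__main__":
    tiers = {"dc_a": 5, "dc_b": 5}
    measured = {"dc_a": 0.80, "dc_b": 0.30}
    goals = {"dc_a": 0.60, "dc_b": 0.60}
    print(regulate(tiers, measured, goals))
```

Calling `regulate` once per convergence interval gives the closed loop; a shorter interval converges faster but reacts more to noise.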

Metric Input & Processing

Configurable ingestion supports GoldEye, Blink, TPP, EADS KMON, and One‑Engine Khronos indicators. Collection frequencies and aggregation strategies (mean, max‑mean, etc.) are configurable per metric.
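Per-metric aggregation can be sketched as a small dispatch on a strategy name. The article does not define "max‑mean", so the interpretation used here (the mean of per-window maxima) is an assumption, as are the function name and the `window` parameter.

```python
from statistics import mean


def aggregate(samples, strategy="mean", window=3):
    """Collapse raw metric samples into one control input.

    "max_mean" is implemented as the mean of the maxima of consecutive
    windows. This is only one plausible reading of the term.
    """
    if not samples:
        raise ValueError("no samples to aggregate")
    if strategy == "mean":
        return mean(samples)
    if strategy == "max_mean":
        chunks = [samples[i:i + window]
                  for i in range(0, len(samples), window)]
        return mean(max(chunk) for chunk in chunks)
    raise ValueError(f"unknown strategy: {strategy}")


if __name__ == "__main__":
    rt_ms = [1, 5, 2, 4, 9, 3]
    print(aggregate(rt_ms, "mean"))
    print(aggregate(rt_ms, "max_mean", window=3))
```

Under this reading, a peak-sensitive aggregate reacts to short bursts that a plain mean would smooth away, which matters when the control target is a latency budget.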

Flexible Scheduling

Scheduling operates at one‑second granularity: a different max tier, fixed tier, control target, or template can take effect at any given second. This handles scenarios such as midnight search‑depth surges or hourly traffic spikes.
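One way to model this kind of schedule is a sorted list of rules keyed by the second of day at which each takes effect, with a binary search to find the rule in force. The `Rule` fields and the example values below are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    start_s: int                      # second of day the rule takes effect
    max_tier: int                     # upper bound for the regulator
    fixed_tier: Optional[int] = None  # if set, pin the tier instead


def active_rule(rules, second_of_day):
    """Return the latest rule with start_s <= second_of_day, or None.

    `rules` must be sorted by start_s.
    """
    starts = [r.start_s for r in rules]
    i = bisect_right(starts, second_of_day)
    return rules[i - 1] if i else None


if __name__ == "__main__":
    rules = [
        Rule(start_s=0, max_tier=8),
        Rule(start_s=3600, max_tier=5),                 # hourly spike
        Rule(start_s=3660, max_tier=8, fixed_tier=3),   # one minute later
    ]
    for s in (3599, 3600, 3660):
        print(s, active_rule(rules, s))
```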

Infrastructure Layer

Management Upgrades

Added diff‑based release comparison and templated control policies.

View Upgrades

Added a tier‑detail view, historical tier curves, and a snapshot capability for recording tier data.

Integration Model

Architecture

Dynamic compute consists of a client SDK (C++/Java) and a server (controller, management UI). The client contacts the controller to obtain the optimal tier for each control point and can apply custom policies (e.g., DC‑AF, Q‑Score, user‑defined).
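The client-to-controller interaction can be illustrated with a tiny wrapper that asks for a tier per control point and falls back to the last known value if the controller is unreachable. The real SDK is C++/Java and its API is not shown in the article; the class, method names, and fallback behaviour below are assumptions for illustration only.

```python
class TierClient:
    """Hypothetical client: fetches the tier for a control point and
    degrades gracefully when the controller call fails."""

    def __init__(self, fetch_tier, default_tier):
        # fetch_tier stands in for the real controller call:
        # callable(control_point: str) -> int
        self._fetch = fetch_tier
        self._default = default_tier
        self._last_known = {}

    def tier_for(self, control_point):
        try:
            tier = self._fetch(control_point)
            self._last_known[control_point] = tier
        except Exception:
            # Controller unreachable: reuse the last tier, else the default.
            tier = self._last_known.get(control_point, self._default)
        return tier


if __name__ == "__main__":
    def fake_controller(control_point):
        return {"recall": 7, "ranking": 4}[control_point]

    client = TierClient(fake_controller, default_tier=3)
    print(client.tier_for("recall"))
    print(client.tier_for("unknown_point"))  # lookup fails, default is used
```

Keeping the last known tier on the client side means a controller outage freezes tiers rather than resetting them, which is one reasonable design choice among several.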

EADS Integration

The agent is merged into the EADS framework. Configuration files are eliminated, and settings are rendered automatically from environment variables. APPID is shared across services, and inter‑service parameter passing is supported. Parameter‑acquisition code was reduced from eight lines to one, and flow control and RT control are now configuration‑driven.

Old version required separate APPID per application; new version shares APPID.

Old version lacked inter‑service parameter passing; new version supports it.

Old flow‑control needed six lines of code; new version is configuration‑driven.
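Rendering settings from environment variables instead of configuration files can be sketched as a prefix filter over the process environment. The `DYNAMIC_COMPUTE_` prefix and the key names are hypothetical; the article does not list the actual variables EADS uses.

```python
import os


def load_settings(environ=None, prefix="DYNAMIC_COMPUTE_"):
    """Collect settings from environment variables sharing a prefix.

    The prefix is an illustrative assumption, not a documented name.
    """
    env = os.environ if environ is None else environ
    return {key[len(prefix):].lower(): value
            for key, value in env.items()
            if key.startswith(prefix)}


if __name__ == "__main__":
    fake_env = {
        "DYNAMIC_COMPUTE_APPID": "ads_shared",
        "DYNAMIC_COMPUTE_RT_TARGET_MS": "200",
        "PATH": "/usr/bin",
    }
    print(load_settings(fake_env))
```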

Conclusion

Dynamic compute abstracts common compute‑allocation functions and provides a lightweight integration experience. Achieving truly global optimal allocation across all scenarios remains an open challenge. Future work will explore tighter coupling between compute, capacity, and business effect, more advanced decision algorithms, and continued usability and reliability improvements.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
