Boost AI Model Performance: Master Host‑Device Scheduling on Ascend Platforms

This article explains how CPUs and Ascend AI processors cooperate as host and device, compares sink and host scheduling modes, defines Host‑Bound and Device‑Bound models, and presents optimization techniques such as tiling cache, multi‑core concurrency, and small‑shape operator handling that dramatically improve AI model execution efficiency.

Huawei Cloud Developer Alliance

Host and Device Collaboration in AI Model Execution

During AI model runtime, the CPU (host) and a dedicated AI processor such as the Ascend NPU (device) work together. The host handles complex logical calculations, while the device excels at high‑parallel computation. Efficient scheduling between host and device is essential for maximizing performance and resource utilization.

01 GE Scheduling Modes

The Graph Engine (GE) is the control center for compiling and running computation graphs on Ascend platforms. It provides two common scheduling modes:

Sink Scheduling – Suited for static‑shape models. Memory layout and tiling are determined at compile time, allowing the entire graph to be dispatched to the device as a single task, minimizing host overhead.

Host Scheduling – Suited for dynamic‑shape models. Since input tensor shapes vary, each operator’s InferShape, tiling, and memory allocation are performed at runtime, requiring tighter host‑device coordination.
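The difference between the two modes can be sketched conceptually. This is a toy Python illustration, not the CANN/GE API: the `tile` function and both `run_*` helpers are hypothetical stand-ins showing *when* tiling work happens (once at compile time vs. per operator at runtime).

```python
def tile(op_name, shape):
    """Pretend tiling: derive a block size from the runtime shape."""
    return {"op": op_name, "block": max(1, shape[0] // 8)}

def run_sink_mode(graph, shape):
    # Sink scheduling: tiling for every operator is resolved up front at
    # "compile time", then the whole graph is dispatched as one device task.
    plans = [tile(op, shape) for op in graph]
    return [f"{p['op']}(block={p['block']})" for p in plans]

def run_host_mode(graph, shapes):
    # Host scheduling: shapes vary per step, so InferShape/tiling must run
    # on the host at runtime before each individual operator launch.
    results = []
    for op, shape in zip(graph, shapes):
        plan = tile(op, shape)  # per-operator runtime work on the host
        results.append(f"{plan['op']}(block={plan['block']})")
    return results
```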

For a deeper dive, see the companion article “In‑Depth Analysis of Ascend CANN Model Sink Technology”.


02 Host‑Bound and Device‑Bound Models

In dynamic‑shape scenarios, host and device execution proceeds asynchronously. If the host dispatches operators faster than the device can execute them, the device never idles, and the model is Device‑Bound. Conversely, if the device finishes its tasks before the host can dispatch the next operator, the device waits, making the model Host‑Bound. Optimizing the slower side improves end‑to‑end performance.
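The distinction boils down to comparing aggregate host dispatch time against aggregate device execution time. A minimal sketch, assuming you have per‑operator timings in microseconds (the function name and timing inputs are illustrative, not part of any Ascend tool):

```python
def classify_schedule(host_dispatch_us, device_exec_us):
    """Classify a run as Host-Bound or Device-Bound from per-operator timings.

    If the host spends longer dispatching operators than the device spends
    executing them, the device sits idle waiting for work (Host-Bound);
    otherwise the host keeps the device saturated (Device-Bound)."""
    return "Host-Bound" if sum(host_dispatch_us) > sum(device_exec_us) else "Device-Bound"
```

In practice these timings would come from a profiler trace; the point is that only the slower side is worth optimizing first.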

Illustration of operator execution on host and device:

Host‑Device execution diagram

03 Host Scheduling Optimization Techniques

Host Cache (Tiling Cache)

By caching the results of expensive tiling calculations, the host can reuse them when identical parameters appear, reducing tiling overhead from tens of microseconds to near‑zero. The cache distinguishes compile‑time parameters (fixed) from runtime parameters (shape‑dependent) and generates a hash key for lookup.
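The mechanism can be sketched as a dictionary keyed on both parameter kinds. A minimal Python illustration (the `expensive_tiling` stand-in and `TilingCache` class are hypothetical, not the CANN implementation):

```python
def expensive_tiling(compile_params, runtime_shape):
    # Stand-in for a costly tiling computation (tens of microseconds on a real host).
    cores = compile_params["cores"]
    return {"block_dim": max(1, runtime_shape[0] // cores),
            "tail": runtime_shape[0] % cores}

class TilingCache:
    """Cache keyed on (fixed compile-time params, shape-dependent runtime params)."""
    def __init__(self):
        self._cache = {}
        self.hits = 0

    def get(self, compile_params, runtime_shape):
        # Build a hashable key from both parameter groups.
        key = (tuple(sorted(compile_params.items())), tuple(runtime_shape))
        if key in self._cache:
            self.hits += 1          # reuse: near-zero cost
            return self._cache[key]
        result = expensive_tiling(compile_params, runtime_shape)
        self._cache[key] = result
        return result
```

Repeated steps of a model with recurring shapes hit the cache and skip the tiling computation entirely.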

Tiling cache diagram

Enabling this cache cuts host‑side tiling cost by over 50 % on large language models such as Pangu and LLaMA2.

Host CPU Multi‑Core Concurrency

GE splits the dispatch process into three pipeline stages:

Stage 1 (Normal Thread) – Executes host kernels that do not involve device memory or launch, e.g., InferShape and tiling.

Stage 2 (Memory Thread) – Handles device memory allocation and release.

Stage 3 (Launch Thread) – Launches AI Core/AI CPU kernels on the device.

Parallel execution of these stages maximizes CPU utilization and overlaps the latency of successive stages.
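The three stages above can be sketched as a thread-per-stage pipeline connected by queues. This is a simplified Python model of the idea, not GE's actual implementation; the stage bodies just tag each operator to show where its work happened:

```python
import queue
import threading

def pipeline(ops):
    """Three-stage dispatch pipeline: each stage runs in its own thread and
    hands work to the next through a queue, so InferShape/tiling, memory
    allocation, and kernel launch for different operators can overlap."""
    q12, q23, launched = queue.Queue(), queue.Queue(), []

    def stage1():  # Normal Thread: host kernels (InferShape, tiling)
        for op in ops:
            q12.put((op, f"tiled({op})"))
        q12.put(None)  # sentinel signals end of stream

    def stage2():  # Memory Thread: device memory allocation/release
        while (item := q12.get()) is not None:
            op, plan = item
            q23.put((op, plan, f"mem({op})"))
        q23.put(None)

    def stage3():  # Launch Thread: AI Core / AI CPU kernel launch
        while (item := q23.get()) is not None:
            launched.append(item[0])

    threads = [threading.Thread(target=s) for s in (stage1, stage2, stage3)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return launched
```

Because each queue preserves FIFO order, operators still launch in graph order even though the stages run concurrently.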

Small‑Shape Operator Optimization

For operators with tiny input tensors, the host‑side computation cost is only microseconds, often less than the overhead of dispatching the operator to the device. Keeping such operators on the host avoids unnecessary device dispatch, reducing end‑to‑end latency.
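The placement decision is essentially a cost comparison. A minimal sketch, where the dispatch overhead and per-element cost are assumed illustrative numbers rather than measured Ascend figures:

```python
DISPATCH_OVERHEAD_US = 20.0  # assumed fixed per-operator device dispatch cost

def choose_placement(num_elements, per_element_cost_us=0.01):
    """Keep an operator on the host when its estimated host compute time
    is below the fixed cost of dispatching it to the device."""
    host_cost_us = num_elements * per_element_cost_us
    return "host" if host_cost_us < DISPATCH_OVERHEAD_US else "device"
```

A handful of scalar or small-vector ops stay on the host, while large tensors still go to the device where parallel compute dominates.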

Small‑shape operator flow

Applying this technique to LLaMA2 improved throughput by ~5% by retaining ~650 small‑shape operators on the host.

Other Techniques

Additional host‑side optimizations include operator fusion, which merges adjacent operators into a single, more efficient kernel, further reducing host kernel count and dispatch overhead.
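The effect of fusion on dispatch count can be illustrated with plain function composition. This is a conceptual sketch only; real kernel fusion happens at graph compile time, and `fuse` here is a hypothetical helper:

```python
def fuse(*ops):
    """Compose adjacent elementwise operators into one callable, so the
    host dispatches a single fused op instead of one launch per operator."""
    def fused(x):
        for op in ops:
            x = op(x)
        return x
    return fused

# Two elementwise ops become one dispatch:
add_one = lambda x: x + 1
double = lambda x: x * 2
fused_op = fuse(add_one, double)
```

Fewer kernels means fewer InferShape/tiling/launch round trips on the host, which is why fusion helps Host‑Bound models in particular.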

Practical Configuration

Enable multi‑pipeline scheduling by setting the environment variable MAX_RUNTIME_CORE_NUMBER:

```shell
export MAX_RUNTIME_CORE_NUMBER=3
```

Performance tests on typical large models show noticeable reductions in host scheduling time and overall latency.

Performance improvement chart