Boost AI Model Performance: Master Host‑Device Scheduling on Ascend Platforms
This article explains how CPUs and Ascend AI processors cooperate as host and device, compares sink and host scheduling modes, defines Host‑Bound and Device‑Bound models, and presents optimization techniques such as tiling cache, multi‑core concurrency, and small‑shape operator handling that dramatically improve AI model execution efficiency.
Host and Device Collaboration in AI Model Execution
During AI model runtime, the CPU (host) and a dedicated AI processor such as the Ascend NPU (device) work together. The host handles complex control logic and serial computation, while the device excels at highly parallel computation. Efficient scheduling between host and device is essential for maximizing performance and resource utilization.
01 GE Scheduling Modes
The Graph Engine (GE) is the control center for compiling and running computation graphs on Ascend platforms. It provides two common scheduling modes:
Sink Scheduling – Suited for static‑shape models. Memory layout and tiling are determined at compile time, allowing the entire graph to be dispatched to the device as a single task, minimizing host overhead.
Host Scheduling – Suited for dynamic‑shape models. Since input tensor shapes vary, each operator’s InferShape, tiling, and memory allocation are performed at runtime, requiring tighter host‑device coordination.
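To make the contrast concrete, here is a minimal Python sketch of the two control flows. All class and function names are hypothetical stand‑ins, not GE's actual API:

```python
# Hedged sketch (hypothetical names, not GE's real API): the structural
# difference between sink scheduling and host scheduling.

class Op:
    def __init__(self, name):
        self.name = name

    def infer_shape(self, shape):     # runtime InferShape (host work)
        return shape

    def compute_tiling(self, shape):  # runtime tiling (host work)
        return {"block_dim": max(1, shape[0] // 16)}

    def launch(self, shape, tiling):  # stands in for a device kernel launch
        print(f"launch {self.name} shape={shape} tiling={tiling}")
        return shape

def run_sink(ops, shape):
    # Static shapes: InferShape, tiling, and memory layout were resolved at
    # compile time, so the whole graph goes down as one pre-packaged task.
    print("launch entire graph as a single device task")

def run_host_scheduled(ops, shape):
    # Dynamic shapes: every operator needs per-step host work (InferShape,
    # tiling) before it can be launched, so the host stays in the loop.
    for op in ops:
        shape = op.launch(op.infer_shape(shape), op.compute_tiling(shape))

ops = [Op("matmul"), Op("add"), Op("softmax")]
run_sink(ops, (32, 128))
run_host_scheduled(ops, (32, 128))
```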
For a deeper dive, see the companion article “In‑Depth Analysis of Ascend CANN Model Sink Technology” (linked via QR code in the original post).
02 Host‑Bound and Device‑Bound Models
In dynamic‑shape scenarios, host and device execution proceeds asynchronously. If the host dispatches operators faster than the device can execute them, the device never idles, and the model is Device‑Bound. Conversely, if the device finishes its tasks before the host can dispatch the next operator, the device waits, making the model Host‑Bound. Optimizing whichever side is the bottleneck improves end‑to‑end performance.
Illustration of operator execution on host and device:
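A rough way to tell which regime a model is in is to compare total host dispatch time against total device execution time. The sketch below applies that comparison to made‑up per‑operator profiling numbers:

```python
# Rough diagnostic sketch with made-up profiling numbers: if total host
# dispatch time exceeds total device execution time, the device will sit
# idle waiting for work (Host-Bound); otherwise the host keeps the
# device fed (Device-Bound).

ops = [
    # (op name, host dispatch time in us, device execution time in us)
    ("matmul",    20, 180),
    ("layernorm", 15,  12),
    ("add",       10,   4),
]

host_us = sum(dispatch for _, dispatch, _ in ops)
device_us = sum(execute for _, _, execute in ops)

print(f"host dispatch: {host_us} us, device execution: {device_us} us")
print("Host-Bound" if host_us > device_us else "Device-Bound")
```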
03 Host Scheduling Optimization Techniques
Host Cache (Tiling Cache)
By caching the results of expensive tiling calculations, the host can reuse them when identical parameters appear, reducing tiling overhead from tens of microseconds to near‑zero. The cache distinguishes compile‑time parameters (fixed) from runtime parameters (shape‑dependent) and generates a hash key for lookup.
Enabling this cache cuts host‑side tiling cost by over 50% on large language models such as Pangu and LLaMA2.
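As an illustration of the mechanism (not GE's actual implementation), the sketch below caches a hypothetical `compute_tiling` function by hashing its compile‑time and runtime parameters:

```python
# Minimal sketch of a tiling cache. The cache key combines compile-time
# parameters (fixed per operator) with runtime parameters (shape-
# dependent), mirroring the hash-key lookup described above. The
# function names and parameter layout are hypothetical.

from functools import lru_cache

def compute_tiling(op_type, compile_params, runtime_shape):
    # Stand-in for the expensive tiling calculation (tens of microseconds).
    block_dim = max(1, runtime_shape[0] // compile_params[0])
    return {"block_dim": block_dim, "tail": runtime_shape[-1] % 16}

@lru_cache(maxsize=4096)
def cached_tiling(op_type, compile_params, runtime_shape):
    # Arguments must be hashable (tuples), which is what makes the
    # cache lookup near-zero cost compared with recomputation.
    return compute_tiling(op_type, compile_params, runtime_shape)

# Repeated shapes (common in LLM decode loops) hit the cache and skip
# the tiling calculation entirely.
print(cached_tiling("matmul", (16,), (1024, 1024)))  # miss: computed
print(cached_tiling("matmul", (16,), (1024, 1024)))  # hit: cached result
```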
Host CPU Multi‑Core Concurrency
GE splits the dispatch process into three pipeline stages:
Stage 1 (Normal Thread) – Executes host kernels that do not involve device memory or launch, e.g., InferShape and tiling.
Stage 2 (Memory Thread) – Handles device memory allocation and release.
Stage 3 (Launch Thread) – Launches AI Core/AI CPU kernels on the device.
Running these stages in parallel on separate host cores maximizes CPU utilization and overlaps each stage's latency with the others', as the sketch below illustrates.
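The following sketch reproduces the idea with plain Python threads and queues; the stage functions are simplified stand‑ins for GE's internal host kernels:

```python
# Illustrative three-stage dispatch pipeline (GE implements this
# natively on the host CPU). Each stage consumes from the previous
# stage's queue, so InferShape/tiling for op N+1 overlaps memory work
# for op N and the launch of op N-1.

import queue
import threading

SENTINEL = None
to_memory, to_launch = queue.Queue(), queue.Queue()

def normal_thread(ops):   # Stage 1: host kernels (InferShape, tiling)
    for op in ops:
        to_memory.put((op, f"tiling({op})"))
    to_memory.put(SENTINEL)

def memory_thread():      # Stage 2: device memory allocation/release
    while (item := to_memory.get()) is not SENTINEL:
        op, tiling = item
        to_launch.put((op, tiling, f"buffers({op})"))
    to_launch.put(SENTINEL)

def launch_thread():      # Stage 3: launch AI Core / AI CPU kernels
    while (item := to_launch.get()) is not SENTINEL:
        op, tiling, buffers = item
        print(f"launch {op} with {tiling}, {buffers}")

ops = ["matmul", "add", "softmax"]
threads = [threading.Thread(target=normal_thread, args=(ops,)),
           threading.Thread(target=memory_thread),
           threading.Thread(target=launch_thread)]
for t in threads: t.start()
for t in threads: t.join()
```

Bounding the queue sizes would add backpressure, so a fast Stage 1 cannot run arbitrarily far ahead of the launch thread.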
Small‑Shape Operator Optimization
For operators with tiny input tensors, the computation itself takes only a few microseconds, often less than the overhead of dispatching the kernel to the device. Keeping such operators on the host avoids the unnecessary device launch, reducing end‑to‑end latency.
Applying this technique to LLaMA2 improved throughput by roughly 5% by keeping ~650 small‑shape operators on the host.
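Conceptually, the decision looks like the sketch below; the threshold and helper functions are hypothetical, not values taken from CANN:

```python
# Hedged sketch of the dispatch decision for small-shape operators:
# if the tensor is tiny, executing on the host CPU is cheaper than
# paying the device dispatch overhead. The cutoff is illustrative.

import math

SMALL_SHAPE_THRESHOLD = 1024  # elements; hypothetical cutoff

def run_on_host(op, shape):
    print(f"{op}: {math.prod(shape)} elements -> host CPU, no dispatch")

def launch_on_device(op, shape):
    print(f"{op}: {math.prod(shape)} elements -> device kernel launch")

def dispatch(op, shape):
    if math.prod(shape) <= SMALL_SHAPE_THRESHOLD:
        run_on_host(op, shape)  # microseconds of compute, zero launch cost
    else:
        launch_on_device(op, shape)

dispatch("add", (8, 16))          # small shape: stays on the host
dispatch("matmul", (1024, 1024))  # large shape: worth the device launch
```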
Other Techniques
Additional host‑side optimizations include operator fusion, which merges adjacent operators into a single, more efficient kernel, further reducing host kernel count and dispatch overhead.
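A toy illustration of the payoff, using NumPy functions as stand‑ins for kernels (this is not CANN's fusion pass, just the arithmetic of why fusion helps):

```python
# Two elementwise operators fused into one kernel: half the host
# launches, and no intermediate buffer written and read back between
# them. A real fusion pass operates on the graph IR.

import numpy as np

def add_kernel(x, b):      # first launch, produces an intermediate buffer
    return x + b

def relu_kernel(x):        # second launch reads the intermediate back
    return np.maximum(x, 0)

def fused_add_relu(x, b):  # single launch, no intermediate round trip
    return np.maximum(x + b, 0)

x, b = np.random.randn(4, 4), np.random.randn(4, 4)
assert np.allclose(relu_kernel(add_kernel(x, b)), fused_add_relu(x, b))
```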
Practical Configuration
Enable multi‑pipeline scheduling by setting the environment variable `MAX_RUNTIME_CORE_NUMBER`, e.g. `export MAX_RUNTIME_CORE_NUMBER=3`. Performance tests on typical large models show noticeable reductions in host scheduling time and overall latency.
Huawei Cloud Developer Alliance
The Huawei Cloud Developer Alliance provides a technology‑sharing platform for developers and partners, bringing together Huawei Cloud product knowledge, event updates, expert talks, and more. Together we continuously innovate to build the cloud foundation of an intelligent world.