How a Pure‑Software Framework Boosts On‑Device AI Agents by 1.6×
KAIST researchers introduced Agent‑X, a pure‑software acceleration framework that eliminates prefill and decode bottlenecks on mobile devices, achieving a 1.61× end‑to‑end speedup for on‑device AI agents without any loss in task accuracy.
Background and Motivation
AI assistants are moving onto phones and laptops, but running large‑model agents locally often feels painfully slow, and users also demand that data stay on the device for privacy.
Identifying the On‑Device Bottleneck
The research team analyzed the execution pipeline of on‑device agents and found that, unlike on cloud servers, where the prefill stage is trivial, the prefill and decode stages consume comparable time on mobile hardware. Tests on the TinyAgent system (1,022 real cases) showed that optimizing only the decode stage, as is common in cloud deployments, does not improve overall latency on‑device.
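The point about comparable stage costs can be illustrated with an Amdahl-style calculation. The sketch below is not taken from the paper: the function name and the stage shares (10/90 for a cloud-like split, 50/50 for a device-like split) are hypothetical values chosen only to show how the ceiling on a decode-only optimization drops when prefill takes a similar share of the time.

```python
def end_to_end_speedup(stage_fractions, stage_speedups):
    """Amdahl-style estimate of overall speedup.

    stage_fractions: share of baseline latency per stage (must sum to 1).
    stage_speedups:  speedup factor per stage; stages not listed stay at 1.0.
    """
    if abs(sum(stage_fractions.values()) - 1.0) > 1e-9:
        raise ValueError("stage fractions must sum to 1")
    new_time = sum(
        fraction / stage_speedups.get(stage, 1.0)
        for stage, fraction in stage_fractions.items()
    )
    return 1.0 / new_time


# Hypothetical latency splits, for illustration only.
cloud_like = {"prefill": 0.1, "decode": 0.9}
device_like = {"prefill": 0.5, "decode": 0.5}

# Assume some technique doubles decode speed and leaves prefill untouched.
decode_only = {"decode": 2.0}

print(end_to_end_speedup(cloud_like, decode_only))
print(end_to_end_speedup(device_like, decode_only))
```

Under these assumed splits, the same decode-only gain yields a noticeably smaller end-to-end improvement on the device-like profile, which is why both stages have to be addressed.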
Rewriting Prompts to Remove the Prefill Bottleneck
To address the prefill cost, the team applied prefix caching (pre‑computing the KV cache for static prompt prefixes). They built a component called PromptWeaver that consolidates all tool‑description text into a single long static prefix and moves the dynamic part of the prompt to the end. Co‑activation analysis of tool usage revealed eight high‑frequency tool clusters; each cluster was given a fixed order and its KV entries were cached on the device’s SSD. The cached static prefix occupies 6.26 GB and covers 74.4% of daily tool‑combination scenarios, reducing uncached dynamic tokens by 88.9% and removing most of the prefill work.
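The prompt-reordering idea can be sketched as follows. This is a minimal illustration, not PromptWeaver itself: the tool descriptions, `build_prompt`, `PrefixCache`, and the `compute_kv` callback are all hypothetical, an in-memory dictionary stands in for the SSD-backed KV store, and alphabetical sorting stands in for whatever fixed ordering the authors chose per cluster.

```python
import hashlib

# Hypothetical tool descriptions; the real system's texts are not shown in the source.
TOOL_DESCRIPTIONS = {
    "calendar": "create_calendar_event(title, start, end): add an event.",
    "email": "send_email(to, subject, body): send a message.",
    "notes": "create_note(title, body): save a note.",
}


def build_prompt(tool_names, user_query, tool_descriptions=TOOL_DESCRIPTIONS):
    """Put static tool text first in a canonical order, dynamic text last."""
    ordered = sorted(set(tool_names))  # fixed order makes the prefix reusable
    static_prefix = "\n".join(tool_descriptions[name] for name in ordered) + "\n"
    dynamic_suffix = f"User request: {user_query}\n"
    return static_prefix, dynamic_suffix


class PrefixCache:
    """In-memory stand-in for a store of precomputed KV entries."""

    def __init__(self):
        self._store = {}

    def get_or_compute(self, static_prefix, compute_kv):
        key = hashlib.sha256(static_prefix.encode("utf-8")).hexdigest()
        hit = key in self._store
        if not hit:
            # compute_kv is a placeholder for running prefill over the prefix.
            self._store[key] = compute_kv(static_prefix)
        return self._store[key], hit


def fake_prefill(text):
    # Placeholder "KV state": just the whitespace token count.
    return {"num_tokens": len(text.split())}


cache = PrefixCache()
prefix_a, suffix_a = build_prompt(["email", "calendar"], "Book lunch with Sam")
prefix_b, suffix_b = build_prompt(["calendar", "email"], "Reply to the invoice thread")

_, first_hit = cache.get_or_compute(prefix_a, fake_prefill)
_, second_hit = cache.get_or_compute(prefix_b, fake_prefill)
print(first_hit, second_hit)
```

Because both requests select the same tool cluster, they share one static prefix regardless of the order in which the tools were requested, and only the short dynamic suffix would still need prefill on the second request.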
Bypassing the Multi‑Token Tax with Selective Decoding
Speculative decoding, common in cloud serving, suffers from a “multi‑token tax” on‑device because batch verification of many tokens is slower than sequential generation. Experiments with draft models of various sizes showed that tiny models (< 100 M parameters) are inaccurate (≈2 % correct), while larger draft models (≈1 B parameters) are too slow, so neither yields a net speedup.
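A rough cost model helps show why neither draft size pays off. The sketch below uses the standard expected-tokens formula for speculative decoding under the simplifying assumption that each drafted token is accepted independently with the same probability. All costs are expressed relative to one ordinary decode step of the target model, and every number in the example calls (draft length, relative costs, and treating the ≈2 % figure as an acceptance rate) is a hypothetical value, not a measurement from the paper.

```python
def expected_tokens_per_step(accept_rate, draft_len):
    """Expected tokens emitted per verify step, assuming i.i.d. acceptance."""
    if accept_rate >= 1.0:
        return draft_len + 1
    return (1 - accept_rate ** (draft_len + 1)) / (1 - accept_rate)


def speculative_speedup(accept_rate, draft_len, draft_cost, verify_cost):
    """Speedup over plain decoding.

    draft_cost:  cost of one draft-model token, relative to one target step.
    verify_cost: cost of verifying the whole draft in one target pass,
                 relative to one target step (above 1.0 models the
                 "multi-token tax" described in the article).
    """
    step_cost = draft_len * draft_cost + verify_cost
    return expected_tokens_per_step(accept_rate, draft_len) / step_cost


# Hypothetical tiny draft: cheap to run, almost never accepted.
tiny = speculative_speedup(accept_rate=0.02, draft_len=4, draft_cost=0.05, verify_cost=1.5)

# Hypothetical large draft: often accepted, but expensive to run.
large = speculative_speedup(accept_rate=0.8, draft_len=4, draft_cost=0.5, verify_cost=1.5)

print(tiny, large)
```

With these assumed values both configurations come out below 1.0, meaning slower than ordinary decoding: the tiny draft wastes the verification pass, and the large draft spends too much time drafting.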
Observing that decoded actions often copy template examples from the prompt, the team replaced draft models with a lightweight component called ExSpec. ExSpec builds a tiny n‑gram lookup table from the prompt stream and uses it to produce candidate tokens instantly. When the table has no match, the system falls back to regular token‑by‑token generation. This “selective decoding” strategy avoids unnecessary draft inference and eliminates the multi‑token tax.
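The lookup-table idea can be sketched in a few lines. This is an illustration of n‑gram prompt lookup in general, not ExSpec's actual implementation: the function names, the bigram key size, the continuation length, and the whitespace tokenization are all assumptions made for the example.

```python
def build_ngram_table(prompt_tokens, n=2, max_continuation=4):
    """Map each n-gram in the prompt to the tokens that followed it."""
    table = {}
    for i in range(len(prompt_tokens) - n):
        key = tuple(prompt_tokens[i:i + n])
        # If an n-gram repeats, the later occurrence overwrites the earlier one.
        table[key] = prompt_tokens[i + n:i + n + max_continuation]
    return table


def propose(table, generated_tokens, n=2):
    """Return candidate tokens, or an empty list to signal fallback."""
    if len(generated_tokens) < n:
        return []
    return list(table.get(tuple(generated_tokens[-n:]), []))


# Toy prompt containing a template example the model is likely to copy.
prompt = "Example : create_event ( title , start ) ; User : book lunch".split()
table = build_ngram_table(prompt)

# The last two generated tokens match the template: candidates come for free.
candidates = propose(table, ["create_event", "("])

# No match in the table: an empty list means decode one token at a time.
fallback = propose(table, ["send_email", "("])

print(candidates, fallback)
```

Candidates produced this way cost a dictionary lookup rather than a draft-model forward pass, and they still have to be checked by the target model. An empty result is the signal to skip speculation for that step.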
Integration and Results
Agent‑X combines PromptWeaver and ExSpec into a pure‑software stack that runs on Apple M4 Pro devices. Benchmarks on high‑intensity agent tasks (calendar planning, multi‑step email replies) show a 1.61× end‑to‑end speedup, with prefill accelerated by 1.73× and decode by 1.73×, while task accuracy remains unchanged to within 0.001 %.
The framework requires no hardware upgrades or additional accelerators, demonstrating that software‑only optimizations can unlock substantial performance gains for privacy‑preserving on‑device AI.
Reference: arXiv:2605.10380; accepted at MobiSys 2026.
SuanNi
A community for AI developers that aggregates large-model development services, models, and compute power.
