How Huawei’s MindScale Cuts Agent Token Usage 5.7× and Automates Prompt & Workflow Design
The article outlines four major obstacles hindering industry‑specific LLM agents (manual workflow maintenance, poor knowledge reuse, training‑inference inefficiency, and complex reasoning evaluation) and explains how Huawei Noah's MindScale package tackles each: self‑evolving workflows, automated prompt optimization, a TrimR module that cuts inference latency by up to 70%, and a novel KV‑Embedding technique that reduces generated tokens by 5.7× by reusing the KV‑Cache as a "thinking memory."
Four Core Challenges for Industry Agents
Huawei Noah's Ark Lab identifies four key barriers to deploying domain‑specific agents:

1. Manual workflow maintenance: experts must hand‑translate business rules into executable workflows.
2. Poor reuse of historical knowledge: inference paths and feedback are not leveraged for self‑evolution.
3. Training‑inference efficiency bottlenecks: massive model deployments and elongated reasoning chains inflate cost and latency.
4. Hard‑to‑evaluate multi‑step, multi‑tool reasoning: single‑metric scores fail to reflect true performance.
Self‑Evolving Workflow and Prompt Automation
To address workflow maintenance, MindScale introduces EvoFabric, an algorithm that automatically evolves agent workflows. Combined with SOP2Workflow, it transforms natural‑language SOP documents and historical tool libraries into executable workflows without expert hand‑coding.
The underlying graph engine supports mixed node types (agents, tools, memory), state rewriting, and DSL import/export, enabling complex intelligent processes to be copied, migrated, and deployed rapidly.
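The article does not publish the graph engine's API; the following is a minimal sketch, assuming hypothetical class names and a JSON‑based DSL, of how a mixed‑node workflow graph with DSL import/export might look:

```python
# Minimal sketch of a mixed-node workflow graph with a JSON DSL. Class names,
# node kinds, and the DSL schema are hypothetical, not MindScale's actual API.
import json
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                    # "agent" | "tool" | "memory"
    config: dict = field(default_factory=dict)

@dataclass
class WorkflowGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node

    def connect(self, src: str, dst: str) -> None:
        self.edges.append((src, dst))

    def to_dsl(self) -> str:
        """Serialize to a JSON DSL so a workflow can be copied or migrated."""
        return json.dumps({
            "nodes": [vars(n) for n in self.nodes.values()],
            "edges": [{"from": s, "to": d} for s, d in self.edges],
        }, indent=2)

    @classmethod
    def from_dsl(cls, dsl: str) -> "WorkflowGraph":
        """Rebuild a graph from its DSL, e.g. when deploying elsewhere."""
        spec, g = json.loads(dsl), cls()
        for n in spec["nodes"]:
            g.add_node(Node(n["name"], n["kind"], n.get("config", {})))
        for e in spec["edges"]:
            g.connect(e["from"], e["to"])
        return g

# A tiny workflow distilled from an SOP: classify -> look up -> remember.
g = WorkflowGraph()
g.add_node(Node("classifier", "agent", {"prompt": "Classify the ticket."}))
g.add_node(Node("kb_lookup", "tool", {"endpoint": "search_kb"}))
g.add_node(Node("history", "memory", {"window": 20}))
g.connect("classifier", "kb_lookup")
g.connect("kb_lookup", "history")
clone = WorkflowGraph.from_dsl(g.to_dsl())   # round-trip via the DSL
```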
Memory‑Based Evolution and Prompt Optimization
During multi‑round execution, a memory module records trajectory information and evaluation results, forming an experience‑driven context that improves agent performance over time.
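The memory module's interface is not detailed in the article; here is a minimal sketch, with all names hypothetical and a deliberately naive word‑overlap relevance score, of recording trajectories and retrieving past experience as context:

```python
# Minimal sketch of an experience memory: store (task, trajectory, score)
# tuples and surface the best past runs as extra context for a new task.
# Names are hypothetical; relevance is naive word overlap for illustration.
from dataclasses import dataclass

@dataclass
class Experience:
    task: str
    trajectory: list   # e.g. [(thought, action, observation), ...]
    score: float       # evaluation result for this run

class TrajectoryMemory:
    def __init__(self):
        self.experiences: list[Experience] = []

    def record(self, task: str, trajectory: list, score: float) -> None:
        self.experiences.append(Experience(task, trajectory, score))

    def retrieve(self, task: str, k: int = 3) -> list[Experience]:
        """Top-k past experiences, favoring relevant, high-scoring runs."""
        words = set(task.lower().split())
        def key(e: Experience) -> float:
            overlap = len(words & set(e.task.lower().split()))
            return overlap + e.score   # crude mix of relevance and quality
        return sorted(self.experiences, key=key, reverse=True)[:k]

memory = TrajectoryMemory()
memory.record("reset customer password",
              [("verify identity", "check_id", "ok"),
               ("reset", "reset_pw", "done")], score=1.0)
context = memory.retrieve("customer cannot log in, needs password reset")
```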
Prompt optimization builds on the previously released SCOPE algorithm, which refines prompts online between inference steps, yielding an accuracy gain of more than 20% on the HLE and GAIA reasoning benchmarks.
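SCOPE's internals are not described here; the sketch below shows the general shape of between‑step online prompt refinement, where a critique of the previous attempt is used to rewrite the prompt itself. The `llm` callable and both prompt templates are assumptions, not SCOPE's actual design:

```python
# Generic sketch of between-step online prompt refinement. `llm` is any
# text-in/text-out completion callable; the critique and rewrite templates
# are illustrative, not SCOPE's actual prompts.
from typing import Callable

def solve_with_online_refinement(task: str, llm: Callable[[str], str],
                                 steps: int = 4) -> str:
    prompt = f"Solve step by step:\n{task}"
    answer = ""
    for _ in range(steps):
        answer = llm(prompt)
        critique = llm(
            f"Task: {task}\nAttempt: {answer}\n"
            "List concrete flaws in this attempt, or reply PASS if it is sound."
        )
        if critique.strip() == "PASS":
            break
        # Refine the *prompt* (not just the answer) using the feedback,
        # so later steps start from a better instruction.
        prompt = llm(
            "Rewrite this prompt so the flaws below are avoided.\n"
            f"Prompt: {prompt}\nFlaws: {critique}\n"
            "Return only the improved prompt."
        )
    return answer
```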
The newly proposed C‑MOP optimizer introduces a bidirectional sample‑aware, temporal‑momentum gradient strategy that resolves conflicts between "text gradients" and closes the "feedback → evolution" loop for prompt improvement.
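C‑MOP's exact update rule is not given in the article; as a rough illustration of temporal momentum over text gradients, the sketch below weights recent natural‑language critiques with exponential decay so that persistent feedback outweighs one‑off, conflicting complaints. All names and the aggregation prompt are assumptions:

```python
# Rough illustration of temporal momentum over "text gradients": critiques
# from recent rounds are aggregated with exponentially decaying weights, so
# persistent feedback dominates one-off, conflicting complaints. Names and
# the aggregation prompt are assumptions, not C-MOP itself.
from collections import deque
from typing import Callable

class MomentumPromptOptimizer:
    def __init__(self, prompt: str, llm: Callable[[str], str],
                 window: int = 5, decay: float = 0.7):
        self.prompt = prompt
        self.llm = llm
        self.decay = decay
        self.history = deque(maxlen=window)  # most-recent-first text gradients

    def step(self, critique: str) -> str:
        self.history.appendleft(critique)
        # Newest gradient gets weight 1.0; older ones decay geometrically.
        weighted = [f"(weight {self.decay ** i:.2f}) {g}"
                    for i, g in enumerate(self.history)]
        self.prompt = self.llm(
            "Improve the prompt below. Prioritize high-weight feedback and "
            "discard low-weight feedback that contradicts it.\n"
            f"Prompt: {self.prompt}\nFeedback:\n" + "\n".join(weighted) +
            "\nReturn only the improved prompt."
        )
        return self.prompt
```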
Inference Efficiency and KV‑Embedding Cache
MindScale's TrimR module employs a lightweight, pre‑trained validator to detect and truncate useless intermediate reasoning, with no additional fine‑tuning of the main model. Benchmarks on MATH, AIME, GPQA, and several large reasoning models (LRMs) show up to a 70% reduction in inference latency under high‑concurrency workloads.
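The article does not specify how TrimR's validator is wired into decoding; the simplified sketch below shows the general idea of validator‑gated truncation, stopping generation once consecutive reasoning chunks stop adding value (`generate_chunk` and `validator_score` are assumed stand‑ins):

```python
# Simplified sketch of validator-gated truncation: a lightweight validator
# scores each intermediate reasoning chunk, and decoding stops early once
# several consecutive chunks contribute nothing. Both callables are assumed
# stand-ins; no fine-tuning of the main model is involved.
from typing import Callable

def truncated_reasoning(
    generate_chunk: Callable[[str], str],          # next reasoning chunk
    validator_score: Callable[[str, str], float],  # usefulness in [0, 1]
    question: str,
    max_chunks: int = 32,
    threshold: float = 0.2,
    patience: int = 2,
) -> str:
    """Stop once `patience` consecutive chunks score below `threshold`."""
    reasoning, useless_streak = "", 0
    for _ in range(max_chunks):
        chunk = generate_chunk(question + reasoning)
        if validator_score(question, chunk) < threshold:
            useless_streak += 1
            if useless_streak >= patience:
                break              # truncate: recent reasoning adds nothing
        else:
            useless_streak = 0
            reasoning += chunk     # keep only chunks the validator approves
    return reasoning
```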
Beyond traditional KV‑Cache acceleration, the package introduces KV‑Embeddings, treating the KV‑Cache as a free, lightweight representation (Chain‑of‑Embedding). This approach matches or surpasses the performance of dedicated embedding models while reducing the number of generated tokens by 5.7×, effectively turning the cache into a reusable "thinking memory."
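The exact pooling scheme is not described in the article; as a toy illustration with Hugging Face Transformers, the sketch below mean‑pools the last layer's key cache from a single forward pass as a "free" embedding, with zero extra tokens generated. The model choice and the pooling are assumptions, not the KV‑Embedding method itself:

```python
# Toy illustration: reuse the KV-Cache from a forward pass the agent performs
# anyway as a lightweight text embedding, instead of generating tokens or
# calling a separate embedding model. Pooling the last layer's keys is an
# illustrative choice, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def kv_embedding(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt")
    out = model(**inputs, use_cache=True)
    # past_key_values holds per-layer (key, value) states; key shape is
    # (batch, n_heads, seq_len, head_dim). This indexing works for both
    # tuple- and Cache-style returns in recent transformers versions.
    key = out.past_key_values[-1][0]
    return key.mean(dim=2).flatten(1).squeeze(0)  # pool over seq_len

a = kv_embedding("How do I reset my router?")
b = kv_embedding("Steps to restart a home router")
print(torch.nn.functional.cosine_similarity(a, b, dim=0))
```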
Hardware Adaptation and Open Resources
MindScale also provides Ascend‑compatible code, enabling developers to build high‑precision, high‑efficiency agents on domestic hardware. All algorithmic components, papers, and code are publicly available via the MindScale homepage (https://noah-mindscale.github.io/) and the Noah Lab site (https://www.noahlab.com.hk/#/home).