How Huawei’s MindScale Cuts Agent Token Usage 5.7× and Automates Prompt & Workflow Design
The article outlines four major obstacles hindering industry‑specific LLM agents (manual workflow maintenance, poor knowledge reuse, training‑inference inefficiency, and complex reasoning evaluation) and explains how Huawei Noah's MindScale package tackles each: self‑evolving workflows, automated prompt optimization, a TrimR module that cuts inference latency by up to 70%, and a novel KV‑Embedding technique that reduces generated tokens by 5.7× by reusing the KV‑Cache as a "thinking memory."
Four Core Challenges for Industry Agents
Huawei Noah's Ark Lab identifies four key barriers to deploying domain‑specific agents:

1. Manual workflow maintenance: experts must hand‑translate business rules into executable workflows.
2. Poor reuse of historical knowledge: inference paths and feedback are not leveraged for self‑evolution.
3. Training‑inference efficiency bottlenecks: massive model deployments and elongated reasoning chains inflate cost and latency.
4. Hard‑to‑evaluate multi‑step, multi‑tool reasoning: single‑metric scores fail to reflect true performance.
Self‑Evolving Workflow and Prompt Automation
To address workflow maintenance, MindScale introduces EvoFabric, an algorithm that automatically evolves agent workflows. Combined with SOP2Workflow, it transforms natural‑language SOP documents and historical tool libraries into executable workflows without expert hand‑coding.
The underlying graph engine supports mixed node types (agents, tools, memory), state rewriting, and DSL import/export, enabling complex intelligent processes to be copied, migrated, and deployed rapidly.
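The article does not publish the graph engine's API; the following is a minimal sketch, assuming hypothetical class names and a JSON‑based DSL, of how a mixed‑node workflow graph with DSL import/export might look:

```python
# Minimal sketch of a mixed-node workflow graph with a JSON DSL. Class names,
# node kinds, and the DSL schema are hypothetical, not MindScale's actual API.
import json
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    kind: str                    # "agent" | "tool" | "memory"
    config: dict = field(default_factory=dict)

@dataclass
class WorkflowGraph:
    nodes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        self.nodes[node.name] = node

    def connect(self, src: str, dst: str) -> None:
        self.edges.append((src, dst))

    def to_dsl(self) -> str:
        """Serialize to a JSON DSL so a workflow can be copied or migrated."""
        return json.dumps({
            "nodes": [vars(n) for n in self.nodes.values()],
            "edges": [{"from": s, "to": d} for s, d in self.edges],
        }, indent=2)

    @classmethod
    def from_dsl(cls, dsl: str) -> "WorkflowGraph":
        """Rebuild a graph from its DSL, e.g. when deploying elsewhere."""
        spec, g = json.loads(dsl), cls()
        for n in spec["nodes"]:
            g.add_node(Node(n["name"], n["kind"], n.get("config", {})))
        for e in spec["edges"]:
            g.connect(e["from"], e["to"])
        return g

# A tiny workflow distilled from an SOP: classify -> look up -> remember.
g = WorkflowGraph()
g.add_node(Node("classifier", "agent", {"prompt": "Classify the ticket."}))
g.add_node(Node("kb_lookup", "tool", {"endpoint": "search_kb"}))
g.add_node(Node("history", "memory", {"window": 20}))
g.connect("classifier", "kb_lookup")
g.connect("kb_lookup", "history")
clone = WorkflowGraph.from_dsl(g.to_dsl())   # round-trip via the DSL
```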
Memory‑Based Evolution and Prompt Optimization
During multi‑round execution, a memory module records trajectory information and evaluation results, forming an experience‑driven context that improves agent performance over time.
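The memory module's interface is not detailed in the article; here is a minimal sketch, with all names hypothetical and a deliberately naive word‑overlap relevance score, of recording trajectories and retrieving past experience as context:

```python
# Minimal sketch of an experience memory: store (task, trajectory, score)
# tuples and surface the best past runs as extra context for a new task.
# Names are hypothetical; relevance is naive word overlap for illustration.
from dataclasses import dataclass

@dataclass
class Experience:
    task: str
    trajectory: list   # e.g. [(thought, action, observation), ...]
    score: float       # evaluation result for this run

class TrajectoryMemory:
    def __init__(self):
        self.experiences: list[Experience] = []

    def record(self, task: str, trajectory: list, score: float) -> None:
        self.experiences.append(Experience(task, trajectory, score))

    def retrieve(self, task: str, k: int = 3) -> list[Experience]:
        """Top-k past experiences, favoring relevant, high-scoring runs."""
        words = set(task.lower().split())
        def key(e: Experience) -> float:
            overlap = len(words & set(e.task.lower().split()))
            return overlap + e.score   # crude mix of relevance and quality
        return sorted(self.experiences, key=key, reverse=True)[:k]

memory = TrajectoryMemory()
memory.record("reset customer password",
              [("verify identity", "check_id", "ok"),
               ("reset", "reset_pw", "done")], score=1.0)
context = memory.retrieve("customer cannot log in, needs password reset")
```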
Prompt optimization builds on the previously released SCOPE algorithm, which refines prompts online between inference steps, yielding an accuracy gain of more than 20% on the HLE and GAIA reasoning benchmarks.
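SCOPE's internals are not described here; the sketch below shows the general shape of between‑step online prompt refinement, where a critique of the previous attempt is used to rewrite the prompt itself. The `llm` callable and both prompt templates are assumptions, not SCOPE's actual design:

```python
# Generic sketch of between-step online prompt refinement. `llm` is any
# text-in/text-out completion callable; the critique and rewrite templates
# are illustrative, not SCOPE's actual prompts.
from typing import Callable

def solve_with_online_refinement(task: str, llm: Callable[[str], str],
                                 steps: int = 4) -> str:
    prompt = f"Solve step by step:\n{task}"
    answer = ""
    for _ in range(steps):
        answer = llm(prompt)
        critique = llm(
            f"Task: {task}\nAttempt: {answer}\n"
            "List concrete flaws in this attempt, or reply PASS if it is sound."
        )
        if critique.strip() == "PASS":
            break
        # Refine the *prompt* (not just the answer) using the feedback,
        # so later steps start from a better instruction.
        prompt = llm(
            "Rewrite this prompt so the flaws below are avoided.\n"
            f"Prompt: {prompt}\nFlaws: {critique}\n"
            "Return only the improved prompt."
        )
    return answer
```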
The newly proposed C‑MOP optimizer introduces a bidirectional sample‑aware, temporal‑momentum gradient strategy that resolves conflicts between "text gradients" and closes the "feedback → evolution" loop for prompt improvement.
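C‑MOP's exact update rule is not given in the article; as a rough illustration of temporal momentum over text gradients, the sketch below weights recent natural‑language critiques with exponential decay so that persistent feedback outweighs one‑off, conflicting complaints. All names and the aggregation prompt are assumptions:

```python
# Rough illustration of temporal momentum over "text gradients": critiques
# from recent rounds are aggregated with exponentially decaying weights, so
# persistent feedback dominates one-off, conflicting complaints. Names and
# the aggregation prompt are assumptions, not C-MOP itself.
from collections import deque
from typing import Callable

class MomentumPromptOptimizer:
    def __init__(self, prompt: str, llm: Callable[[str], str],
                 window: int = 5, decay: float = 0.7):
        self.prompt = prompt
        self.llm = llm
        self.decay = decay
        self.history = deque(maxlen=window)  # most-recent-first text gradients

    def step(self, critique: str) -> str:
        self.history.appendleft(critique)
        # Newest gradient gets weight 1.0; older ones decay geometrically.
        weighted = [f"(weight {self.decay ** i:.2f}) {g}"
                    for i, g in enumerate(self.history)]
        self.prompt = self.llm(
            "Improve the prompt below. Prioritize high-weight feedback and "
            "discard low-weight feedback that contradicts it.\n"
            f"Prompt: {self.prompt}\nFeedback:\n" + "\n".join(weighted) +
            "\nReturn only the improved prompt."
        )
        return self.prompt
```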
Inference Efficiency and KV‑Embedding Cache
MindScale's TrimR module employs a lightweight, pre‑trained validator to detect and truncate useless intermediate reasoning, with no additional fine‑tuning of the main model. Benchmarks on MATH, AIME, GPQA, and several large reasoning models (LRMs) show up to a 70% reduction in inference latency under high‑concurrency workloads.
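The article does not specify how TrimR's validator is wired into decoding; the simplified sketch below shows the general idea of validator‑gated truncation, stopping generation once consecutive reasoning chunks stop adding value (`generate_chunk` and `validator_score` are assumed stand‑ins):

```python
# Simplified sketch of validator-gated truncation: a lightweight validator
# scores each intermediate reasoning chunk, and decoding stops early once
# several consecutive chunks contribute nothing. Both callables are assumed
# stand-ins; no fine-tuning of the main model is involved.
from typing import Callable

def truncated_reasoning(
    generate_chunk: Callable[[str], str],          # next reasoning chunk
    validator_score: Callable[[str, str], float],  # usefulness in [0, 1]
    question: str,
    max_chunks: int = 32,
    threshold: float = 0.2,
    patience: int = 2,
) -> str:
    """Stop once `patience` consecutive chunks score below `threshold`."""
    reasoning, useless_streak = "", 0
    for _ in range(max_chunks):
        chunk = generate_chunk(question + reasoning)
        if validator_score(question, chunk) < threshold:
            useless_streak += 1
            if useless_streak >= patience:
                break              # truncate: recent reasoning adds nothing
        else:
            useless_streak = 0
            reasoning += chunk     # keep only chunks the validator approves
    return reasoning
```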
Beyond traditional KV‑Cache acceleration, the package introduces KV‑Embeddings, treating the KV‑Cache as a free, lightweight representation (Chain‑of‑Embedding). This approach matches or surpasses the performance of dedicated embedding models while reducing the number of generated tokens by 5.7×, effectively turning the cache into a reusable "thinking memory."
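The exact pooling scheme is not described in the article; as a toy illustration with Hugging Face Transformers, the sketch below mean‑pools the last layer's key cache from a single forward pass as a "free" embedding, with zero extra tokens generated. The model choice and the pooling are assumptions, not the KV‑Embedding method itself:

```python
# Toy illustration: reuse the KV-Cache from a forward pass the agent performs
# anyway as a lightweight text embedding, instead of generating tokens or
# calling a separate embedding model. Pooling the last layer's keys is an
# illustrative choice, not the paper's method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def kv_embedding(text: str) -> torch.Tensor:
    inputs = tok(text, return_tensors="pt")
    out = model(**inputs, use_cache=True)
    # past_key_values holds per-layer (key, value) states; key shape is
    # (batch, n_heads, seq_len, head_dim). This indexing works for both
    # tuple- and Cache-style returns in recent transformers versions.
    key = out.past_key_values[-1][0]
    return key.mean(dim=2).flatten(1).squeeze(0)  # pool over seq_len

a = kv_embedding("How do I reset my router?")
b = kv_embedding("Steps to restart a home router")
print(torch.nn.functional.cosine_similarity(a, b, dim=0))
```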
Hardware Adaptation and Open Resources
MindScale also provides Ascend‑compatible code, enabling developers to build high‑precision, high‑efficiency agents on domestic hardware. All algorithmic components, papers, and code are publicly available via the MindScale homepage (https://noah-mindscale.github.io/) and the Noah Lab site (https://www.noahlab.com.hk/#/home).