How to Build High‑Availability AI Agents: Challenges, Strategies, and Real‑World Insights

This article explores the evolving concept of AI agents, examines the debates over their definition, outlines four major deployment challenges (prompt instability, planning balance, domain knowledge integration, and response speed), and presents practical strategies such as prompt engineering, workflow design, multi‑agent architectures, and model optimization for building reliable, high‑availability agents.

Alibaba Cloud Developer

Jun 20, 2025

Background

For more than a year I have been working with agents, publishing several reflective articles ranging from early experiments integrating agent capabilities into Alibaba Cloud customer service bots to deep explorations of popular agent products like Manus.

Controversies Around the Agent Concept

Agents have become increasingly popular across industries, leading to diverse opinions on their definition. The main disputes focus on three aspects:

Intelligent Agent vs. Proxy: Is an agent required to be "intelligent" and driven by large language models (LLMs), or can any program that performs a delegated task be called an agent?

Autonomous Planning vs. Workflow: Must an agent be capable of self‑planning, or can a predefined workflow that orchestrates tasks also qualify as an agent?

Function Calls vs. Role‑Playing: Does an agent need to support function calls, or is a prompt‑driven persona sufficient?

These debates stem from the original English meaning of "agent" as a representative, but in Chinese the term has split into "智能体" (intelligent entity) and "代理" (proxy). Both interpretations coexist, reflecting the technology’s rapid evolution.

Broad vs. Narrow Definitions

Narrow definition: An ideal agent is an intelligent entity that understands natural language, autonomously plans, decomposes, executes, reflects, and decides without human intervention.

Broad definition: An agent is any automated task executor, regardless of whether it uses LLMs, smaller models, hard‑coded rules, or a combination of techniques.

Challenges in Deploying Agents

The deployment of agents faces four major challenges:

Unstable Runtime: Prompt engineering is difficult; prompts that are too short, too long, or self‑contradictory lead to inconsistent behavior.

Planning Balance: Pure LLM‑driven planning offers high intelligence but low controllability, while workflow‑based orchestration provides stability at the cost of flexibility.

Domain Knowledge Integration: General‑purpose models lack specialized domain knowledge, making it hard to handle industry‑specific terminology and processes.

Response Speed: Large models are slow, while smaller models may lack accuracy, creating a trade‑off between latency and accuracy.
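The first challenge can be made measurable before any fix is attempted. The sketch below is one possible approach, not the method used in the original work: it sends the same prompt to an agent several times and reports the share of runs that agree with the most frequent answer. The names `consistency_rate` and `make_scripted_agent` are hypothetical, and the scripted agent merely stands in for a real LLM call.

```python
import itertools
from collections import Counter
from typing import Callable, Sequence


def consistency_rate(agent: Callable[[str], str], prompt: str, runs: int = 10) -> float:
    """Return the fraction of runs that produce the most common answer."""
    if runs <= 0:
        raise ValueError("runs must be positive")
    # Normalize lightly so trivial whitespace/case differences do not count as drift.
    answers = [agent(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


def make_scripted_agent(answers: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM: replays a fixed list of answers in a loop."""
    replay = itertools.cycle(answers)

    def agent(prompt: str) -> str:
        return next(replay)

    return agent
```

A rate well below 1.0 on prompts that should have a single correct answer is a signal that the prompt, not the model, may need attention first.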

Mitigation Strategies

Prompt Optimization: Use well‑structured templates, AI‑assisted prompt generation, and iterative refinement to eliminate ambiguity, length issues, and conflicts.

Workflow Design: Apply workflows for standard, repeatable tasks (e.g., order‑finance checks) and reserve LLM autonomous planning for exploratory, complex scenarios (e.g., RDS anomaly diagnosis).

Multi‑Agent Architecture: Nest stable, controllable workflows inside an outer layer of LLM decision‑making to achieve "internal stability + external flexibility", or reverse the nesting (an outer workflow with LLM‑driven steps inside), depending on the use case.

Domain Data Integration: Dynamically inject domain priors into prompts, call external tools or knowledge bases, or fine‑tune domain‑specific models when necessary.

Speed Optimization: Convert non‑essential LLM steps to code or scripts, employ inference acceleration techniques (quantization, KV‑cache optimization, attention kernels such as FlashAttention, and serving engines such as vLLM), and use smaller models distilled from larger teachers for high‑frequency function calls.
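The split between fixed workflows and autonomous planning can be sketched as a small dispatcher. This is an illustrative sketch under stated assumptions, not the production design: the task names echo the examples above, while the step functions, the amounts they return, and `stub_planner` are hypothetical placeholders for real services and a real LLM planner.

```python
from typing import Callable, Dict, List

Context = Dict[str, object]
Step = Callable[[Context], Context]


def load_order(ctx: Context) -> Context:
    # Hypothetical step: a real system would query an order service here.
    return {**ctx, "order_amount": 120.0}


def load_invoice(ctx: Context) -> Context:
    # Hypothetical step: a real system would query a finance system here.
    return {**ctx, "invoice_amount": 120.0}


def compare_amounts(ctx: Context) -> Context:
    matched = ctx["order_amount"] == ctx["invoice_amount"]
    return {**ctx, "result": "match" if matched else "mismatch"}


# Standard, repeatable tasks are registered as fixed step sequences.
WORKFLOWS: Dict[str, List[Step]] = {
    "order_finance_check": [load_order, load_invoice, compare_amounts],
}


def run_task(task_type: str, ctx: Context,
             llm_planner: Callable[[str, Context], Context]) -> Context:
    """Run a fixed workflow when one exists; otherwise defer to the planner."""
    steps = WORKFLOWS.get(task_type)
    if steps is None:
        return {**llm_planner(task_type, ctx), "handled_by": "llm_planner"}
    for step in steps:
        ctx = step(ctx)
    return {**ctx, "handled_by": "workflow"}


def stub_planner(task_type: str, ctx: Context) -> Context:
    # Hypothetical placeholder for LLM-driven autonomous planning.
    return {**ctx, "result": f"plan required for {task_type}"}
```

Calling `run_task("order_finance_check", {}, stub_planner)` stays on the deterministic path, while an unregistered task such as `"rds_anomaly_diagnosis"` falls through to the planner. Keeping the registry explicit makes it easy to promote a task to a workflow once its steps have stabilized.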

Evolution of Workflow Approaches

Three workflow generations were explored:

Natural‑Language‑to‑DAG: Users describe processes in plain language, and the LLM generates a directed acyclic graph (DAG) and executes it step by step. This offers a low entry cost but suffers from slow execution and cannot jump between steps mid‑process.

Code/Model‑Hybrid + Rule‑Driven Execution: The generated DAG is enriched with code, scripts, or rules, which improves speed and stability but makes the workflow more complex to build.

Natural‑Language + LLM Planning: A loosely defined DAG guides the LLM, which responds to users directly and uses the graph as contextual reference, achieving higher flexibility and intelligence.
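For the second generation, a minimal sketch of rule‑driven DAG execution might look like the following. It assumes the DAG has already been generated and is stored as a mapping from each node to its prerequisites; the node names, thresholds, and metric values are invented for illustration. Python's standard `graphlib.TopologicalSorter` (Python 3.9+) supplies an execution order in which every prerequisite runs first.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, Set

# Each node maps to the set of nodes that must finish before it runs.
DAG: Dict[str, Set[str]] = {
    "collect_metrics": set(),
    "check_cpu": {"collect_metrics"},
    "check_connections": {"collect_metrics"},
    "summarize": {"check_cpu", "check_connections"},
}

# Each node is bound to plain code instead of an LLM call.
# The metric values and thresholds below are hypothetical.
HANDLERS: Dict[str, Callable[[dict], object]] = {
    "collect_metrics": lambda results: {"cpu": 0.93, "connections": 40},
    "check_cpu": lambda results: results["collect_metrics"]["cpu"] > 0.9,
    "check_connections": lambda results: results["collect_metrics"]["connections"] > 500,
    "summarize": lambda results: [
        name for name in ("check_cpu", "check_connections") if results[name]
    ],
}


def run_dag(dag: Dict[str, Set[str]],
            handlers: Dict[str, Callable[[dict], object]]) -> dict:
    """Execute every node once, in an order that respects its prerequisites."""
    results: dict = {}
    for node in TopologicalSorter(dag).static_order():
        results[node] = handlers[node](results)
    return results
```

Here `run_dag(DAG, HANDLERS)["summarize"]` lists the checks that fired. Because every node is ordinary code, execution is fast and repeatable, which is the trade this generation makes against the flexibility of the third.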

Path for Building and Continuously Tuning Agents

1. Prototype with Prompts: Build a minimal demo using prompt engineering, and iterate until its behavior is stable.

2. Introduce Structured Workflow: If instability persists, decompose the task into a workflow to constrain execution while preserving necessary flexibility.

3. Adopt Multi‑Agent Architecture: For scenarios requiring both control and autonomy, orchestrate multiple specialized agents.

4. Custom Model Training: When performance, speed, and reliability demands exceed what prompt and workflow tuning can provide, collect domain data, fine‑tune or train a dedicated model, and evaluate with metrics such as tool‑selection accuracy and action execution correctness.
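The two metrics named in step 4 can be computed from a labeled evaluation set. The sketch below rests on an assumption about their definitions, since the article does not spell them out: tool‑selection accuracy counts cases where the predicted tool name matches the expected one, and action correctness additionally requires the arguments to match exactly. The `ToolCall` type and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolCall:
    tool: str
    arguments: Dict[str, str] = field(default_factory=dict)


def _check_inputs(predicted: List[ToolCall], expected: List[ToolCall]) -> None:
    if not expected:
        raise ValueError("evaluation set is empty")
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")


def tool_selection_accuracy(predicted: List[ToolCall], expected: List[ToolCall]) -> float:
    """Share of cases where the right tool was chosen, ignoring arguments."""
    _check_inputs(predicted, expected)
    hits = sum(p.tool == e.tool for p, e in zip(predicted, expected))
    return hits / len(expected)


def action_correctness(predicted: List[ToolCall], expected: List[ToolCall]) -> float:
    """Share of cases where both the tool and its arguments match exactly."""
    _check_inputs(predicted, expected)
    # Dataclass equality compares the tool name and the arguments together.
    hits = sum(p == e for p, e in zip(predicted, expected))
    return hits / len(expected)
```

Tracking the two numbers separately helps locate the failure: a gap between them suggests the model picks the right tool but fills in its parameters incorrectly.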

Conclusion and Outlook

Over more than a year of building agents in Alibaba Cloud services, we have experimented with many approaches, encountered numerous pitfalls, and distilled a set of practical methodologies. While each business scenario may require a different balance of intelligence, controllability, and cost, the presented frameworks and optimization techniques aim to help practitioners accelerate agent adoption and achieve higher reliability.

"Agenticness is a spectrum; systems can exhibit varying degrees of autonomy, from almost none to highly autonomous, and all are valid as long as they meet the required level of agency." – Andrew Ng

References

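Andrew Ng, "Agent进入工程时代！吴恩达详解 AI Agent 构建全流程，核心不在模型，而是任务拆解与评估机制" (in Chinese; roughly, "Agents Enter the Engineering Era: Andrew Ng Explains the Full Process of Building AI Agents, Where the Core Is Not the Model but Task Decomposition and Evaluation")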

Anthropic, "Building Effective Agents", https://www.anthropic.com/engineering/building-effective-agents

LangChain, "Workflows and Agents", https://langchain-ai.github.io/langgraph/tutorials/workflows/

LangChain Blog, "How To Think About Agent Frameworks", https://blog.langchain.dev/how-to-think-about-agent-frameworks/
