Why LLMs Behave Unpredictably: From Uncertainty to Practical Agent Design
This article analyzes the sources of LLM output uncertainty, explores hardware and architectural constraints, demonstrates how to build robust AI agents with prompt engineering, tool orchestration, and memory management, and compares traditional micro‑service design with modern LLM‑centric workflows.
1. Introduction
The author reflects on personal doubts about AI, recalling early skepticism that AI was a bubble and noting a gap between corporate hype (e.g., the NVIDIA stock surge) and actual investment in pre-training large models.
2. Sources of LLM Uncertainty
Even when temperature is set to 0 and top‑k to 1, fixed inputs can produce different outputs because of several factors:
Floating‑point precision limits (FP16, BF16) cause small numerical differences that amplify during repeated matrix multiplications and softmax.
Hardware and serving heterogeneity (different GPU models, varying batch sizes) leads to slight variations in how computations are executed.
Model‑level optimizations such as MoE routing and load‑balancing introduce stochastic behavior.
These effects are inherent features of current LLMs and cannot be fully eliminated; the uncertainty should be treated as a feature to be managed rather than removed. The sketch below makes the floating-point effect concrete.
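A minimal sketch of the floating-point point (NumPy only, no model involved; the array size is arbitrary). Reduced-precision addition is not associative, so accumulating the same values in a different order gives a slightly different result; compounded across thousands of matrix multiplications, this is one reason identical prompts can diverge:

```python
import numpy as np

# Float16 addition is not associative: summing the same values in a
# different order produces a (slightly) different result.
rng = np.random.default_rng(0)
values = rng.standard_normal(10_000).astype(np.float16)

forward = np.float16(0.0)
for v in values:           # one accumulation order
    forward += v

backward = np.float16(0.0)
for v in values[::-1]:     # the reverse order
    backward += v

print(forward, backward, forward == backward)  # typically not equal
```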
3. AI Development as Engineering
AI development is framed as assembling prompts, RAG, and tool calls rather than writing deterministic code. The process includes:
Prompt Engineering: Crafting system and user prompts to guide the model.
Context Management: Compressing, summarizing, or discarding old tokens to stay within token limits.
Output Parsing: Enforcing JSON output, validating with regex, and applying fallback strategies (see the sketch below).
These steps form a “glue layer” that translates business requirements into LLM‑understandable context and back into structured results.
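The output-parsing step in particular benefits from defensive code. Here is a minimal sketch of that glue-layer idea (the function name and fallback policy are illustrative, not from the article): try strict JSON first, then fall back to extracting the first JSON-looking block with a regex, and only then fail so the caller can re-prompt.

```python
import json
import re

def parse_model_output(raw: str) -> dict:
    """Parse an LLM reply that should be JSON, with layered fallbacks."""
    # 1. Optimistic path: the whole reply is valid JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. Fallback: pull the first {...} block out of surrounding prose
    #    (models often wrap JSON in explanations or code fences).
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # 3. Last resort: signal failure so the caller can re-prompt.
    raise ValueError("model output did not contain parseable JSON")

# Usage: tolerate a chatty reply around the structured payload.
print(parse_model_output('Sure! Here you go:\n{"port": 80, "open": true}'))
```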
4. A Minimal Agent Demo
A simple penetration‑testing agent is built without any framework. The demo shows:
```
nmap -p 80 -sV -Pn 127.0.0.1
nuclei -u http://127.0.0.1 -silent -nc -jsonl -tags tomcat,php,wordpress,jenkins,spring
```

The agent loops through a ReAct cycle (Think → Act → Observe) and updates short-term memory, summarizing it when it grows too large.
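A framework-free loop of this kind fits in a few dozen lines of Python. Everything below is an illustrative reconstruction, not the article's original code: `call_llm` stands in for whatever chat-completion client is in use, and the summarization threshold is arbitrary.

```python
import subprocess

MAX_MEMORY_CHARS = 4000  # arbitrary threshold for this sketch

def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call (OpenAI, Qwen, etc.)."""
    raise NotImplementedError

def run_tool(command: str) -> str:
    """Act: execute a shell tool such as nmap or nuclei, capture its output."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=300)
    return result.stdout + result.stderr

def react_loop(task: str, max_steps: int = 10) -> str:
    memory = f"Task: {task}"
    for _ in range(max_steps):
        # Think: ask the model for the next action or a final answer.
        reply = call_llm(memory + "\nReply with ACTION: <command> or FINAL: <answer>")
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        command = reply[len("ACTION:"):].strip()
        # Act + Observe: run the tool and append what happened.
        memory += f"\nAction: {command}\nObservation: {run_tool(command)}"
        # Summarize short-term memory when it grows too large.
        if len(memory) > MAX_MEMORY_CHARS:
            memory = call_llm("Summarize, keeping key findings:\n" + memory)
    return "step budget exhausted"
```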
5. LangChain and Standardized Agent Architecture
LangChain provides a standardized stack: RAG → Prompt → LLM → Tool → Observation, mirroring classic control‑theory loops. Core concepts include:
ReAct: while + if logic.
Workflow: conditional branching (if/elif/else).
LangGraph: DAG-based orchestration.
LangChain abstracts message roles (system, user, assistant, tool) and offers built‑in memory, vector stores, and tool wrappers, reducing boilerplate for agent developers.
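As an illustration of the message-role abstraction, a sketch assuming a recent `langchain-core` (the tool name and contents are placeholders):

```python
from langchain_core.messages import (
    SystemMessage, HumanMessage, AIMessage, ToolMessage,
)

# The four roles LangChain standardizes; any chat model exposing
# .invoke(messages) can consume this list directly.
messages = [
    SystemMessage(content="You are a cautious pentest assistant."),
    HumanMessage(content="Scan 127.0.0.1 for web services."),
    AIMessage(content="", tool_calls=[{        # the model requests a tool
        "name": "nmap_scan",
        "args": {"target": "127.0.0.1"},
        "id": "call_1",
    }]),
    ToolMessage(content="80/tcp open http", tool_call_id="call_1"),
]
```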
6. Performance Perspective
Hardware constraints dominate LLM performance. Key points:
GPU compute (e.g., NVIDIA B300) provides 15 PFLOPS FP8, but data must travel through HBM, NVLink, and PCIe, each adding latency.
Memory bandwidth (HBM3E, 8 TB/s) determines how fast a 70B model's weights can be streamed into the compute cores (quantified in the sketch below).
PCIe is a bottleneck compared to NVLink, explaining why local inference on consumer hardware feels sluggish.
Scaling laws and MoE architectures shift computation across multiple GPUs, increasing inter‑GPU traffic.
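A back-of-the-envelope calculation shows why bandwidth, not FLOPS, usually caps single-stream decoding (assumed numbers: 70B parameters, FP16 weights, and the 8 TB/s figure above):

```python
# Each generated token requires streaming (roughly) all weights once
# through the memory system, so bandwidth sets an upper bound on speed.
params = 70e9              # 70B parameters
bytes_per_param = 2        # FP16
weight_bytes = params * bytes_per_param      # ~140 GB of weights
hbm_bandwidth = 8e12       # 8 TB/s (HBM3E figure from above)

seconds_per_token = weight_bytes / hbm_bandwidth
print(f"~{seconds_per_token * 1e3:.1f} ms/token "
      f"=> ~{1 / seconds_per_token:.0f} tokens/s upper bound")
# ~17.5 ms/token => ~57 tokens/s, before any compute or interconnect cost
```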
7. Historical Evolution of Neural Networks
From the 1957 perceptron to modern Transformers, the article traces key milestones:
Perceptron and MLP (1960s‑1970s) introduced linear models and the need for non‑linear activation.
Back‑propagation (1986) enabled training of deep networks.
CNNs (1998) leveraged spatial sparsity for image tasks.
AlexNet (2012) demonstrated GPU‑accelerated deep learning.
Transformers (2017) unified attention, feed‑forward, residual, and layer‑norm blocks, leading to decoder‑only LLMs.
The article then walks through a concrete inference flow of Qwen2.5‑7B, detailing tokenization, embedding, KV‑cache usage, multi‑head attention, FFN expansion, and final token generation.
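That flow can be reproduced end to end with Hugging Face `transformers`. A hedged sketch: the model ID is the public `Qwen/Qwen2.5-7B-Instruct` checkpoint, and `use_cache=True` enables the KV cache the article describes (it is also the default).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # public checkpoint, assumed available
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Tokenization -> embedding -> attention/FFN stack -> next-token sampling,
# with use_cache=True keeping per-layer K/V tensors so each new token
# attends to the cache instead of re-encoding the whole prefix.
inputs = tokenizer("Explain KV caching in one sentence.", return_tensors="pt")
inputs = inputs.to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```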
8. Practical Takeaways
When building AI agents, consider:
Defining clear boundaries for tool usage to avoid dangerous commands (a guard-rail sketch follows this list).
Implementing runtime evaluation and fallback mechanisms.
Monitoring latency and token cost to decide when to switch from LLM‑driven logic to deterministic code.
Using multi‑agent or ReAct patterns based on problem complexity.
Overall, the article argues that AI agents shift complexity from deterministic code to model‑driven inference, requiring new engineering disciplines around uncertainty, feedback, and hardware awareness.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.