Why LLMs Behave Unpredictably: From Uncertainty to Practical Agent Design
This article analyzes the sources of LLM output uncertainty, explores hardware and architectural constraints, demonstrates how to build robust AI agents with prompt engineering, tool orchestration, and memory management, and compares traditional micro‑service design with modern LLM‑centric workflows.
1. Introduction
The author reflects on personal doubts about AI, recalling early skepticism that AI was a bubble and noting a gap between corporate hype (e.g., the NVIDIA stock surge) and actual investment in pre-training large models.
2. Sources of LLM Uncertainty
Even when temperature is set to 0 and top‑k to 1, fixed inputs can produce different outputs because of several factors:
Floating‑point precision limits (FP16, BF16) cause small numerical differences that amplify during repeated matrix multiplications and softmax.
Hardware and serving heterogeneity (different GPU models, varying batch sizes) leads to slight variations in how computations are executed.
Model‑level optimizations such as MoE routing and load‑balancing introduce stochastic behavior.
These effects are inherent features of current LLMs and cannot be fully eliminated; the uncertainty should be treated as a feature to be managed rather than removed. The sketch below makes the floating-point effect concrete.
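A minimal sketch of the floating-point point (NumPy only, no model involved; the array size is arbitrary). Reduced-precision addition is not associative, so accumulating the same values in a different order gives a slightly different result; compounded across thousands of matrix multiplications, this is one reason identical prompts can diverge:

```python
import numpy as np

# Float16 addition is not associative: summing the same values in a
# different order produces a (slightly) different result.
rng = np.random.default_rng(0)
values = rng.standard_normal(10_000).astype(np.float16)

forward = np.float16(0.0)
for v in values:           # one accumulation order
    forward += v

backward = np.float16(0.0)
for v in values[::-1]:     # the reverse order
    backward += v

print(forward, backward, forward == backward)  # typically not equal
```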
3. AI Development as Engineering
AI development is framed as assembling prompts, RAG, and tool calls rather than writing deterministic code. The process includes:
Prompt Engineering: Crafting system and user prompts to guide the model.
Context Management: Compressing, summarizing, or discarding old tokens to stay within token limits.
Output Parsing: Enforcing JSON output, validating with regex, and applying fallback strategies (see the sketch below).
These steps form a “glue layer” that translates business requirements into LLM‑understandable context and back into structured results.
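The output-parsing step in particular benefits from defensive code. Here is a minimal sketch of that glue-layer idea (the function name and fallback policy are illustrative, not from the article): try strict JSON first, then fall back to extracting the first JSON-looking block with a regex, and only then fail so the caller can re-prompt.

```python
import json
import re

def parse_model_output(raw: str) -> dict:
    """Parse an LLM reply that should be JSON, with layered fallbacks."""
    # 1. Optimistic path: the whole reply is valid JSON.
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. Fallback: pull the first {...} block out of surrounding prose
    #    (models often wrap JSON in explanations or code fences).
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    # 3. Last resort: signal failure so the caller can re-prompt.
    raise ValueError("model output did not contain parseable JSON")

# Usage: tolerate a chatty reply around the structured payload.
print(parse_model_output('Sure! Here you go:\n{"port": 80, "open": true}'))
```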
4. A Minimal Agent Demo
A simple penetration‑testing agent is built without any framework. The demo shows:
```
nmap -p 80 -sV -Pn 127.0.0.1
nuclei -u http://127.0.0.1 -silent -nc -jsonl -tags tomcat,php,wordpress,jenkins,spring
```

The agent loops through a ReAct cycle (Think → Act → Observe) and updates short-term memory, summarizing it when it grows too large.
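A framework-free loop of this kind fits in a few dozen lines of Python. Everything below is an illustrative reconstruction, not the article's original code: `call_llm` stands in for whatever chat-completion client is in use, and the summarization threshold is arbitrary.

```python
import subprocess

MAX_MEMORY_CHARS = 4000  # arbitrary threshold for this sketch

def call_llm(prompt: str) -> str:
    """Placeholder for a real chat-completion call (OpenAI, Qwen, etc.)."""
    raise NotImplementedError

def run_tool(command: str) -> str:
    """Act: execute a shell tool such as nmap or nuclei, capture its output."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=300)
    return result.stdout + result.stderr

def react_loop(task: str, max_steps: int = 10) -> str:
    memory = f"Task: {task}"
    for _ in range(max_steps):
        # Think: ask the model for the next action or a final answer.
        reply = call_llm(memory + "\nReply with ACTION: <command> or FINAL: <answer>")
        if reply.startswith("FINAL:"):
            return reply[len("FINAL:"):].strip()
        command = reply[len("ACTION:"):].strip()
        # Act + Observe: run the tool and append what happened.
        memory += f"\nAction: {command}\nObservation: {run_tool(command)}"
        # Summarize short-term memory when it grows too large.
        if len(memory) > MAX_MEMORY_CHARS:
            memory = call_llm("Summarize, keeping key findings:\n" + memory)
    return "step budget exhausted"
```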
5. LangChain and Standardized Agent Architecture
LangChain provides a standardized stack: RAG → Prompt → LLM → Tool → Observation, mirroring classic control‑theory loops. Core concepts include:
ReAct: while + if logic.
Workflow: conditional branching (if/elif/else).
LangGraph: DAG-based orchestration.
LangChain abstracts message roles (system, user, assistant, tool) and offers built‑in memory, vector stores, and tool wrappers, reducing boilerplate for agent developers.
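As an illustration of the message-role abstraction, a sketch assuming a recent `langchain-core` (the tool name and contents are placeholders):

```python
from langchain_core.messages import (
    SystemMessage, HumanMessage, AIMessage, ToolMessage,
)

# The four roles LangChain standardizes; any chat model exposing
# .invoke(messages) can consume this list directly.
messages = [
    SystemMessage(content="You are a cautious pentest assistant."),
    HumanMessage(content="Scan 127.0.0.1 for web services."),
    AIMessage(content="", tool_calls=[{        # the model requests a tool
        "name": "nmap_scan",
        "args": {"target": "127.0.0.1"},
        "id": "call_1",
    }]),
    ToolMessage(content="80/tcp open http", tool_call_id="call_1"),
]
```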
6. Performance Perspective
Hardware constraints dominate LLM performance. Key points:
GPU compute (e.g., NVIDIA B300) provides 15 PFLOPS FP8, but data must travel through HBM, NVLink, and PCIe, each adding latency.
Memory bandwidth (HBM3E, 8 TB/s) determines how fast a 70B model's weights can be streamed into the compute cores (quantified in the sketch below).
PCIe is a bottleneck compared to NVLink, explaining why local inference on consumer hardware feels sluggish.
Scaling laws and MoE architectures shift computation across multiple GPUs, increasing inter‑GPU traffic.
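A back-of-the-envelope calculation shows why bandwidth, not FLOPS, usually caps single-stream decoding (assumed numbers: 70B parameters, FP16 weights, and the 8 TB/s figure above):

```python
# Each generated token requires streaming (roughly) all weights once
# through the memory system, so bandwidth sets an upper bound on speed.
params = 70e9              # 70B parameters
bytes_per_param = 2        # FP16
weight_bytes = params * bytes_per_param      # ~140 GB of weights
hbm_bandwidth = 8e12       # 8 TB/s (HBM3E figure from above)

seconds_per_token = weight_bytes / hbm_bandwidth
print(f"~{seconds_per_token * 1e3:.1f} ms/token "
      f"=> ~{1 / seconds_per_token:.0f} tokens/s upper bound")
# ~17.5 ms/token => ~57 tokens/s, before any compute or interconnect cost
```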
7. Historical Evolution of Neural Networks
From the 1957 perceptron to modern Transformers, the article traces key milestones:
Perceptron and MLP (1960s‑1970s) introduced linear models and the need for non‑linear activation.
Back‑propagation (1986) enabled training of deep networks.
CNNs (1998) leveraged spatial sparsity for image tasks.
AlexNet (2012) demonstrated GPU‑accelerated deep learning.
Transformers (2017) unified attention, feed‑forward, residual, and layer‑norm blocks, leading to decoder‑only LLMs.
The article then walks through a concrete inference flow of Qwen2.5‑7B, detailing tokenization, embedding, KV‑cache usage, multi‑head attention, FFN expansion, and final token generation.
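That flow can be reproduced end to end with Hugging Face `transformers`. A hedged sketch: the model ID is the public `Qwen/Qwen2.5-7B-Instruct` checkpoint, and `use_cache=True` enables the KV cache the article describes (it is also the default).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"  # public checkpoint, assumed available
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Tokenization -> embedding -> attention/FFN stack -> next-token sampling,
# with use_cache=True keeping per-layer K/V tensors so each new token
# attends to the cache instead of re-encoding the whole prefix.
inputs = tokenizer("Explain KV caching in one sentence.", return_tensors="pt")
inputs = inputs.to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, use_cache=True)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```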
8. Practical Takeaways
When building AI agents, consider:
Defining clear boundaries for tool usage to avoid dangerous commands (a guard-rail sketch follows this list).
Implementing runtime evaluation and fallback mechanisms.
Monitoring latency and token cost to decide when to switch from LLM‑driven logic to deterministic code.
Using multi‑agent or ReAct patterns based on problem complexity.
Overall, the article argues that AI agents shift complexity from deterministic code to model‑driven inference, requiring new engineering disciplines around uncertainty, feedback, and hardware awareness.
Alibaba Cloud Developer
Alibaba's official tech channel, featuring all of its technology innovations.