From Zero to Production: Building AI‑Native Infrastructure for Agents – Local Inference to Full‑Scale Deployment
The article walks through constructing AI‑native infrastructure for agents, covering local inference deployment with vLLM, setting up an AI gateway using LiteLLM, implementing observability with logs, metrics, and tracing, and applying cost‑saving strategies that reduced latency, improved stability, and cut expenses by up to 60%.
Many developers focus only on the logic of AI agents and ignore the underlying infrastructure, leading to slow inference, instability, high cost, and lack of monitoring when moving to production.
In the author’s first deployment, the agent crashed on day one; after half a month of refactoring the infrastructure, stability reached 99.9%, average latency dropped from 15 s to under 2 s, and cost fell by 60%.
1. Local Inference Service Deployment
For small teams, deploying an open‑source 7 B model locally is far cheaper than using third‑party APIs and keeps data private. The author recommends vLLM, which is 2–10× faster than the native Transformers library and supports dynamic and continuous batching.
Step 1: Environment Preparation
GPU with at least 16 GB VRAM (e.g., 3090/4090/A10), 16‑core CPU, 32 GB RAM
OS: Ubuntu 22.04
NVIDIA driver ≥ 525.xx
Step 2: Install vLLM
# Install dependencies
pip install vllm==0.4.0
# Install OpenAI‑compatible API dependency
pip install openai
Step 3: Start the Inference Service
The author uses the Qwen‑7B‑Chat model, which offers good Chinese performance with a small footprint.
python -m vllm.entrypoints.openai.api_server \
    --model qwen/Qwen-7B-Chat \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 8192 \
    --host 0.0.0.0 \
    --port 8000
After the server starts, it can be called with the same OpenAI‑compatible interface.
OpenAiChatModel chatModel = OpenAiChatModel.builder()
    .baseUrl("http://localhost:8000/v1")
    .apiKey("sk-xxxx")
    // Must match the --model value passed to vLLM exactly (case-sensitive)
    .modelName("qwen/Qwen-7B-Chat")
    .build();
Performance‑Optimization Tips
Quantization (e.g., --quantization awq, which requires AWQ‑quantized weights) cuts VRAM usage by roughly 50% with negligible quality loss, so a 7B model can run in 8 GB.
Enable chunked prefill (--enable-chunked-prefill); the author reports that this doubles throughput.
Dynamic batching automatically balances latency and throughput; the author measured more than 2,000 tokens/s on a single RTX 4090 with 50 concurrent requests and latency under 2 s.
2. AI Gateway
When multiple models are deployed, issues arise: differing API endpoints, no traffic control, no cost visibility, and no retry mechanism. An AI gateway solves these problems.
Recommended Solution: LiteLLM
Step 1: Install LiteLLM
pip install 'litellm[proxy]'
Step 2: Configuration (config.yaml)
model_list:
  - model_name: gpt-3.5-turbo
    litellm_params:
      model: openai/gpt-3.5-turbo
      api_key: "sk-xxx"
  - model_name: qwen-7b
    litellm_params:
      model: openai/qwen/Qwen-7B-Chat
      api_key: "sk-xxx"
      base_url: "http://localhost:8000/v1"
general_settings:
  master_key: "sk-your-gateway-key"
  rate_limit: "1000/hour"
  allow_anonymous_access: True
  tiktoken_cache_dir: "/tmp/tiktoken_cache"
Step 3: Start the Gateway
litellm --config config.yaml --port 8080
All model calls now go through the gateway address.
OpenAiChatModel chatModel = OpenAiChatModel.builder()
    .baseUrl("http://localhost:8080")
    .apiKey("sk-your-gateway-key")
    .modelName("qwen-7b")
    .build();
Core Capabilities
Unified OpenAI‑compatible API – switching models requires only config changes.
Traffic control – rate limiting, circuit breaking, automatic retries, and fallback to other models.
Cost monitoring – per‑model and per‑user usage statistics.
Load balancing – multiple instances of the same model are automatically balanced; the author observed availability rise from 95 % to 99.9 %.
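The retry and fallback behavior in the list above happens inside the gateway, but the idea can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: the class FallbackCaller, its constructor parameters, and the use of Function<String, String> as a stand-in for a model call are all hypothetical and are not part of LiteLLM.

```java
import java.util.List;
import java.util.function.Function;

/**
 * Illustrative retry-then-fallback logic, similar in spirit to what a gateway
 * does on the server side. All names are hypothetical; this is not LiteLLM code.
 */
class FallbackCaller {
    private final List<Function<String, String>> backends; // ordered by preference
    private final int maxAttemptsPerBackend;

    FallbackCaller(List<Function<String, String>> backends, int maxAttemptsPerBackend) {
        this.backends = backends;
        this.maxAttemptsPerBackend = maxAttemptsPerBackend;
    }

    /** Tries each backend in order, retrying it a few times before moving on. */
    String call(String prompt) {
        RuntimeException lastError = null;
        for (Function<String, String> backend : backends) {
            for (int attempt = 1; attempt <= maxAttemptsPerBackend; attempt++) {
                try {
                    return backend.apply(prompt);
                } catch (RuntimeException e) {
                    lastError = e; // remember the failure, then retry or fall through
                }
            }
        }
        throw new IllegalStateException("all backends failed", lastError);
    }
}
```

With two backends and two attempts each, a primary that always fails is called twice before the request falls through to the secondary. A production setup would normally leave this to the gateway rather than duplicate it in the client.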
3. Observability
AI systems are harder to debug than typical services, so the author uses a three‑layer observability stack: logs, metrics, and tracing.
1. Logging
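import java.util.Arrays;
import org.aspectj.lang.ProceedingJoinPoint;
import org.aspectj.lang.annotation.Around;
import org.aspectj.lang.annotation.Aspect;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Aspect
@Component
public class LlmCallLogAspect {
    private static final Logger log = LoggerFactory.getLogger(LlmCallLogAspect.class);

    // LangChain4j 0.x exposes ChatLanguageModel.generate(..); in 1.x the interface
    // is ChatModel and the method is chat(..), so adjust the pointcut to your version.
    @Around("execution(* dev.langchain4j.model.chat.ChatLanguageModel.generate(..))")
    public Object logLlmCall(ProceedingJoinPoint joinPoint) throws Throwable {
        // generate(..) has several overloads, so log the arguments generically
        // instead of casting args[0] to List<ChatMessage>.
        String request = Arrays.deepToString(joinPoint.getArgs());
        long startTime = System.currentTimeMillis();
        try {
            Object result = joinPoint.proceed();
            log.info("LLM call succeeded, request: {}, response: {}, latency: {}ms",
                    request, result, System.currentTimeMillis() - startTime);
            return result;
        } catch (Exception e) {
            log.error("LLM call failed, request: {}, error: {}, latency: {}ms",
                    request, e.getMessage(), System.currentTimeMillis() - startTime);
            throw e;
        }
    }
}
2. Metrics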
Key metrics (model call count, success rate, average latency, P95 latency, tool call count, agent task completion rate, per‑user token consumption) are collected with Prometheus and visualized in Grafana, with alert rules on thresholds (e.g., success rate < 99 % or latency > 5 s).
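To make the alert rule concrete, here is a minimal in-memory sketch of the arithmetic behind success rate and P95 latency. It is not how Prometheus computes quantiles (a real setup would export counters and histograms and evaluate the rule there). The class name LlmCallStats is hypothetical, and applying the 5 s threshold to P95 rather than to average latency is an assumption.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical in-memory stats holder; illustrates the math, not a metrics backend. */
class LlmCallStats {
    private final List<Long> latenciesMs = new ArrayList<>();
    private int successes = 0;

    void record(long latencyMs, boolean success) {
        latenciesMs.add(latencyMs);
        if (success) {
            successes++;
        }
    }

    double successRate() {
        return latenciesMs.isEmpty() ? 1.0 : (double) successes / latenciesMs.size();
    }

    /** Nearest-rank P95: the smallest sample with at least 95% of samples at or below it. */
    long p95LatencyMs() {
        if (latenciesMs.isEmpty()) {
            return 0;
        }
        List<Long> sorted = new ArrayList<>(latenciesMs);
        Collections.sort(sorted);
        int rank = (95 * sorted.size() + 99) / 100; // integer ceil(0.95 * n), 1-based
        return sorted.get(rank - 1);
    }

    /** Mirrors the example thresholds: success rate below 99% or P95 above 5 s (assumed). */
    boolean shouldAlert() {
        return successRate() < 0.99 || p95LatencyMs() > 5_000;
    }
}
```

Using P95 instead of the mean keeps a handful of very slow calls from hiding behind many fast ones, which is usually what an on-call alert should catch.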
3. Tracing
Each agent task receives a TraceId; the full execution chain (user query, tool calls, model I/O, intermediate results, final answer) can be reconstructed, allowing issues to be pinpointed within minutes.
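The TraceId idea described above can be sketched as a small recorder that tags every step of one agent task with the same identifier. This is an illustrative sketch only: AgentTrace, its methods, and the step names are hypothetical, and a real system would more likely rely on a tracing library or a logging context than on a hand-rolled class.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical per-task trace recorder; every step shares one TraceId. */
class AgentTrace {
    /** One recorded step in the execution chain. */
    private static final class Step {
        final String kind;
        final String detail;

        Step(String kind, String detail) {
            this.kind = kind;
            this.detail = detail;
        }
    }

    private final String traceId = UUID.randomUUID().toString();
    private final List<Step> steps = new ArrayList<>();

    String traceId() {
        return traceId;
    }

    /** Appends a step such as "user_query", "tool_call", or "model_output". */
    void record(String kind, String detail) {
        steps.add(new Step(kind, detail));
    }

    /** Renders the chain in recording order, one line per step, prefixed with the TraceId. */
    String render() {
        StringBuilder sb = new StringBuilder();
        for (Step step : steps) {
            sb.append('[').append(traceId).append("] ")
              .append(step.kind).append(": ").append(step.detail).append('\n');
        }
        return sb.toString();
    }
}
```

Because every line carries the same identifier, searching the logs for one TraceId returns the whole chain for that task in order.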
4. Cost Optimization (‑60 %)
Routing Strategy
Simple queries (≈80 %) are handled by a local 7 B model, while complex queries (≈20 %) are routed to GPT‑4, halving overall cost.
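The article does not say how queries are classified as simple or complex, so the following is only one possible sketch. The length threshold, the keyword list, and the "gpt-4" gateway alias are all assumptions made for illustration; "qwen-7b" is the alias from the gateway configuration earlier.

```java
import java.util.Locale;

/** Hypothetical rule-based router; the heuristic is an assumption, not the author's method. */
class ModelRouter {
    static final String LOCAL_MODEL = "qwen-7b"; // alias from the gateway config
    static final String REMOTE_MODEL = "gpt-4";  // assumed alias, not in the original config

    private static final int MAX_SIMPLE_LENGTH = 200; // characters, arbitrary cutoff
    private static final String[] COMPLEX_HINTS = {"step by step", "analyze", "compare", "prove"};

    /** Returns the model alias to pass as modelName when calling the gateway. */
    static String chooseModel(String query) {
        String normalized = query.toLowerCase(Locale.ROOT);
        if (normalized.length() > MAX_SIMPLE_LENGTH) {
            return REMOTE_MODEL;
        }
        for (String hint : COMPLEX_HINTS) {
            if (normalized.contains(hint)) {
                return REMOTE_MODEL;
            }
        }
        return LOCAL_MODEL;
    }
}
```

A keyword heuristic like this is cheap but crude. Its misroutes are worth logging so the rules can be tuned against real traffic.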
Cache
Repeated questions are cached (e.g., in Redis) and served without invoking the model, saving another ~20 %.
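import java.util.concurrent.TimeUnit;
import org.apache.commons.codec.digest.DigestUtils;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.redis.core.RedisTemplate;
import org.springframework.stereotype.Component;

@Component
public class AnswerCache {
    @Autowired
    private RedisTemplate<String, String> redisTemplate;

    // Returns the cached answer, or null on a cache miss.
    public String getCachedAnswer(String question) {
        String key = "answer:" + DigestUtils.md5Hex(question);
        return redisTemplate.opsForValue().get(key);
    }

    // Stores the answer under a hash of the question for 24 hours.
    public void cacheAnswer(String question, String answer) {
        String key = "answer:" + DigestUtils.md5Hex(question);
        redisTemplate.opsForValue().set(key, answer, 24, TimeUnit.HOURS);
    }
}
Token Optimization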
Trim context to only necessary information.
Use short prompts.
Limit the model’s maximum output length.
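As one way to trim context, the sketch below keeps only the most recent messages that fit a token budget. It rests on loud assumptions: ContextTrimmer is a hypothetical helper, messages are plain strings, and tokens are estimated at about four characters each instead of using the model's real tokenizer.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical helper that drops the oldest messages once a token budget is exceeded. */
class ContextTrimmer {
    /** Rough estimate only: about 4 characters per token is an assumption, not a tokenizer. */
    static int estimateTokens(String text) {
        return (text.length() + 3) / 4;
    }

    /** Keeps the most recent messages whose estimated tokens fit within the budget. */
    static List<String> trim(List<String> messages, int tokenBudget) {
        Deque<String> kept = new ArrayDeque<>();
        int used = 0;
        // Walk backwards from the newest message and stop at the first one that does not fit.
        for (int i = messages.size() - 1; i >= 0; i--) {
            int cost = estimateTokens(messages.get(i));
            if (used + cost > tokenBudget) {
                break;
            }
            kept.addFirst(messages.get(i));
            used += cost;
        }
        return new ArrayList<>(kept);
    }
}
```

Dropping the oldest turns first is the simplest policy. Pinned system prompts or summaries of older turns would need extra handling that this sketch leaves out.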
Combined, these measures reduced monthly spending from several thousand yuan to around one thousand while improving performance.
5. Production‑Grade Deployment Best Practices
Containerize all services with Docker for easy scaling and rollback.
Deploy across multiple availability zones to avoid single points of failure.
Use canary releases for new features before full rollout.
Regularly back up vector databases, logs, and monitoring data.
Conclusion
AI‑native infrastructure is the foundation for reliable agents; a stack of vLLM, LiteLLM, and comprehensive observability satisfies the needs of most small‑to‑medium teams with low cost and high stability. The next article will guide readers through building a usable agent in two hours.
Architect's Ambition
Observations, practice, and musings of an architect. Here we discuss technical implementations and career development; dissect complex systems and build cognitive frameworks. Ambitious yet grounded. Changing the world with code, connecting like‑minded readers with words.
