Java Engineer’s Complete Guide to Enterprise LLM Apps: LLM, Agent, RAG & Skill
This article walks Java engineers through building production‑grade enterprise AI assistants, explaining the roles of LLM, RAG, Agent and Skill, detailing a layered architecture, best‑practice code samples, deployment strategies, observability, security and cost‑control considerations.
Why Java engineers should read this article
Large language models (LLMs) are no longer experimental; they are entering core business workflows. For Java developers the real challenge is not merely calling an LLM API, but handling hallucinations, integrating enterprise knowledge bases, managing high‑concurrency traffic, and providing gray‑release, observability, scalability and rollback capabilities.
Unified view of LLM, RAG, Agent and Skill
In production systems these four components form a single chain:
LLM – understands, reasons and generates text.
RAG – injects real enterprise knowledge into the context to suppress hallucinations.
Agent – decides when to retrieve, when to call tools and when to finish.
Skill – turns tool capabilities into stable, autonomous, governable business units.
The article explains each component in depth, showing why they must be orchestrated together rather than treated as isolated modules.
Enterprise‑grade architecture (not a single controller → OpenAI)
A production‑ready architecture must address request entry, rate‑limiting, conversation state, intent routing, retrieval‑augmented generation, agent orchestration, skill execution, observability, caching, cost control and async back‑pressure. The recommended reference diagram includes:
Client / Web / App
API Gateway
Chat Service (orchestration entry)
Intent Router
RAG Pipeline
Agent Orchestrator
Embedding Service
Vector DB (Milvus / PgVector)
Rerank Service
Skill Registry
Order / Refund / IT Ticket Skills
Redis Session / Cache
Kafka Async Bus
Observability (Prometheus / Grafana / ELK)
LLM Gateway (OpenAI / Qwen / Private Model)
Chat Service responsibilities
Receive requests
Maintain context
Invoke intent router
Dispatch to RAG, Agent or simple QA
Aggregate results and return
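The dispatch step in the list above can be sketched as a plain routing table (framework wiring omitted; the intent names mirror the ones used later in this article, and the handler functions are illustrative assumptions, not the article's actual classes):

```java
import java.util.Map;
import java.util.function.Function;

public class ChatDispatcher {
    // Illustrative intent set, mirroring the intents used in this article.
    public enum Intent { SMALL_TALK, POLICY_RAG, FAQ, EMPLOYEE_QUERY, ACTION_REQUEST, COMPLEX }

    private final Map<Intent, Function<String, String>> handlers;

    public ChatDispatcher(Function<String, String> ragPipeline,
                          Function<String, String> agentOrchestrator,
                          Function<String, String> simpleQa) {
        // RAG-style intents go to the retrieval pipeline, action/complex
        // intents to the agent, everything else to plain QA.
        this.handlers = Map.of(
                Intent.POLICY_RAG, ragPipeline,
                Intent.FAQ, ragPipeline,
                Intent.ACTION_REQUEST, agentOrchestrator,
                Intent.COMPLEX, agentOrchestrator,
                Intent.SMALL_TALK, simpleQa,
                Intent.EMPLOYEE_QUERY, simpleQa);
    }

    public String dispatch(Intent intent, String message) {
        return handlers.getOrDefault(intent, m -> "unsupported").apply(message);
    }
}
```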
LLM Gateway abstraction
The gateway must unify model routing, timeout control, retry strategy, rate‑limiting, token accounting and request logging. Without it, switching models or moving from cloud to on‑premise inference becomes costly.
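To make the retry responsibility concrete, here is a minimal retry wrapper; it is a sketch of one gateway concern only (the attempt limit and the generic supplier are assumptions, and production code would layer on the Resilience4j annotations shown later rather than hand-roll this):

```java
import java.util.function.Supplier;

public class RetryingCaller {
    // Calls the supplier up to maxAttempts times, rethrowing the last failure.
    // A real LLM gateway would add timeouts, rate limiting, model routing
    // and token accounting around this loop.
    public static <T> T callWithRetry(Supplier<T> call, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.get();
            } catch (RuntimeException ex) {
                last = ex; // treat as transient and retry
            }
        }
        throw last;
    }
}
```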
Skill engineering principles
Clear input‑output contracts
Permission rules
Retry, circuit‑breaker and idempotency
Audit logging
SLA metrics
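Two of these principles, typed input/output contracts and idempotency, can be sketched in a few lines (the class and record names are illustrative assumptions; a production Skill would persist completed results instead of holding them in memory):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class IdempotentSkill {
    // Explicit input/output contract instead of free-form maps of strings.
    public record SkillRequest(String requestId, Map<String, String> args) {}
    public record SkillResult(boolean success, String payload) {}

    private final Function<SkillRequest, SkillResult> handler;
    // Completed results keyed by requestId; a real Skill would persist this.
    private final Map<String, SkillResult> completed = new ConcurrentHashMap<>();

    public IdempotentSkill(Function<SkillRequest, SkillResult> handler) {
        this.handler = handler;
    }

    // Replaying the same requestId returns the stored result instead of
    // re-executing the side effect (idempotency).
    public SkillResult execute(SkillRequest request) {
        return completed.computeIfAbsent(request.requestId(), id -> handler.apply(request));
    }
}
```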
Deep dive into RAG pitfalls and best practices
RAG failures are usually caused by poor retrieval chain design rather than the model itself. Key points:
Chunk size matters – too large mixes topics, too small loses semantic completeness.
Logical slicing (by headings, paragraphs, tables) before fixed‑length slicing improves relevance.
Typical chunk sizes: 500‑800 tokens for policy docs, 200‑400 for FAQs.
Retrieval should be multi‑stage:
Coarse recall (vector Top‑K)
Metadata filtering (tenant, department, time range)
Rerank with a dedicated model
Final clipping to control token budget.
Force RAG for policy‑type intents; let the Agent handle open‑ended queries.
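The multi-stage retrieval above can be sketched end to end; the filter, rerank function and chars-per-token estimate below are simplified stand-ins for the real vector search and rerank-model calls:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.function.ToDoubleFunction;

public class RetrievalPipeline {
    public record Chunk(String text, String department, double vectorScore) {}

    // Stage 1 (vector Top-K coarse recall) is assumed done upstream.
    // Stage 2: metadata filtering, stage 3: rerank, stage 4: token clipping.
    public static List<Chunk> retrieve(List<Chunk> candidates,
                                       String department,
                                       ToDoubleFunction<Chunk> reranker,
                                       int finalTopN,
                                       int tokenBudget) {
        List<Chunk> reranked = candidates.stream()
                .filter(c -> c.department().equals(department))          // metadata filter
                .sorted(Comparator.comparingDouble(reranker).reversed()) // rerank
                .limit(finalTopN)
                .toList();
        // Final clipping: keep chunks while the (rough) token estimate fits.
        int used = 0;
        ArrayList<Chunk> clipped = new ArrayList<>();
        for (Chunk c : reranked) {
            int tokens = c.text().length() / 4; // crude chars-per-token estimate
            if (used + tokens > tokenBudget) break;
            used += tokens;
            clipped.add(c);
        }
        return clipped;
    }
}
```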
Code walkthroughs
Maven dependencies (production‑grade)
<dependencies>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-webflux</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-validation</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-redis-reactive</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.kafka</groupId>
<artifactId>spring-kafka</artifactId>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-open-ai</artifactId>
<version>0.35.0</version>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j</artifactId>
<version>0.35.0</version>
</dependency>
<dependency>
<groupId>io.milvus</groupId>
<artifactId>milvus-sdk-java</artifactId>
<version>2.5.8</version>
</dependency>
<dependency>
<groupId>io.github.resilience4j</groupId>
<artifactId>resilience4j-spring-boot3</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
<dependency>
<groupId>com.github.ben-manes.caffeine</groupId>
<artifactId>caffeine</artifactId>
</dependency>
</dependencies>
Configuration file (YAML‑style excerpt)
server:
port: 8081
spring:
application:
name: smart-ai-assistant
data:
redis:
host: redis
port: 6379
kafka:
bootstrap-servers: kafka:9092
management:
endpoints:
web:
exposure:
include: health,info,metrics,prometheus
ai:
llm:
provider: openai
chat-model: gpt-4.1-mini
embedding-model: text-embedding-3-small
timeout-seconds: 20
max-retries: 2
temperature: 0.0
rag:
top-k: 8
final-top-n: 3
min-score: 0.72
chunk-size: 700
chunk-overlap: 120
agent:
max-iters: 5
force-rag-intents:
- POLICY_RAG
- FAQ_RAG
session:
history-size: 10
summary-threshold: 30
skill:
async-timeout-ms: 3000
resilience4j:
ratelimiter:
instances:
llmGateway:
limit-for-period: 120
limit-refresh-period: 1s
timeout-duration: 200ms
circuitbreaker:
instances:
llmGateway:
failure-rate-threshold: 50
sliding-window-size: 20
Intent routing service (Java)
package com.example.aiqa.domain.service;
import com.example.aiqa.domain.model.ChatCommand;
import com.example.aiqa.domain.model.IntentType;
import org.springframework.stereotype.Service;
@Service
public class IntentRouter {
public IntentType route(ChatCommand command) {
String message = command.message().trim().toLowerCase();
if (containsAny(message, "你好", "hello", "hi", "在吗")) {
return IntentType.SMALL_TALK;
}
if (containsAny(message, "年假", "报销", "制度", "流程", "政策", "规范", "激励")) {
return IntentType.POLICY_RAG;
}
if (containsAny(message, "余额", "工号", "员工", "部门", "档案")) {
return IntentType.EMPLOYEE_QUERY;
}
if (containsAny(message, "创建工单", "提工单", "申请", "审批", "报修", "vpn")) {
return IntentType.ACTION_REQUEST;
}
if (message.length() <= 8) {
return IntentType.FAQ;
}
return IntentType.COMPLEX;
}
private boolean containsAny(String message, String... keywords) {
for (String keyword : keywords) {
if (message.contains(keyword)) {
return true;
}
}
return false;
}
}
Milvus knowledge service (RAG core)
@Service
public class MilvusKnowledgeService {
private final QaProperties properties;
private final EmbeddingService embeddingService;
private volatile MilvusClientV2 client;
private volatile String unavailableReason;
public List<KnowledgeDocument> search(String question, String department, int topK) {
if (!isAvailable()) {
return List.of();
}
initializeCollection();
List<Float> queryVector = embeddingService.embed(question);
SearchReq searchReq = SearchReq.builder()
.databaseName(properties.milvus().dbName())
.collectionName(properties.milvus().collectionName())
.data(List.of(new FloatVec(queryVector)))
.topK(topK)
.annsField(properties.milvus().vectorField())
.metricType(IndexParam.MetricType.COSINE)
.filter(buildFilter(department))
.outputFields(List.of("*"))
.build();
SearchResp resp = client().search(searchReq);
List<KnowledgeDocument> result = new ArrayList<>();
for (List<SearchResp.SearchResult> group : resp.getSearchResults()) {
for (SearchResp.SearchResult item : group) {
Map<String, Object> entity = item.getEntity();
result.add(new KnowledgeDocument(
String.valueOf(entity.getOrDefault("documentId", "")),
String.valueOf(entity.getOrDefault("title", "")),
String.valueOf(entity.getOrDefault("content", "")),
String.valueOf(entity.getOrDefault("source", "")),
String.valueOf(entity.getOrDefault("department", "")),
null));
}
}
return result;
}
public String searchAsText(String question, String department, int topK) {
List<KnowledgeDocument> docs = search(question, department, topK);
if (docs.isEmpty()) {
return "未检索到知识。";
}
return docs.stream()
.map(doc -> "【" + doc.title() + "】" + doc.content() + "(来源:" + doc.source() + ")")
.collect(Collectors.joining("\n"));
}
}
Agent orchestration example (ReAct pattern)
@Service
public class AgentScopeQaService {
private final QaProperties properties;
private final IntentRouter intentRouter;
private final EnterpriseToolFunctions toolFunctions;
public AgentAnswer answer(ChatContext context) {
IntentType intent = intentRouter.route(context.question());
if (intent == IntentType.POLICY_RAG || intent == IntentType.FAQ_RAG) {
return answerWithForcedRag(context);
}
if (properties.dashscopeApiKey() == null || properties.dashscopeApiKey().isBlank()) {
return fallbackAnswer(intent, context);
}
DashScopeChatModel model = DashScopeChatModel.builder()
.apiKey(properties.dashscopeApiKey())
.modelName(properties.chatModel())
.formatter(new DashScopeChatFormatter())
.build();
Toolkit toolkit = new Toolkit();
toolkit.registerTool(new AgentTools(toolFunctions, context.department(), context.userId()));
ReActAgent agent = ReActAgent.builder()
.name("enterprise-qa-agent")
.sysPrompt(buildPrompt(intent, context))
.model(model)
.toolkit(toolkit)
.memory(new InMemoryMemory())
.maxIters(properties.agent().maxIters())
.build();
Msg msg = Msg.builder().name("user").role(MsgRole.USER).textContent(context.question()).build();
Msg response = agent.call(msg).block();
String answer = response == null ? fallbackAnswer(intent, context).answer() : response.getTextContent();
return new AgentAnswer(answer, "AGENTSCOPE_REACT", List.of(), ToolAuditContext.snapshot());
}
}
Conversation session store (file‑based prototype)
@Component
public class ConversationSessionStore {
private final QaProperties properties;
private final ObjectMapper objectMapper;
public List<String> history(String sessionId) {
try {
Path path = sessionFile(sessionId);
if (!Files.exists(path)) {
return new ArrayList<>();
}
return objectMapper.readValue(Files.readString(path), new TypeReference<List<String>>() {});
} catch (Exception ex) {
return new ArrayList<>();
}
}
public void append(String sessionId, String role, String message) {
try {
Path path = sessionFile(sessionId);
Files.createDirectories(path.getParent());
List<String> history = history(sessionId);
history.add(role + ": " + message);
int maxSize = properties.session().historySize() * 2;
if (history.size() > maxSize) {
history = new ArrayList<>(history.subList(history.size() - maxSize, history.size()));
}
Files.writeString(path, objectMapper.writerWithDefaultPrettyPrinter().writeValueAsString(history));
} catch (Exception ex) {
throw new IllegalStateException("Failed to persist session history", ex);
}
}
private Path sessionFile(String sessionId) {
// Assumed layout: one JSON file per session under a fixed base directory.
return Path.of("data", "sessions", sessionId + ".json");
}
}
WebFlux SSE streaming controller
@RestController
@RequestMapping("/api/chat")
public class ChatController {
private final LlmStreamingService llmStreamingService;
public ChatController(LlmStreamingService llmStreamingService) {
this.llmStreamingService = llmStreamingService;
}
@GetMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
public Flux<String> streamChat(@RequestParam String sessionId, @RequestParam String message) {
return llmStreamingService.stream(sessionId, message);
}
}
LLM Gateway with resilience (Resilience4j)
@Service
public class LlmGateway {
private final RemoteModelClient remoteModelClient;
public LlmGateway(RemoteModelClient remoteModelClient) {
this.remoteModelClient = remoteModelClient;
}
@RateLimiter(name = "llmGateway")
@CircuitBreaker(name = "llmGateway", fallbackMethod = "fallback")
public Mono<String> chat(ChatRequest request) {
return remoteModelClient.chat(request)
.timeout(Duration.ofSeconds(20));
}
public Mono<String> fallback(ChatRequest request, Throwable ex) {
return Mono.just("当前智能服务繁忙,我先为你返回简化结果,请稍后重试更复杂问题。");
}
}
Observability, governance and security
Key metrics to monitor include request volume, success/error rates, latency percentiles (P50/P95/P99), LLM call counts and costs, embedding latency, vector retrieval latency, cache hit ratio, token consumption per request, and per‑day model cost.
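Per-request cost tracking reduces to simple arithmetic once token counts are recorded per call. The sketch below uses hypothetical per-million-token prices; substitute your provider's actual model pricing:

```java
public class TokenCostEstimator {
    // Hypothetical prices in USD per one million tokens; actual prices
    // vary by model and provider.
    private final double inputPricePerMTok;
    private final double outputPricePerMTok;

    public TokenCostEstimator(double inputPricePerMTok, double outputPricePerMTok) {
        this.inputPricePerMTok = inputPricePerMTok;
        this.outputPricePerMTok = outputPricePerMTok;
    }

    // Cost of one request given prompt and completion token counts.
    public double costUsd(long promptTokens, long completionTokens) {
        return promptTokens / 1_000_000.0 * inputPricePerMTok
                + completionTokens / 1_000_000.0 * outputPricePerMTok;
    }
}
```

Summing this per request into a per-day counter gives the daily model cost metric mentioned above.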
Log chains must contain traceId, user request, intent, retrieved evidence, prompt version, tool call records (name, args, result, timestamp), model response and final answer.
Prompt management should be externalized to a configuration center with versioning, gray‑release and A/B testing support.
Security measures: prompt‑injection protection, tenant‑aware metadata filtering, data redaction before model input, strict permission checks for high‑risk Skills, and mandatory audit for write‑operations.
Scalability and deployment
Multi‑level caching strategy (L1 Caffeine, L2 Redis, L3 vector DB / LLM) reduces expensive calls. Semantic caching (embedding‑based similarity) improves hit rates for paraphrased queries.
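Semantic caching can be sketched as a nearest-neighbour lookup over cached query embeddings. The in-memory list and the similarity threshold below are simplifications; a production setup would store entries in Redis and obtain embeddings from a real embedding model:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

public class SemanticCache {
    private record Entry(float[] embedding, String answer) {}
    private final List<Entry> entries = new ArrayList<>();
    private final double threshold;

    public SemanticCache(double threshold) { this.threshold = threshold; }

    public void put(float[] embedding, String answer) {
        entries.add(new Entry(embedding, answer));
    }

    // Returns the cached answer whose embedding is most similar to the
    // query, if that similarity exceeds the threshold.
    public Optional<String> lookup(float[] query) {
        Entry best = null;
        double bestSim = -1;
        for (Entry e : entries) {
            double sim = cosine(query, e.embedding());
            if (sim > bestSim) { bestSim = sim; best = e; }
        }
        return bestSim >= threshold ? Optional.of(best.answer()) : Optional.empty();
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}
```

This is why paraphrased queries ("how many vacation days do I get" vs. "annual leave entitlement") can hit the same cached answer even though their exact strings differ.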
Stateless design (session in Redis, async tasks in DB/Redis, prompts in config service) enables horizontal scaling via Kubernetes HPA.
Dockerfile example
FROM eclipse-temurin:17-jre
WORKDIR /app
COPY target/smart-ai-assistant.jar app.jar
ENTRYPOINT ["java","-Xms1g","-Xmx1g","-XX:+UseG1GC","-XX:MaxGCPauseMillis=200","-jar","app.jar"]
Kubernetes deployment snippet
apiVersion: apps/v1
kind: Deployment
metadata:
name: smart-ai-assistant
spec:
replicas: 3
selector:
matchLabels:
app: smart-ai-assistant
template:
metadata:
labels:
app: smart-ai-assistant
spec:
containers:
- name: app
image: registry.example.com/smart-ai-assistant:1.0.0
ports:
- containerPort: 8081
env:
- name: OPENAI_API_KEY
valueFrom:
secretKeyRef:
name: llm-secret
key: api-key
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "2"
memory: "3Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: smart-ai-assistant-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: smart-ai-assistant
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Common pitfalls checklist
Skipping rerank after vector retrieval.
Allowing the Agent unrestricted tool access.
Unbounded concatenation of conversation history.
Tool outputs without a stable schema.
Lack of fallback strategies for downstream failures.
Testing only functionality without measuring cost and latency.
Roadmap for production rollout
Stage 1 – LLM API integration: basic chat, prompt standards, streaming.
Stage 2 – Add RAG: document ingestion pipeline, evidence‑based answering, evaluation.
Stage 3 – Controlled Agent: tool calling, intent routing, audit.
Stage 4 – Skill engineering: service isolation, permission, timeout, idempotency, async execution.
Stage 5 – Platformization: unified LLM gateway, prompt platform, multi‑model routing, observability.
Conclusion
The real barrier for Java engineers is not “calling a model” but “building a correct system”. By combining LLM, RAG, Agent and Skill in a well‑engineered backend, teams achieve trustworthy, executable, scalable and cost‑controlled AI assistants that fit enterprise needs.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Ray's Galactic Tech
