Java Engineer’s Complete Guide to Enterprise LLM Apps: LLM, Agent, RAG & Skill

This article walks Java engineers through building production‑grade enterprise AI assistants: the roles of LLM, RAG, Agent and Skill, a layered reference architecture, best‑practice code samples, and deployment, observability, security and cost‑control considerations.


Why Java engineers should read this article

Large language models (LLMs) are no longer experimental; they are entering core business workflows. For Java developers the real challenge is not merely calling an LLM API, but handling hallucinations, integrating enterprise knowledge bases, managing high‑concurrency traffic, and providing canary (gray) releases, observability, scalability and rollback.

Unified view of LLM, RAG, Agent and Skill

In production systems these four components form a single chain:

LLM – understands, reasons and generates text.

RAG – injects real enterprise knowledge into the context to suppress hallucinations.

Agent – decides when to retrieve, when to call tools and when to finish.

Skill – turns tool capabilities into stable, autonomous, governable business units.

The article explains each component in depth, showing why they must be orchestrated together rather than treated as isolated modules.

Enterprise‑grade architecture (not a single controller → OpenAI)

A production‑ready architecture must address request entry, rate‑limiting, conversation state, intent routing, retrieval‑augmented generation, agent orchestration, skill execution, observability, caching, cost control and async back‑pressure. The recommended reference diagram includes:

Client / Web / App

API Gateway

Chat Service (orchestration entry)

Intent Router

RAG Pipeline

Agent Orchestrator

Embedding Service

Vector DB (Milvus / PgVector)

Rerank Service

Skill Registry

Order / Refund / IT Ticket Skills

Redis Session / Cache

Kafka Async Bus

Observability (Prometheus / Grafana / ELK)

LLM Gateway (OpenAI / Qwen / Private Model)

Chat Service responsibilities

Receive requests

Maintain context

Invoke intent router

Dispatch to RAG, Agent or simple QA

Aggregate results and return (a minimal dispatch sketch follows)
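
A minimal sketch of that dispatch logic is shown below. ChatAnswer, RagPipeline, AgentOrchestrator and SimpleQaService are illustrative names for collaborators, not a fixed API:

import org.springframework.stereotype.Service;
import reactor.core.publisher.Mono;

@Service
public class ChatService {
    private final IntentRouter intentRouter;
    private final RagPipeline ragPipeline;             // illustrative collaborator
    private final AgentOrchestrator agentOrchestrator; // illustrative collaborator
    private final SimpleQaService simpleQaService;     // illustrative collaborator

    public ChatService(IntentRouter intentRouter, RagPipeline ragPipeline,
                       AgentOrchestrator agentOrchestrator, SimpleQaService simpleQaService) {
        this.intentRouter = intentRouter;
        this.ragPipeline = ragPipeline;
        this.agentOrchestrator = agentOrchestrator;
        this.simpleQaService = simpleQaService;
    }

    public Mono<ChatAnswer> handle(ChatCommand command) {
        IntentType intent = intentRouter.route(command);
        return switch (intent) {
            case POLICY_RAG, FAQ -> ragPipeline.answer(command);              // evidence-backed answer
            case ACTION_REQUEST, EMPLOYEE_QUERY, COMPLEX ->
                    agentOrchestrator.run(command);                           // tool-calling agent
            default -> simpleQaService.answer(command);                       // small talk / plain chat
        };
    }
}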

LLM Gateway abstraction

The gateway must unify model routing, timeout control, retry strategy, rate‑limiting, token accounting and request logging. Without it, switching models or moving from cloud to on‑premise inference becomes costly.
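
One way to express that boundary is a narrow contract behind which all vendor SDKs hide. The interface below is an illustrative sketch, not a prescribed API:

import java.util.List;
import reactor.core.publisher.Mono;

// Illustrative gateway contract; method and type names are assumptions.
public interface LlmGatewayContract {
    // One entry point regardless of vendor: model routing, timeout, retry,
    // rate limiting, token accounting and request logging live behind this call.
    Mono<ChatResult> chat(ChatRequest request);

    Mono<List<Float>> embed(String text);

    // Token usage travels with the answer so cost accounting stays central.
    record ChatResult(String content, String modelUsed, int promptTokens, int completionTokens) {}
}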

Skill engineering principles

Clear input‑output contracts

Permission rules

Retry, circuit‑breaker and idempotency

Audit logging

SLA metrics (a contract sketch follows the list)
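
A minimal contract sketch that encodes these principles; the interface shape is an assumption, not a prescribed API:

// Illustrative Skill contract: typed input/output, an explicit permission
// rule, and an idempotency key so retries can never execute a write twice.
// Retry, circuit-breaking, audit logging and SLA metrics wrap implementations
// of this interface (e.g. via Resilience4j decorators).
public interface Skill<I, O> {
    String name();

    // Permission rule: evaluated before every invocation.
    boolean isAllowed(String userId, String department);

    // The idempotency key makes retried executions safe.
    O execute(I input, String idempotencyKey);
}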

Deep dive into RAG pitfalls and best practices

RAG failures are usually caused by poor retrieval chain design rather than the model itself. Key points:

Chunk size matters – too large mixes topics, too small loses semantic completeness.

Logical slicing (by headings, paragraphs, tables) before fixed‑length slicing improves relevance.

Typical chunk sizes: 500‑800 tokens for policy documents, 200‑400 for FAQs (a chunking sketch follows).
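
A sketch of that two-pass strategy, using character lengths as a stand-in for token counts (production code should use a real tokenizer):

import java.util.ArrayList;
import java.util.List;

public final class Chunker {
    // Pass 1: logical split on blank lines (paragraph/heading boundaries).
    // Pass 2: fixed-length windows with overlap for oversized sections.
    public static List<String> chunk(String document, int maxLen, int overlap) {
        List<String> chunks = new ArrayList<>();
        for (String section : document.split("\\n\\s*\\n")) {
            String s = section.strip();
            if (s.isEmpty()) {
                continue;
            }
            if (s.length() <= maxLen) {
                chunks.add(s);
                continue;
            }
            for (int start = 0; start < s.length(); start += maxLen - overlap) {
                chunks.add(s.substring(start, Math.min(s.length(), start + maxLen)));
            }
        }
        return chunks;
    }
}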

Retrieval should be multi‑stage (a pipeline sketch follows the list):

Coarse recall (vector Top‑K)

Metadata filtering (tenant, department, time range)

Rerank with a dedicated model

Final clipping to control token budget.
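
Put together, the stages look roughly like this inside a RAG pipeline service. The rerankService collaborator and the QaProperties accessors (topK(), finalTopN()) are assumptions based on the configuration and the MilvusKnowledgeService shown later:

// Multi-stage retrieval sketch inside a RAG pipeline service.
public List<KnowledgeDocument> retrieve(String question, String department) {
    // 1. Coarse recall: vector Top-K (K deliberately generous, e.g. 8).
    List<KnowledgeDocument> candidates =
            milvusKnowledgeService.search(question, department, properties.rag().topK());
    // 2. Metadata filtering: tenant/department guards on top of the vector filter.
    List<KnowledgeDocument> filtered = candidates.stream()
            .filter(doc -> department.equals(doc.department()))
            .toList();
    // 3. Rerank with a dedicated model (illustrative service name).
    List<KnowledgeDocument> reranked = rerankService.rerank(question, filtered);
    // 4. Final clipping to control the token budget (e.g. top 3).
    return reranked.subList(0, Math.min(properties.rag().finalTopN(), reranked.size()));
}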

Force RAG for policy‑type intents; let the Agent handle open‑ended queries.

Code walkthroughs

Maven dependencies (production‑grade)

<dependencies>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-webflux</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-validation</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-redis-reactive</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.kafka</groupId>
        <artifactId>spring-kafka</artifactId>
    </dependency>
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j-open-ai</artifactId>
        <version>0.35.0</version>
    </dependency>
    <dependency>
        <groupId>dev.langchain4j</groupId>
        <artifactId>langchain4j</artifactId>
        <version>0.35.0</version>
    </dependency>
    <dependency>
        <groupId>io.milvus</groupId>
        <artifactId>milvus-sdk-java</artifactId>
        <version>2.5.8</version>
    </dependency>
    <dependency>
        <groupId>io.github.resilience4j</groupId>
        <artifactId>resilience4j-spring-boot3</artifactId>
    </dependency>
    <dependency>
        <groupId>io.micrometer</groupId>
        <artifactId>micrometer-registry-prometheus</artifactId>
    </dependency>
    <dependency>
        <groupId>com.github.ben-manes.caffeine</groupId>
        <artifactId>caffeine</artifactId>
    </dependency>
</dependencies>

Configuration file (YAML‑style excerpt)

server:
  port: 8081

spring:
  application:
    name: smart-ai-assistant
  data:
    redis:
      host: redis
      port: 6379
  kafka:
    bootstrap-servers: kafka:9092

management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus

ai:
  llm:
    provider: openai
    chat-model: gpt-4.1-mini
    embedding-model: text-embedding-3-small
    timeout-seconds: 20
    max-retries: 2
    temperature: 0.0
  rag:
    top-k: 8
    final-top-n: 3
    min-score: 0.72
    chunk-size: 700
    chunk-overlap: 120
  agent:
    max-iters: 5
    force-rag-intents:
      - POLICY_RAG
      - FAQ_RAG
  session:
    history-size: 10
    summary-threshold: 30
  skill:
    async-timeout-ms: 3000

resilience4j:
  ratelimiter:
    instances:
      llmGateway:
        limit-for-period: 120
        limit-refresh-period: 1s
        timeout-duration: 200ms
  circuitbreaker:
    instances:
      llmGateway:
        failure-rate-threshold: 50
        sliding-window-size: 20

Intent routing service (Java)

package com.example.aiqa.domain.service;

import com.example.aiqa.domain.model.ChatCommand;
import com.example.aiqa.domain.model.IntentType;
import org.springframework.stereotype.Service;

@Service
public class IntentRouter {
    public IntentType route(ChatCommand command) {
        String message = command.message().trim().toLowerCase();
        // Greetings ("你好" = hello, "在吗" = are you there)
        if (containsAny(message, "你好", "hello", "hi", "在吗")) {
            return IntentType.SMALL_TALK;
        }
        // Policy keywords: annual leave, reimbursement, rules, process, policy, standards, incentives
        if (containsAny(message, "年假", "报销", "制度", "流程", "政策", "规范", "激励")) {
            return IntentType.POLICY_RAG;
        }
        // Employee-data keywords: balance, employee ID, employee, department, personnel file
        if (containsAny(message, "余额", "工号", "员工", "部门", "档案")) {
            return IntentType.EMPLOYEE_QUERY;
        }
        // Action keywords: create ticket, submit ticket, apply, approve, report repair, VPN
        if (containsAny(message, "创建工单", "提工单", "申请", "审批", "报修", "vpn")) {
            return IntentType.ACTION_REQUEST;
        }
        // Very short messages are treated as FAQ lookups
        if (message.length() <= 8) {
            return IntentType.FAQ;
        }
        return IntentType.COMPLEX;
    }

    private boolean containsAny(String message, String... keywords) {
        for (String keyword : keywords) {
            if (message.contains(keyword)) {
                return true;
            }
        }
        return false;
    }
}

Milvus knowledge service (RAG core)

@Service
public class MilvusKnowledgeService {
    private final QaProperties properties;
    private final EmbeddingService embeddingService;
    private volatile MilvusClientV2 client;
    private volatile String unavailableReason;

    public List<KnowledgeDocument> search(String question, String department, int topK) {
        if (!isAvailable()) {
            return List.of();
        }
        initializeCollection();
        List<Float> queryVector = embeddingService.embed(question);
        SearchReq searchReq = SearchReq.builder()
                .databaseName(properties.milvus().dbName())
                .collectionName(properties.milvus().collectionName())
                .data(List.of(new FloatVec(queryVector)))
                .topK(topK)
                .annsField(properties.milvus().vectorField())
                .metricType(IndexParam.MetricType.COSINE)
                .filter(buildFilter(department))
                .outputFields(List.of("*"))
                .build();
        SearchResp resp = client().search(searchReq);
        List<KnowledgeDocument> result = new ArrayList<>();
        for (List<SearchResp.SearchResult> group : resp.getSearchResults()) {
            for (SearchResp.SearchResult item : group) {
                Map<String, Object> entity = item.getEntity();
                result.add(new KnowledgeDocument(
                        String.valueOf(entity.getOrDefault("documentId", "")),
                        String.valueOf(entity.getOrDefault("title", "")),
                        String.valueOf(entity.getOrDefault("content", "")),
                        String.valueOf(entity.getOrDefault("source", "")),
                        String.valueOf(entity.getOrDefault("department", "")),
                        null));
            }
        }
        return result;
    }

    public String searchAsText(String question, String department, int topK) {
        List<KnowledgeDocument> docs = search(question, department, topK);
        if (docs.isEmpty()) {
            return "No relevant knowledge was retrieved.";
        }
        return docs.stream()
                .map(doc -> "[" + doc.title() + "] " + doc.content() + " (source: " + doc.source() + ")")
                .collect(Collectors.joining("\n"));
    }
}

Agent orchestration example (ReAct pattern)

@Service
public class AgentScopeQaService {
    private final QaProperties properties;
    private final IntentRouter intentRouter;
    private final EnterpriseToolFunctions toolFunctions;

    public AgentAnswer answer(ChatContext context) {
        // Assumes an IntentRouter overload that routes on the raw question text.
        IntentType intent = intentRouter.route(context.question());
        // Policy/FAQ intents are forced down the RAG path instead of free-form agent reasoning.
        if (intent == IntentType.POLICY_RAG || intent == IntentType.FAQ) {
            return answerWithForcedRag(context);
        }
        if (properties.dashscopeApiKey() == null || properties.dashscopeApiKey().isBlank()) {
            return fallbackAnswer(intent, context);
        }
        DashScopeChatModel model = DashScopeChatModel.builder()
                .apiKey(properties.dashscopeApiKey())
                .modelName(properties.chatModel())
                .formatter(new DashScopeChatFormatter())
                .build();
        Toolkit toolkit = new Toolkit();
        toolkit.registerTool(new AgentTools(toolFunctions, context.department(), context.userId()));
        ReActAgent agent = ReActAgent.builder()
                .name("enterprise-qa-agent")
                .sysPrompt(buildPrompt(intent, context))
                .model(model)
                .toolkit(toolkit)
                .memory(new InMemoryMemory())
                .maxIters(properties.agent().maxIters())
                .build();
        Msg msg = Msg.builder().name("user").role(MsgRole.USER).textContent(context.question()).build();
        Msg response = agent.call(msg).block();
        String answer = response == null ? fallbackAnswer(intent, context).answer() : response.getTextContent();
        return new AgentAnswer(answer, "AGENTSCOPE_REACT", List.of(), ToolAuditContext.snapshot());
    }
}

Conversation session store (file‑based prototype)

@Component
public class ConversationSessionStore {
    private final QaProperties properties;
    private final ObjectMapper objectMapper;

    public List<String> history(String sessionId) {
        try {
            Path path = sessionFile(sessionId);
            if (!Files.exists(path)) {
                return new ArrayList<>();
            }
            return objectMapper.readValue(Files.readString(path), new TypeReference<List<String>>() {});
        } catch (Exception ex) {
            return new ArrayList<>();
        }
    }

    public void append(String sessionId, String role, String message) {
        try {
            Path path = sessionFile(sessionId);
            Files.createDirectories(path.getParent());
            List<String> history = history(sessionId);
            history.add(role + ": " + message);
            int maxSize = properties.session().historySize() * 2;
            if (history.size() > maxSize) {
                history = new ArrayList<>(history.subList(history.size() - maxSize, history.size()));
            }
            Files.writeString(path, objectMapper.writerWithDefaultPrettyPrinter().writeValueAsString(history));
        } catch (Exception ex) {
            throw new IllegalStateException("Failed to persist session history", ex);
        }
    }
}

WebFlux SSE streaming controller

@RestController
@RequestMapping("/api/chat")
public class ChatController {
    private final LlmStreamingService llmStreamingService;

    public ChatController(LlmStreamingService llmStreamingService) {
        this.llmStreamingService = llmStreamingService;
    }

    @GetMapping(value = "/stream", produces = MediaType.TEXT_EVENT_STREAM_VALUE)
    public Flux<String> streamChat(@RequestParam String sessionId, @RequestParam String message) {
        return llmStreamingService.stream(sessionId, message);
    }
}

LLM Gateway with resilience (Resilience4j)

@Service
public class LlmGateway {
    private final RemoteModelClient remoteModelClient; // vendor-specific client behind the abstraction

    public LlmGateway(RemoteModelClient remoteModelClient) {
        this.remoteModelClient = remoteModelClient;
    }

    @RateLimiter(name = "llmGateway")
    @CircuitBreaker(name = "llmGateway", fallbackMethod = "fallback")
    public Mono<String> chat(ChatRequest request) {
        return remoteModelClient.chat(request)
                .timeout(Duration.ofSeconds(20));
    }

    public Mono<String> fallback(ChatRequest request, Throwable ex) {
        return Mono.just("The AI service is busy right now; here is a simplified answer. Please retry complex questions later.");
    }
}

Observability, governance and security

Key metrics to monitor include request volume, success/error rates, latency percentiles (P50/P95/P99), LLM call counts and costs, embedding latency, vector retrieval latency, cache hit ratio, token consumption per request, and per‑day model cost.
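
A minimal Micrometer sketch for the latency and token metrics; the metric and tag names are illustrative conventions, not a standard:

import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.time.Duration;
import org.springframework.stereotype.Component;
import reactor.core.publisher.Mono;

@Component
public class LlmMetrics {
    private final MeterRegistry registry;

    public LlmMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    // Wraps an LLM call and records P50/P95/P99 latency per model.
    public <T> Mono<T> timed(String model, Mono<T> call) {
        Timer timer = Timer.builder("llm.request.latency")
                .tag("model", model)
                .publishPercentiles(0.5, 0.95, 0.99)
                .register(registry); // Micrometer de-duplicates by name + tags
        long start = System.nanoTime();
        return call.doOnTerminate(() ->
                timer.record(Duration.ofNanos(System.nanoTime() - start)));
    }

    // Token counters feed per-request and per-day cost dashboards.
    public void recordTokens(String model, long promptTokens, long completionTokens) {
        registry.counter("llm.tokens.prompt", "model", model).increment(promptTokens);
        registry.counter("llm.tokens.completion", "model", model).increment(completionTokens);
    }
}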

Log chains must contain traceId, user request, intent, retrieved evidence, prompt version, tool call records (name, args, result, timestamp), model response and final answer.
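
One way to keep the tool-call part of that chain structured is a dedicated record serialized into the log pipeline; the field set below is a sketch of the list above:

import java.time.Instant;
import java.util.Map;

// Serialized as JSON into ELK alongside the rest of the trace.
public record ToolCallAudit(
        String traceId,
        String promptVersion,
        String toolName,
        Map<String, Object> args,
        String result,
        Instant timestamp) {}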

Prompt management should be externalized to a configuration center with versioning, gray‑release and A/B testing support.

Security measures: prompt‑injection protection, tenant‑aware metadata filtering, data redaction before model input, strict permission checks for high‑risk Skills, and mandatory auditing of write operations, as in the guard sketch below.
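
For the high‑risk Skill checks, a small guard in front of the Skill contract sketched earlier could look like this (AuditLog is an illustrative sink, not an existing API):

public final class SkillGuard {
    // Illustrative audit sink; in practice this writes to the log chain / Kafka.
    interface AuditLog {
        void record(String skillName, String userId, String stage, Object payload);
    }

    private final AuditLog auditLog;

    public SkillGuard(AuditLog auditLog) {
        this.auditLog = auditLog;
    }

    public <I, O> O execute(Skill<I, O> skill, String userId, String department,
                            I input, String idempotencyKey) {
        // Strict permission check before any high-risk (write) operation.
        if (!skill.isAllowed(userId, department)) {
            auditLog.record(skill.name(), userId, "DENIED", input);
            throw new SecurityException("Skill " + skill.name() + " denied for user " + userId);
        }
        // Mandatory audit for write operations: record before and after execution.
        auditLog.record(skill.name(), userId, "EXECUTE", input);
        O result = skill.execute(input, idempotencyKey);
        auditLog.record(skill.name(), userId, "RESULT", result);
        return result;
    }
}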

Scalability and deployment

Multi‑level caching strategy (L1 Caffeine, L2 Redis, L3 vector DB / LLM) reduces expensive calls. Semantic caching (embedding‑based similarity) improves hit rates for paraphrased queries.
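
A minimal two-level lookup sketch with Caffeine as L1 and reactive Redis as L2; a semantic cache would add an embedding-similarity probe before the final fall-through:

import com.github.benmanes.caffeine.cache.Cache;
import com.github.benmanes.caffeine.cache.Caffeine;
import java.time.Duration;
import org.springframework.data.redis.core.ReactiveStringRedisTemplate;
import org.springframework.stereotype.Component;
import reactor.core.publisher.Mono;

@Component
public class AnswerCache {
    // L1: in-process, per instance; keeps the hottest answers off the network.
    private final Cache<String, String> l1 = Caffeine.newBuilder()
            .maximumSize(10_000)
            .expireAfterWrite(Duration.ofMinutes(10))
            .build();
    // L2: Redis, shared across all instances.
    private final ReactiveStringRedisTemplate redis;

    public AnswerCache(ReactiveStringRedisTemplate redis) {
        this.redis = redis;
    }

    public Mono<String> getOrCompute(String key, Mono<String> expensiveCall) {
        String local = l1.getIfPresent(key);
        if (local != null) {
            return Mono.just(local); // L1 hit
        }
        return redis.opsForValue().get(key)            // L2 lookup
                .doOnNext(value -> l1.put(key, value)) // refill L1 on L2 hit
                .switchIfEmpty(expensiveCall.flatMap(answer ->
                        redis.opsForValue().set(key, answer, Duration.ofHours(1))
                                .thenReturn(answer)
                                .doOnNext(v -> l1.put(key, v)))); // double miss: compute and backfill
    }
}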

Stateless design (session in Redis, async tasks in DB/Redis, prompts in config service) enables horizontal scaling via Kubernetes HPA.

Dockerfile example

FROM eclipse-temurin:17-jre
WORKDIR /app
COPY target/smart-ai-assistant.jar app.jar
ENTRYPOINT ["java","-Xms1g","-Xmx1g","-XX:+UseG1GC","-XX:MaxGCPauseMillis=200","-jar","app.jar"]

Kubernetes deployment snippet

apiVersion: apps/v1
kind: Deployment
metadata:
  name: smart-ai-assistant
spec:
  replicas: 3
  selector:
    matchLabels:
      app: smart-ai-assistant
  template:
    metadata:
      labels:
        app: smart-ai-assistant
    spec:
      containers:
        - name: app
          image: registry.example.com/smart-ai-assistant:1.0.0
          ports:
            - containerPort: 8081
          env:
            - name: OPENAI_API_KEY
              valueFrom:
                secretKeyRef:
                  name: llm-secret
                  key: api-key
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "3Gi"
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: smart-ai-assistant-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: smart-ai-assistant
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70

Common pitfalls checklist

Skipping rerank after vector retrieval.

Allowing the Agent unrestricted tool access.

Unbounded concatenation of conversation history.

Tool outputs without a stable schema.

Lack of fallback strategies for downstream failures.

Testing only functionality without measuring cost and latency.

Roadmap for production rollout

Stage 1 – LLM API integration: basic chat, prompt standards, streaming.

Stage 2 – Add RAG: document ingestion pipeline, evidence‑based answering, evaluation.

Stage 3 – Controlled Agent: tool calling, intent routing, audit.

Stage 4 – Skill engineering: service isolation, permission, timeout, idempotency, async execution.

Stage 5 – Platformization: unified LLM gateway, prompt platform, multi‑model routing, observability.

Conclusion

The real barrier for Java engineers is not “calling a model” but “building a correct system”. By combining LLM, RAG, Agent and Skill in a well‑engineered backend, teams achieve trustworthy, executable, scalable and cost‑controlled AI assistants that fit enterprise needs.
