From Bottlenecks to a High‑Concurrency Medical Assistant with LangChain4j
This guide details how to design and implement a production‑grade, high‑concurrency medical AI assistant using LangChain4j, Spring Boot, Redis, and Kubernetes, covering architecture, RAG‑enhanced retrieval, controlled tool invocation, guardrails, idempotent transactions, scaling strategies and observability to ensure reliable, compliant patient interactions.
Business Background and Requirements
The assistant must support four capability domains: medical QA, department guidance, process consulting, and convenience services. Each domain has distinct risk levels and technical requirements, ranging from knowledge retrieval to real‑time transaction handling.
User‑side goals: first‑token latency < 1 s, instant answers for common questions, multi‑turn context, and answers bounded by authoritative medical sources.
Platform goals: elastic scaling for peak loads, session isolation, tool‑call auditing, traceability, multi‑model switching, and gray‑release of knowledge updates.
Safety goals: answer only within authorized scope, reject high‑risk or illegal queries, and always direct users to offline care for critical symptoms.
Overall Architecture
The LLM sits in a controlled middle layer. The flow is:

Web / App / Mini‑Program → API Gateway / WAF / Auth → Medical Agent Service (LangChain4j)

Inside the agent service: Intent Router, Session Memory, Safety Guard, RAG Orchestrator, Tool Executor, Fallback.

Downstream dependencies:
Knowledge Retrieval (Vector + Re‑rank) → Vector DB / ES
Tool Gateway (Appointment) → HIS / Slot / Order / MQ
Observability Hub → Redis / MQ / DB (cache, session, event stream)

The principle is “layered control”: the LLM provides reasoning, while all business rules, state, and execution are enforced by surrounding services.
Why the Model Must Not Call All Services Directly
Uncontrolled calls cause thread‑pool exhaustion, connection‑pool overload, and cascading downstream failures.
Model parameters can be unstable, producing dirty requests.
High‑risk operations would lack approval, idempotency, and audit.
Prompt injection can push the model beyond its medical boundaries.
Controlled AI Service Design
@AiService(
chatLanguageModel = "routingChatModel",
chatMemoryProvider = "redisChatMemoryProvider",
retrievalAugmentor = "medicalRetrievalAugmentor",
tools = "appointmentTools"
)
public interface MedicalAssistantService {
@SystemMessage("""
You are an intelligent medical assistant for a tertiary hospital. Your duties are limited to:
1. Answer medical queries based on the knowledge base.
2. When the user authorizes and information is complete, call controlled tools for slot query, appointment creation, or cancellation.
3. For high‑risk symptoms, advise emergency care and never replace a doctor’s diagnosis.
4. Refuse or hand over any out‑of‑scope, illegal, or insufficient‑information queries.
5. Never fabricate medical conclusions, slot information, pricing, or order status.
Answer format: conclusion first, then explanation.
""")
TokenStream chat(@MemoryId String sessionId, @UserMessage String userMessage);
}

Key production points:
SystemMessage is a policy declaration, not a persona.
Tool permissions are minimal; only expose safe, structured capabilities.
Medical boundaries and refusal conditions are hard‑coded in the system layer.
Model Splitting
Intent‑recognition model: lightweight, fast, cheap.
Answer model: handles knowledge generation.
Tool‑routing model: decides parameters and whether a tool call is allowed.
This reduces cost, latency, and concurrency pressure on the main LLM.
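A minimal sketch of such a split. The intent names, keyword rules, and model bean names below are illustrative assumptions, not LangChain4j identifiers; in production the classification step would itself be the cheap intent‑recognition model.

```java
// Sketch: routing each request to a differently sized model by intent.
// All identifiers here are hypothetical.
public class ModelRouter {

    public enum Intent { FAQ, KNOWLEDGE_QA, TRANSACTION }

    // Crude keyword classifier standing in for the lightweight intent model.
    public static Intent classify(String query) {
        String q = query.toLowerCase();
        if (q.contains("book") || q.contains("cancel") || q.contains("appointment")) {
            return Intent.TRANSACTION;
        }
        if (q.contains("opening hours") || q.contains("price")) {
            return Intent.FAQ;
        }
        return Intent.KNOWLEDGE_QA;
    }

    // Map each intent to a model bean name (hypothetical identifiers).
    public static String modelFor(Intent intent) {
        return switch (intent) {
            case FAQ -> "lightChatModel";               // fast, cheap
            case KNOWLEDGE_QA -> "answerChatModel";     // RAG-backed generation
            case TRANSACTION -> "toolRoutingChatModel"; // strict tool routing
        };
    }
}
```

Only `TRANSACTION` traffic ever reaches the tool‑routing model, which keeps the expensive path short.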
Core Principle – RAG + Tool + Guardrail
RAG provides bounded, source‑traceable answers.
Tool executes real transactions (slot query, appointment creation, etc.).
Guardrail enforces input filtering, prompt safety, output risk checks, and audit.
One‑sentence summary: RAG decides what the model knows, Tool decides what it can do, Guardrail decides what it must never do.
Knowledge Base Layering
Clinical knowledge layer: guidelines, consensus, drug labels – medium update frequency – vector + re‑rank retrieval.
Hospital rules layer: registration rules, visit instructions – high update frequency – keyword + vector retrieval.
Real‑time business layer: slot inventory, pricing, scheduling – real‑time – must be accessed via tools, never vectorized.
Mixing real‑time data into the vector store leads to stale answers and over‑booking.
Document Splitting Strategy
Split by chapter, section, indication, contraindication, or FAQ rather than fixed character count to preserve semantic completeness.
@Bean
public DocumentSplitter medicalDocumentSplitter() {
return DocumentSplitters.recursive(500, 80);
}

Hybrid Retrieval Augmentor
@Bean
public RetrievalAugmentor medicalRetrievalAugmentor(
EmbeddingStore<TextSegment> embeddingStore,
EmbeddingModel embeddingModel,
KeywordRetriever keywordRetriever,
MedicalReranker reranker) {
ContentRetriever vectorRetriever = EmbeddingStoreContentRetriever.builder()
.embeddingStore(embeddingStore)
.embeddingModel(embeddingModel)
.maxResults(8)
.minScore(0.72)
.build();
ContentRetriever hybridRetriever = query -> {
List<Content> vector = vectorRetriever.retrieve(query);
List<Content> keyword = keywordRetriever.retrieve(query);
List<Content> merged = Stream.concat(vector.stream(), keyword.stream())
.distinct().toList();
// ContentRetriever.retrieve returns List<Content>, so the re-ranked list is returned directly
return reranker.rerank(query.text(), merged);
};
return DefaultRetrievalAugmentor.builder()
.contentRetriever(hybridRetriever)
.build();
}

This reduces the recall bias of pure vector retrieval and improves medical relevance.
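The `distinct()` merge treats a document found by both retrievers the same as one found by either alone. An alternative that rewards agreement between the two retrievers is reciprocal rank fusion (RRF). A sketch over plain document IDs (the class and constant k = 60 are illustrative, not part of LangChain4j):

```java
import java.util.*;

// Sketch: merging vector and keyword result lists with reciprocal rank
// fusion before re-ranking. A document scores 1/(k + rank) per list it
// appears in, so documents found by both retrievers rise to the top.
public class RrfMerger {

    public static List<String> merge(List<String> vectorHits, List<String> keywordHits) {
        Map<String, Double> score = new HashMap<>();
        accumulate(score, vectorHits);
        accumulate(score, keywordHits);
        return score.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue().reversed())
                .map(Map.Entry::getKey)
                .toList();
    }

    private static void accumulate(Map<String, Double> score, List<String> hits) {
        for (int rank = 0; rank < hits.size(); rank++) {
            // k = 60 is the value commonly used in the RRF literature
            score.merge(hits.get(rank), 1.0 / (60 + rank + 1), Double::sum);
        }
    }
}
```

Here a document ranked mid-list by both retrievers can outrank one that only a single retriever placed first, which is usually what you want before the re-ranker runs.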
Tool Design – Governance and Idempotency
@Component
public class AppointmentTools {
private final AppointmentApplicationService appointmentApplicationService;
private final ToolAuditService toolAuditService;
public AppointmentTools(AppointmentApplicationService a, ToolAuditService t) {
this.appointmentApplicationService = a;
this.toolAuditService = t;
}
@Tool("Query available appointment slots for a given date and department")
public ToolResult<List<SlotView>> querySlots(
@P("User ID") String userId,
@P("Department code") String departmentCode,
@P("Visit date in yyyy-MM-dd format") String visitDate) {
ToolInvocationContext ctx = ToolInvocationContext.current();
toolAuditService.beforeInvoke(ctx, "querySlots");
try {
List<SlotView> slots = appointmentApplicationService.querySlots(userId, departmentCode, LocalDate.parse(visitDate));
ToolResult<List<SlotView>> result = ToolResult.ok(slots);
toolAuditService.afterInvoke(ctx, "querySlots", result);
return result;
} catch (BizException ex) {
ToolResult<List<SlotView>> result = ToolResult.fail(ex.getCode(), ex.getMessage(), false);
toolAuditService.afterInvoke(ctx, "querySlots", result);
return result;
} catch (Exception ex) {
ToolResult<List<SlotView>> result = ToolResult.fail("TOOL_TEMPORARY_ERROR", "Slot query failed, please try again later", true);
toolAuditService.afterInvoke(ctx, "querySlots", result);
return result;
}
}
// createAppointment omitted for brevity
}

An idempotent requestId prevents duplicate bookings caused by network retries, SSE interruptions, or model re‑invocation.
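One robust way to obtain such a requestId is to derive it deterministically from the business identity of the booking, so that every retry of the same logical booking maps to the same key. A sketch; the choice of fields (session, patient, slot) is an assumption and would follow your own booking model:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Sketch: deterministic idempotency key. Network retries, SSE reconnects,
// and model re-invocations all recompute the same hash, so the downstream
// dedup check (e.g. the Redis requestKey) catches them.
public class DeterministicRequestId {

    public static String of(String sessionId, String patientId, String slotId) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(
                    (sessionId + "|" + patientId + "|" + slotId)
                            .getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (Exception e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```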
Lua Atomic Stock Deduction
local stockKey = KEYS[1]
local requestKey = KEYS[2]
local requestId = ARGV[1]
local ttl = tonumber(ARGV[2])
if redis.call('EXISTS', requestKey) == 1 then
return 2
end
local current = tonumber(redis.call('GET', stockKey) or '0')
if current <= 0 then
return 0
end
redis.call('DECR', stockKey)
redis.call('SET', requestKey, requestId, 'EX', ttl)
return 1

Return values: 0 = out of stock, 1 = deduction succeeded, 2 = duplicate request.
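The script's contract can be unit-tested by mirroring it in plain Java. Note that the real atomicity comes from Redis executing the Lua script single-threaded; this in-memory model only reproduces the return codes:

```java
import java.util.*;

// Sketch: the deduction contract of the Lua script, in plain Java for
// testing. Return codes match the script: 0 = out of stock,
// 1 = deduction succeeded, 2 = duplicate request.
public class StockDeduction {
    private final Map<String, Integer> stock = new HashMap<>();
    private final Set<String> seenRequests = new HashSet<>();

    public StockDeduction(String slotKey, int initial) {
        stock.put(slotKey, initial);
    }

    public synchronized int deduct(String slotKey, String requestId) {
        if (seenRequests.contains(requestId)) return 2; // duplicate request
        int current = stock.getOrDefault(slotKey, 0);
        if (current <= 0) return 0;                     // out of stock
        stock.put(slotKey, current - 1);                // decrement stock
        seenRequests.add(requestId);                    // remember requestId
        return 1;                                       // deduction succeeded
    }
}
```

The duplicate check runs before the stock check, exactly as in the script, so a retried request reports 2 rather than double-decrementing.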
Concurrency Engineering
Entry‑side throttling: gateway rate‑limit, per‑user token bucket, hotspot isolation, pre‑scale before peaks.
Model‑chain isolation: separate thread pools for pure QA vs. transaction‑heavy requests, allocate quotas for heavy vs. light models, prefer streaming responses.
Tool execution isolation: limit concurrent tool calls, circuit‑break slow tools, enforce synchronous checks for high‑risk tools.
Inventory protection: async write‑behind for appointment creation, Redis atomic decrement, compensation on failure.
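The per-user token bucket mentioned above can also be kept in-process as a cheap second line of defence behind the gateway. A sketch; capacity and refill rate are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch: per-user token bucket. Each user gets a bucket refilled at a
// fixed rate; a request consumes one token or is rejected.
public class PerUserRateLimiter {

    private static final class Bucket {
        double tokens;
        long lastRefillNanos;
    }

    private final double capacity;
    private final double refillPerSecond;
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    public PerUserRateLimiter(double capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
    }

    public boolean tryAcquire(String userId) {
        Bucket b = buckets.computeIfAbsent(userId, id -> {
            Bucket nb = new Bucket();
            nb.tokens = capacity;           // start full
            nb.lastRefillNanos = System.nanoTime();
            return nb;
        });
        synchronized (b) {
            long now = System.nanoTime();
            double elapsedSeconds = (now - b.lastRefillNanos) / 1_000_000_000.0;
            b.tokens = Math.min(capacity, b.tokens + elapsedSeconds * refillPerSecond);
            b.lastRefillNanos = now;
            if (b.tokens >= 1.0) {
                b.tokens -= 1.0;
                return true;
            }
            return false;
        }
    }
}
```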
Thread Model Recommendation
spring:
threads:
virtual:
enabled: true
server:
http2:
enabled: true

Virtual threads handle blocking tool calls without exhausting servlet workers.
Bulkhead and Timeout for LLM Calls
@Bean
public Bulkhead llmBulkhead() {
return Bulkhead.of("llm-call", BulkheadConfig.custom()
.maxConcurrentCalls(200)
.maxWaitDuration(Duration.ofMillis(100))
.build());
}
@Bean
public TimeLimiter llmTimeLimiter() {
return TimeLimiter.of(Duration.ofSeconds(8));
}

Guardrails and Risk Classification
Three‑level prompt‑injection protection: gateway text filter, system prompt with immutable safety clauses, output‑layer risk scanner.
Risk levels (Low, Medium, High, Forbidden) dictate response strategy – from normal answer to emergency advice or outright refusal.
@Component
public class MedicalAnswerGuard {
public GuardDecision evaluate(String answer, RiskLevel riskLevel) {
// High-risk answers must explicitly escalate to offline emergency care
if (riskLevel == RiskLevel.HIGH && !answer.contains("please go to the emergency department immediately")) {
return GuardDecision.block("HIGH_RISK_NOT_ESCALATED");
}
if (containsPrescriptionLikeInstruction(answer)) {
return GuardDecision.block("PRESCRIPTION_RISK");
}
return GuardDecision.pass();
}
// Simplified keyword check; a production system would use a dedicated risk model
private boolean containsPrescriptionLikeInstruction(String answer) {
return answer.matches("(?s).*\\b(take|dosage|mg per day)\\b.*");
}
}

Observability and Audit
Key metrics include QPS, first‑token latency, total response time, token usage, retrieval hit rate, tool‑call success rate, inventory deduction failures, and end‑to‑end traceability via a shared traceId.
public record AgentRequest(String traceId, String sessionId, String userId, String message) {}

Auditable fields (masked for privacy): original query, normalized query, hit knowledge segment IDs, tool name & parameters, tool result code, final answer, risk level.
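Masking before the audit log can start as a pair of regular expressions. The patterns below assume mainland-China formats (11-digit phone numbers, 18-character national IDs) and are illustrative only; a real deployment needs the rules of its own jurisdiction:

```java
// Sketch: masking personally identifiable fields before audit logging.
public class AuditMasker {

    public static String mask(String text) {
        return text
                // keep first 3 and last 2 digits of an 11-digit phone number
                .replaceAll("(?<!\\d)(\\d{3})\\d{6}(\\d{2})(?!\\d)", "$1******$2")
                // redact 18-character national ID numbers entirely
                .replaceAll("(?<![0-9Xx])\\d{17}[0-9Xx](?![0-9Xx])", "[ID-REDACTED]");
    }
}
```

The lookarounds keep the phone pattern from firing inside longer digit runs such as ID numbers.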
Deployment in a Cloud‑Native Environment
Containerize each service (agent, knowledge, appointment, governance) and scale independently.
Kubernetes HPA should consider CPU, memory, request queue length, active SSE connections, and pending tool calls.
Gray‑release by hospital, channel, user percentage, or query type to avoid full‑scale failures.
End‑to‑End Request Example
"My dad has been having chest tightness for the past two days. Please book a cardiology appointment for tomorrow morning, as early as possible."
Risk classification – chest tightness triggers high‑risk check; if accompanied by pain or dyspnea, advise emergency.
Information gathering – verify patient profile, target hospital, slot type.
Slot query via querySlots tool (structured response, no vector cache).
User confirms; system generates idempotent requestId and calls createAppointment with atomic Redis stock deduction.
Order persisted, outbox event emitted, async notifications sent.
Streaming answer combines risk advice, slot details, and confirmation request.
"Based on the symptoms you describe, cardiology is the recommended department. If the chest tightness is accompanied by chest pain, noticeable shortness of breath, or keeps getting worse, please go directly to the emergency department. I found 1 early slot tomorrow morning with Dr. Wang; the registration fee is 50 yuan. If you would like me to proceed with the booking, please confirm the patient's information."
Common Pitfalls and Mitigations
Never embed real‑time slot data in the vector store – always use a tool.
Never let the model freely construct tool parameters – enforce server‑side validation.
Route lightweight queries to fast models or FAQ cache to reduce cost.
Always record an audit trail; without it a medical AI system cannot be deployed.
Checklist Before Launch
Architecture: rate‑limit, circuit‑break, session sharing, layered knowledge.
Business: idempotent appointment, atomic stock deduction, permission checks.
AI: risk classification, refusal/hand‑off, retrieval source logging.
Operations: dashboards for all metrics, trace‑ID tracing, gray‑release, quick rollback.
Only when stability, controllability, and traceability are proven does the assistant become truly enterprise‑grade.