From Solo Demo to Cloud‑Native: Building a High‑Availability Real‑Time Translation Bot with AgentScope Java
This article walks through the complete engineering practice of turning a single‑machine demo into a cloud‑native, highly available real‑time translation robot using AgentScope Java, covering business requirements, architecture evolution, core AgentScope concepts, code examples, deployment, observability, performance tuning, and common pitfalls.
Problem Statement
Real-time translation for a multilingual customer-service desk must handle continuous audio streams, sub-second latency, context preservation, term consistency, and high concurrency (5,000+ simultaneous sessions). A simple audio → ASR → LLM → result pipeline works for a demo but fails in production: it has no event governance, it treats the LLM as a text generator rather than a task executor, and it offers no way to balance latency, accuracy, and cost.
AgentScope Java Value
AgentScope Java is an intelligent-agent runtime for Java stacks (Spring Boot, Redis, Kafka, Kubernetes). It provides a ReAct decision loop, three core abstractions (Message, State, Tool), and layered memory (working, session, long-term) for building agents that reason, call tools, and recover from failures.
Target SLA
First‑turn ASR latency < 500 ms
Single‑segment translation latency < 800 ms
End‑to‑end P95 < 2.5 s
Critical‑path availability 99.9 %
Term‑consistency > 95 %
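Because the end-to-end target is stated as a P95, it has to be checked against a percentile rather than a mean. The following is a minimal sketch, assuming latencies are collected as millisecond samples and using the nearest-rank definition of a percentile. `LatencySla` and its method names are illustrative and not part of AgentScope Java.

```java
import java.util.Arrays;

/** Illustrative helper (hypothetical name): checks latency samples against a percentile budget. */
final class LatencySla {
    private LatencySla() {}

    /** Nearest-rank percentile: the smallest sample with at least p percent of samples at or below it. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[rank - 1];
    }

    /** True when the P95 of the samples is under the budget (2,500 ms in the SLA above). */
    static boolean meetsP95(long[] samplesMs, long budgetMs) {
        return percentile(samplesMs, 95.0) < budgetMs;
    }
}
```

With 18 samples at 300 ms and 2 at 4,000 ms, the mean is 670 ms but the P95 is 4,000 ms, so a mean-based check would pass while the SLA is actually violated.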
Evolution Roadmap
Phase 1 – Single‑Machine MVP
WebSocket receives audio chunks
Streaming ASR
AgentScope Java invokes translation agent
Return text result
Pros: simple, low cost. Cons: shared process causes resource contention, state lives only in memory, no fault tolerance, no horizontal scaling.
Phase 2 – Service Splitting
gateway-service: WebSocket/HTTP/gRPC entry
asr-service: speech recognition
translation-agent-service: agent execution
tts-service: voice synthesis
session-service: session & memory management
Benefits: clear boundaries, independent scaling, asynchronous processing, replay capability.
Phase 3 – Event‑Driven Architecture
Introduce Kafka to break the synchronous chain into an event stream:
audio.chunk.received → asr.segment.ready → translation.segment.requested → translation.segment.completed → tts.segment.requested → tts.segment.completed
Benefits: peak shaving, service decoupling, retryable failures, offline analysis.
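One way to keep this chain explicit in code is to model each event together with its successor, so every stage knows what to emit on success. The enum below is a hypothetical sketch: the event names are copied from the chain above, and everything else is an assumption.

```java
import java.util.Optional;

/** Hypothetical model of the event chain; event names are copied from the chain above. */
enum PipelineEvent {
    AUDIO_CHUNK_RECEIVED("audio.chunk.received"),
    ASR_SEGMENT_READY("asr.segment.ready"),
    TRANSLATION_SEGMENT_REQUESTED("translation.segment.requested"),
    TRANSLATION_SEGMENT_COMPLETED("translation.segment.completed"),
    TTS_SEGMENT_REQUESTED("tts.segment.requested"),
    TTS_SEGMENT_COMPLETED("tts.segment.completed");

    private final String eventName;

    PipelineEvent(String eventName) {
        this.eventName = eventName;
    }

    String eventName() {
        return eventName;
    }

    /** The event a successful stage emits next; empty for the terminal event. */
    Optional<PipelineEvent> next() {
        int i = ordinal() + 1;
        return i < values().length ? Optional.of(values()[i]) : Optional.empty();
    }
}
```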
Phase 4 – Cloud‑Native High‑Availability
Kubernetes deployment
Redis for session state
Kafka for event streams
Config center (Nacos, etc.)
Sentinel/Resilience4j for rate‑limiting and circuit‑breaking
OpenTelemetry + Prometheus + Grafana + Jaeger for observability
Dual‑Channel Design
Two parallel pipelines serve different needs:
Streaming path: ultra-low latency for subtitles, live chat, conference interpretation.
Async path: high-throughput batch processing for long recordings or bulk jobs.
Mixing both in a single synchronous service leads to resource contention and degraded performance.
AgentScope Core Principles
ReAct Decision Loop
Thought → Action → Observation → Thought → Final Answer
In translation, the loop decides the language pair, queries term-bases, calls the LLM, and falls back when confidence is low.
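The loop can be pictured as a small state machine with a step budget. The sketch below does not use AgentScope Java's API. `ReActSketch`, `Step`, `Policy`, and `runLoop` are hypothetical names that only illustrate the Thought → Action → Observation cycle and the fallback when the budget runs out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical illustration of a ReAct-style loop; not the AgentScope Java API. */
final class ReActSketch {
    private ReActSketch() {}

    /** A decision: either call a named tool with an input, or finish with an answer. */
    record Step(String tool, String input, String finalAnswer) {
        static Step act(String tool, String input) { return new Step(tool, input, null); }
        static Step finish(String answer) { return new Step(null, null, answer); }
        boolean isFinal() { return finalAnswer != null; }
    }

    /** The "Thought" stage: inspects the observations so far and picks the next step. */
    interface Policy {
        Step think(String userText, List<String> observations);
    }

    static String runLoop(String userText, Policy policy,
                          Map<String, Function<String, String>> tools, int maxSteps) {
        List<String> observations = new ArrayList<>();
        for (int i = 0; i < maxSteps; i++) {
            Step step = policy.think(userText, observations);       // Thought
            if (step.isFinal()) {
                return step.finalAnswer();                          // Final Answer
            }
            Function<String, String> tool = tools.get(step.tool());
            String observation = (tool == null)
                    ? "unknown tool: " + step.tool()
                    : tool.apply(step.input());                     // Action
            observations.add(step.tool() + " -> " + observation);   // Observation
        }
        // Step budget exhausted: degrade instead of looping forever.
        return "[fallback] " + userText;
    }
}
```

The step limit matters in a latency-bound pipeline: without it, a policy that keeps requesting tools would hold the segment indefinitely.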
Message, State, Tool
Message is a unified JSON envelope carrying text, audio metadata, system commands and tool results. Example:
{
  "sessionId": "sess-1001",
  "segmentId": "seg-19",
  "role": "user",
  "sourceLang": "es",
  "targetLang": "zh",
  "text": "Necesito cambiar la direccion de entrega.",
  "finalSegment": true
}
State captures session-wide data such as language pair, recent N segments, glossary version, tenant ID, degradation level and last tool result. It must be externalized (Redis/MySQL) to survive restarts.
Tool represents LLM‑driven, controllable operations (e.g., detectLanguage, lookupTermbase, loadConversationContext, escalateHumanSupport, auditTranslation) with timeout, auth, idempotency and degradation handling.
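Timeout and degradation handling for a tool call can be wrapped in one place. The sketch below uses only `CompletableFuture` from the JDK. `GuardedTool` is a hypothetical helper, not AgentScope Java's tool interface, and it leaves auth and idempotency aside.

```java
import java.time.Duration;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.Supplier;

/** Hypothetical wrapper that bounds a tool call with a timeout and a degraded result. */
final class GuardedTool {
    private GuardedTool() {}

    /**
     * Runs the tool asynchronously. If it throws or exceeds the timeout,
     * the value produced by the degrade function is returned instead.
     */
    static <T> T call(Supplier<T> tool, Duration timeout, Function<Throwable, T> degrade) {
        return CompletableFuture.supplyAsync(tool)
                .orTimeout(timeout.toMillis(), TimeUnit.MILLISECONDS)
                .exceptionally(degrade)
                .join();
    }
}
```

Note that `orTimeout` only completes the future exceptionally. The underlying task keeps running on its worker thread, so a real tool client should also enforce its own network timeout.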
Memory Layers
Working memory: per-request temporary state inside the reactor context.
Session memory: recent dialogue stored in Redis for continuity.
Long-term memory: user preferences, brand-specific terms, tenant glossaries stored in MySQL or vector stores.
Example: when a user mentions “Prime Day”, the term is cached in session memory so subsequent segments reuse the same translation.
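The "Prime Day" behaviour amounts to pinning the first translation of a term for the rest of the session. The sketch below keeps the pinned terms in a local map for clarity. In the architecture described here that map would live in Redis. `SessionTermMemory` is a hypothetical name.

```java
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical session-scoped term cache; a local stand-in for session memory in Redis. */
final class SessionTermMemory {
    private final Map<String, String> pinned = new ConcurrentHashMap<>();

    /** Returns the translation already used in this session, or computes and pins a new one. */
    String translateTerm(String sourceTerm, Function<String, String> translator) {
        String key = sourceTerm.toLowerCase(Locale.ROOT); // case-insensitive match on the source term
        return pinned.computeIfAbsent(key, k -> translator.apply(sourceTerm));
    }
}
```

Once a term is pinned, later segments reuse it even if the translator would now produce a different rendering.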
Engineering Details
Session State Record (Java 17)
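import java.time.Instant;
import java.util.List;
import java.util.Map;

// Immutable snapshot of one translation session; serialized to Redis as JSON.
public record TranslationSessionState(
        String sessionId,
        String tenantId,
        String userId,
        String sourceLang,
        String targetLang,
        String glossaryVersion,
        int degradeLevel,
        List<String> recentSegments,
        Map<String, String> attributes,
        Instant updatedAt
) {}
Redis Repository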
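import com.fasterxml.jackson.databind.ObjectMapper;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Repository;

@Repository
public class TranslationSessionRepository {
    private final StringRedisTemplate redisTemplate;
    private final ObjectMapper objectMapper;
    public TranslationSessionRepository(StringRedisTemplate redisTemplate, ObjectMapper objectMapper) {
        this.redisTemplate = redisTemplate;
        this.objectMapper = objectMapper;
    }
    // findBySessionId, save, key generation omitted for brevity
    private String key(String sessionId) { return "translation:session:" + sessionId; }
}
Key design: translation:session:{sessionId}. JSON serialization enables debugging and replay.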
High‑Concurrency Design
Audio split into 200‑1000 ms chunks (≈500 ms typical for live chat).
Separate thread pools: acceptor (connections), audio processing, agent execution, callback.
Back‑pressure: pause intake when session buffer exceeds threshold; trigger lightweight degradation when agent queues grow; fallback to text‑only when TTS unavailable.
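The first back-pressure rule can be sketched as a bounded per-session buffer with a high-water mark. The class below is a hypothetical illustration: the thresholds, the `Signal` values, and the class name are assumptions, and the caller decides how to pause the WebSocket intake.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical per-session buffer that tells the caller to pause intake above a high-water mark. */
final class SessionAudioBuffer {
    enum Signal { ACCEPTED, PAUSE_INTAKE, REJECTED }

    private final Deque<byte[]> chunks = new ArrayDeque<>();
    private final int highWaterMark;
    private final int capacity;

    SessionAudioBuffer(int highWaterMark, int capacity) {
        if (highWaterMark <= 0 || highWaterMark > capacity) {
            throw new IllegalArgumentException("need 0 < highWaterMark <= capacity");
        }
        this.highWaterMark = highWaterMark;
        this.capacity = capacity;
    }

    /** Stores the chunk unless the hard capacity is reached; signals a pause at the high-water mark. */
    synchronized Signal offer(byte[] chunk) {
        if (chunks.size() >= capacity) {
            return Signal.REJECTED;
        }
        chunks.addLast(chunk);
        return chunks.size() >= highWaterMark ? Signal.PAUSE_INTAKE : Signal.ACCEPTED;
    }

    /** Removes and returns the oldest chunk, or null when the buffer is empty. */
    synchronized byte[] poll() {
        return chunks.pollFirst();
    }

    /** True once the backlog has dropped below the high-water mark again. */
    synchronized boolean canResume() {
        return chunks.size() < highWaterMark;
    }
}
```

Keeping a gap between the high-water mark and the hard capacity gives the client time to react to the pause signal before chunks are dropped.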
Kafka Topics (minimal design)
audio-chunk-topic – received audio chunks
asr-segment-topic – stable transcription results
translation-request-topic – pending translation segments
translation-result-topic – completed translations
tts-request-topic – pending TTS synthesis
audit-event-topic – audit, tracing, replay events
Use sessionId as the Kafka key to guarantee ordering per conversation.
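Keying by sessionId preserves order because every record with the same key is routed to the same partition, and Kafka guarantees order only within a partition. The sketch below illustrates the idea with `String.hashCode()`. That is a simplification: Kafka's default partitioner hashes the serialized key with murmur2. `SessionPartitioning` is a hypothetical name.

```java
/** Hypothetical illustration of key-based partition selection (simplified hash, not Kafka's murmur2). */
final class SessionPartitioning {
    private SessionPartitioning() {}

    /** Maps a session to a partition; the same sessionId always yields the same partition. */
    static int partitionFor(String sessionId, int numPartitions) {
        if (numPartitions <= 0) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        // floorMod keeps the result in [0, numPartitions) even for negative hash codes.
        return Math.floorMod(sessionId.hashCode(), numPartitions);
    }
}
```

Two consequences follow from this mapping: changing the partition count remaps existing sessions, and a single very busy session cannot be spread across partitions.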
Translation Agent Service
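import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import org.springframework.stereotype.Service;

@Service
public class TranslationAgentService {
    private final TranslationSessionRepository sessionRepository;
    private final GlossaryTool glossaryTool;
    private final LlmTranslationClient llmClient;
    private final FastTranslationClient fastClient;
    private final AuditPublisher auditPublisher;

    public TranslationAgentService(TranslationSessionRepository sessionRepository,
                                   GlossaryTool glossaryTool,
                                   LlmTranslationClient llmClient,
                                   FastTranslationClient fastClient,
                                   AuditPublisher auditPublisher) {
        this.sessionRepository = sessionRepository;
        this.glossaryTool = glossaryTool;
        this.llmClient = llmClient;
        this.fastClient = fastClient;
        this.auditPublisher = auditPublisher;
    }

    public TranslationResult translate(TranslationCommand cmd) {
        // 1. Restore session state, or start a fresh session.
        TranslationSessionState state = sessionRepository.findBySessionId(cmd.sessionId())
                .orElseGet(() -> TranslationSessionStateFactory.initialize(cmd));
        // 2. Gather glossary terms and the trimmed context window (last 8 segments).
        Map<String, String> glossary = glossaryTool.lookup(
                cmd.tenantId(), cmd.sourceLang(), cmd.targetLang(), cmd.text());
        List<String> context = ContextWindowUtil.appendAndTrim(
                state.recentSegments(), cmd.text(), 8);
        TranslationRequest request = new TranslationRequest(
                cmd.tenantId(), cmd.sessionId(), cmd.sourceLang(), cmd.targetLang(),
                cmd.text(), context, glossary);
        // 3. Prefer the LLM; any failure degrades to the fast engine.
        TranslationResult result;
        try {
            result = llmClient.translate(request);
        } catch (Exception e) {
            result = fastClient.translate(request);
        }
        // 4. Persist the updated state with a 30-minute TTL and emit an audit event.
        TranslationSessionState newState = new TranslationSessionState(
                state.sessionId(), state.tenantId(), state.userId(),
                cmd.sourceLang(), cmd.targetLang(), state.glossaryVersion(),
                state.degradeLevel(), context, state.attributes(), Instant.now());
        sessionRepository.save(newState, Duration.ofMinutes(30));
        auditPublisher.publish(cmd, result, glossary);
        return result;
    }
}
Principles: restore session state first, prefer the high-quality LLM and fall back to fast translation on error, and emit an audit event for replay.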
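The service relies on `ContextWindowUtil.appendAndTrim`, whose body is not shown. A plausible reading, given the call with a limit of 8, is "append the new segment and keep only the most recent N". The sketch below implements that assumption under a different class name so it is not mistaken for the original helper.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for the context-window helper; the behaviour is an assumption. */
final class ContextWindowSketch {
    private ContextWindowSketch() {}

    /** Returns a new immutable list: the recent segments plus the new one, capped at maxSegments. */
    static List<String> appendAndTrim(List<String> recent, String segment, int maxSegments) {
        if (maxSegments <= 0) {
            throw new IllegalArgumentException("maxSegments must be positive");
        }
        List<String> base = (recent == null) ? List.<String>of() : recent;
        List<String> merged = new ArrayList<>(base);
        merged.add(segment);
        int from = Math.max(0, merged.size() - maxSegments);
        // Copy the tail so the caller's list is never mutated or aliased.
        return List.copyOf(merged.subList(from, merged.size()));
    }
}
```

Returning a fresh immutable list fits the record-based session state, which is rebuilt on every segment rather than modified in place.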
WebSocket Handler (thin entry layer)
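import java.nio.ByteBuffer;
import org.springframework.stereotype.Component;
import org.springframework.web.socket.BinaryMessage;
import org.springframework.web.socket.CloseStatus;
import org.springframework.web.socket.WebSocketSession;
import org.springframework.web.socket.handler.BinaryWebSocketHandler;

@Component
public class TranslationWebSocketHandler extends BinaryWebSocketHandler {
    private final AudioChunkDispatcher dispatcher;
    private final SessionRegistry registry;

    public TranslationWebSocketHandler(AudioChunkDispatcher dispatcher, SessionRegistry registry) {
        this.dispatcher = dispatcher;
        this.registry = registry;
    }

    @Override
    public void afterConnectionEstablished(WebSocketSession session) {
        registry.register(session);
    }

    @Override
    protected void handleBinaryMessage(WebSocketSession session, BinaryMessage message) {
        // Copy only the readable bytes: array() throws on read-only or direct buffers
        // and ignores the buffer's position and limit.
        ByteBuffer payload = message.getPayload();
        byte[] audio = new byte[payload.remaining()];
        payload.get(audio);
        AudioChunk chunk = AudioChunk.from(session, audio);
        dispatcher.dispatch(chunk);
    }

    @Override
    public void afterConnectionClosed(WebSocketSession session, CloseStatus status) {
        registry.unregister(session.getId());
    }
}
The handler only manages connections, parses the protocol and forwards audio chunks; it never calls ASR, translation or TTS directly.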
Kafka Event Publishing & Consumption
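import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class TranslationEventPublisher {
    private final KafkaTemplate<String, Object> kafka;

    public TranslationEventPublisher(KafkaTemplate<String, Object> kafka) {
        this.kafka = kafka;
    }

    public void publishTranslationRequest(TranslationCommand cmd) {
        // sessionId is the record key, so one conversation stays on one partition.
        kafka.send("translation-request-topic", cmd.sessionId(), cmd);
    }
}

@Component
public class TranslationRequestConsumer {
    private final TranslationAgentService agentService;
    private final TranslationResultPushService pushService;

    public TranslationRequestConsumer(TranslationAgentService agentService,
                                      TranslationResultPushService pushService) {
        this.agentService = agentService;
        this.pushService = pushService;
    }

    @KafkaListener(topics = "translation-request-topic", groupId = "translation-agent-group")
    public void consume(TranslationCommand cmd) {
        // Simplified: translate() runs on the listener thread here. Offload slow calls
        // in production so the consumer is not blocked.
        TranslationResult result = agentService.translate(cmd);
        pushService.push(result);
    }
}
Key points: use sessionId as the Kafka key, keep consumer logic non-blocking, and decouple result push from execution.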
Fallback Service
import java.time.Instant;
import org.springframework.stereotype.Service;

@Service
public class TranslationFallbackService {
    // Builds a clearly marked degraded result when the primary path fails.
    public TranslationResult fallback(TranslationCommand cmd, Throwable t) {
        return new TranslationResult(
                cmd.sessionId(), cmd.segmentId(), cmd.sourceLang(), cmd.targetLang(),
                "[System switched to fast translation] " + cmd.text(),
                "FAST_FALLBACK", Instant.now());
    }
}
Typical degradations: LLM timeout → fast MT; TTS unavailable → text-only; term-base timeout → cached terms.
Observability & Governance
Metrics (Prometheus)
translation_request_total
translation_latency_ms
translation_fallback_total
asr_segment_latency_ms
tts_first_audio_latency_ms
active_session_count
kafka_consumer_lag
Tracing
Use sessionId + segmentId as the trace identifier and instrument each stage: WebSocket entry, audio chunking, ASR, term‑lookup, LLM, TTS, result push.
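If sessionId + segmentId is to serve as the trace identifier itself, it has to fit the tracer's ID format. OpenTelemetry trace IDs are 16 bytes, written as 32 lowercase hex characters. One possible approach, shown here as an assumption rather than the article's method, is to derive the ID deterministically by hashing the pair. `SegmentTraceId` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/** Hypothetical helper: derives a stable 32-hex-character trace ID from session and segment IDs. */
final class SegmentTraceId {
    private SegmentTraceId() {}

    static String of(String sessionId, String segmentId) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest((sessionId + ":" + segmentId).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (int i = 0; i < 16; i++) {               // keep the first 16 bytes of the hash
                hex.append(String.format("%02x", digest[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

A deterministic ID means retries of the same segment land in the same trace, which helps replay. If that is not wanted, a simpler alternative is to keep generated trace IDs and attach sessionId and segmentId as span attributes.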
Logging & Audit
User input summary
Tool invocation records
Degradation decision reasons
Key state snapshots
Final output
Separate business logs (debugging) from audit logs (replay, compliance).
Performance Optimization Checklist
Prioritized Items
Validate audio segment size (≈500 ms for live chat).
Enable connection reuse for downstream model calls.
Reduce serialization overhead of session state.
Monitor and control Kafka lag during peaks.
Detect hotspot sessions and balance them across nodes.
High‑Value Techniques
Context window trimming: keep only the latest 5-8 segments, summarize older dialogue, store structured facts separately.
Dual-layer translation: fast baseline model for most segments, LLM enhancement for low-confidence or critical turns.
Connection pooling & async calls: HTTP pool, clear timeouts, concurrency caps, avoid blocking the consumer pipeline.
Tenant isolation: dedicated Kafka topics, consumer groups, and resource quotas for large customers.
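Dual-layer translation reduces to a routing decision per segment. The sketch below assumes the fast model reports a confidence score between 0 and 1 and that the caller can flag a turn as critical. Both are assumptions, as are the names `DualLayerTranslator` and `Draft`.

```java
import java.util.function.Function;

/** Hypothetical router: fast model by default, LLM for critical or low-confidence segments. */
final class DualLayerTranslator {
    /** Output of the fast baseline model, with its self-reported confidence in [0, 1]. */
    record Draft(String text, double confidence) {}

    private final Function<String, Draft> fastModel;
    private final Function<String, String> llm;
    private final double minConfidence;

    DualLayerTranslator(Function<String, Draft> fastModel,
                        Function<String, String> llm,
                        double minConfidence) {
        this.fastModel = fastModel;
        this.llm = llm;
        this.minConfidence = minConfidence;
    }

    String translate(String source, boolean criticalTurn) {
        if (criticalTurn) {
            return llm.apply(source);              // critical turns skip the baseline entirely
        }
        Draft draft = fastModel.apply(source);
        if (draft.confidence() >= minConfidence) {
            return draft.text();                   // cheap path for most segments
        }
        return llm.apply(source);                  // low confidence: escalate to the LLM
    }
}
```

The threshold is the main cost lever: raising it sends more segments to the LLM, which improves quality at the price of latency and spend.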
Common Pitfalls
Focusing on average latency while ignoring P95/P99 spikes.
Storing long‑connection state only in pod memory – leads to loss on restart.
Routing every request through a high‑cost LLM – causes cost explosion and timeouts.
Missing governance on tool calls – the agent merely shifts the bottleneck downstream.
Lack of replay capability – impossible to debug translation errors without stored events.
Recommended Rollout Path
Build a single‑machine MVP (WebSocket + ASR + Agent translation) and validate latency and term strategy.
Add engineering foundations: Redis session store, Kafka event bus, degradation & retry, metrics & tracing.
Split services, conduct load‑testing, identify P95 bottlenecks.
Migrate to Kubernetes, configure HPA, central config, and implement gray‑release with rollback.
Cloud‑Native Deployment
Containerization Principles
Build separate Docker images for gateway, translation‑agent, ASR, TTS, and session‑admin services.
Avoid monolithic images to keep startup fast, enable role‑based resource limits, and simplify GPU/CPU separation.
Kubernetes Strategy
Separate node pools: CPU‑only for gateway/agent/session services; GPU nodes for ASR/TTS or heavy inference.
HPA based on CPU, custom queue length, Kafka lag, and active WebSocket connections.
PodDisruptionBudget and preStop hooks to gracefully drain long‑lived WebSocket connections before shutdown.
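To choose sensible HPA targets for metrics such as Kafka lag or active connections, it helps to know how the autoscaler turns a metric into a replica count. The documented core formula is desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). The sketch below reproduces only that formula plus min/max clamping. The real controller also applies a tolerance band and stabilization windows, which are ignored here, and `HpaMath` is a hypothetical name.

```java
/** Hypothetical calculator for the core HPA scaling formula (tolerance and stabilization omitted). */
final class HpaMath {
    private HpaMath() {}

    static int desiredReplicas(int currentReplicas, double currentMetric, double targetMetric,
                               int minReplicas, int maxReplicas) {
        if (targetMetric <= 0 || minReplicas > maxReplicas) {
            throw new IllegalArgumentException("need targetMetric > 0 and minReplicas <= maxReplicas");
        }
        int raw = (int) Math.ceil(currentReplicas * (currentMetric / targetMetric));
        return Math.max(minReplicas, Math.min(maxReplicas, raw)); // clamp to the configured bounds
    }
}
```

For example, 3 replicas with an average lag of 6,000 messages per pod against a target of 2,000 would scale to 9, or to the configured maximum if that is lower.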
Sample Deployment (YAML)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: translation-agent-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: translation-agent-service
  template:
    metadata:
      labels:
        app: translation-agent-service
    spec:
      containers:
        - name: app
          image: registry.example.com/translation-agent-service:1.0.0
          ports:
            - containerPort: 8080
          env:
            - name: REDIS_HOST
              value: redis-master
            - name: KAFKA_BOOTSTRAP_SERVERS
              value: kafka:9092
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "4Gi"
Ray's Galactic Tech
