From Solo Demo to Cloud‑Native: Building a High‑Availability Real‑Time Translation Bot with AgentScope Java

This article walks through the engineering work of turning a single‑machine demo into a cloud‑native, highly available real‑time translation bot with AgentScope Java. It covers business requirements, architecture evolution, core AgentScope concepts, code examples, deployment, observability, performance tuning, and common pitfalls.

Ray's Galactic Tech

Apr 15, 2026

Problem Statement

Real‑time translation for a multilingual customer‑service desk must handle continuous audio streams, sub‑second latency, context preservation, term consistency, and high concurrency (5,000+ simultaneous sessions). A simple audio → ASR → LLM → result pipeline works for a demo but breaks down in production: it has no way to govern the event flow (ordering, retries, replay), it treats the LLM as a text generator rather than a task executor that can call tools, and it offers no lever for trading off latency, accuracy and cost.

AgentScope Java Value

AgentScope Java is an intelligent‑agent runtime for Java stacks (Spring Boot, Redis, Kafka, Kubernetes). It provides a ReAct decision loop, three core abstractions (Message, State, Tool) and layered memory (working, session, long‑term) for building agents that reason, call tools and recover from failures.

Target SLA

First‑turn ASR latency < 500 ms

Single‑segment translation latency < 800 ms

End‑to‑end P95 < 2.5 s

Critical‑path availability 99.9 %

Term‑consistency > 95 %
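The P95 target above can be checked against measured latencies with a few lines of code. The following is a minimal sketch, assuming latency samples are already collected in milliseconds; the class name LatencySla and the nearest‑rank percentile method are illustrative choices, not something the original design prescribes.

```java
import java.util.Arrays;

/** Nearest-rank percentile over latency samples; a sketch for checking the targets above. */
final class LatencySla {
    private LatencySla() {}

    /** Returns the nearest-rank percentile (0 < p <= 100) of the given samples. */
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) throw new IllegalArgumentException("no samples");
        if (p <= 0 || p > 100) throw new IllegalArgumentException("p must be in (0, 100]");
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        // 1-based rank of the smallest sample that covers p percent of the data
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    /** True when the P95 of the samples is under the 2.5 s end-to-end target. */
    static boolean meetsEndToEndTarget(long[] samplesMs) {
        return percentile(samplesMs, 95) < 2_500;
    }
}
```

For the samples 100, 200, 300, 400 and 5000 ms the mean is 1200 ms, yet the P95 is 5000 ms and the target is missed, which is why the SLA is stated as a percentile rather than an average.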

Evolution Roadmap

Phase 1 – Single‑Machine MVP

WebSocket receives audio chunks

Streaming ASR

AgentScope Java invokes translation agent

Return text result

Pros: simple, low cost. Cons: shared process causes resource contention, state lives only in memory, no fault tolerance, no horizontal scaling.

Phase 2 – Service Splitting

gateway-service: WebSocket/HTTP/gRPC entry

asr-service: speech recognition

translation-agent-service: agent execution

tts-service: voice synthesis

session-service: session & memory management

Benefits: clear boundaries, independent scaling, asynchronous processing, replay capability.

Phase 3 – Event‑Driven Architecture

Introduce Kafka to break the synchronous chain into an event stream:

audio.chunk.received → asr.segment.ready → translation.segment.requested → translation.segment.completed → tts.segment.requested → tts.segment.completed

Benefits: peak shaving, service decoupling, retryable failures, offline analysis.
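One way to keep the event chain explicit in code is to model it as an enum, so each service can ask which event follows the one it just handled. This is only a sketch: the enum name PipelineEvent and its helper methods are assumptions for illustration, while the event names are the ones listed above.

```java
import java.util.Optional;

/** The event chain from the text as an enum; each stage knows the event that follows it. */
enum PipelineEvent {
    AUDIO_CHUNK_RECEIVED("audio.chunk.received"),
    ASR_SEGMENT_READY("asr.segment.ready"),
    TRANSLATION_SEGMENT_REQUESTED("translation.segment.requested"),
    TRANSLATION_SEGMENT_COMPLETED("translation.segment.completed"),
    TTS_SEGMENT_REQUESTED("tts.segment.requested"),
    TTS_SEGMENT_COMPLETED("tts.segment.completed");

    private final String eventName;

    PipelineEvent(String eventName) { this.eventName = eventName; }

    String eventName() { return eventName; }

    /** Next event in the chain, or empty for the terminal event. */
    Optional<PipelineEvent> next() {
        PipelineEvent[] all = values();
        return ordinal() + 1 < all.length ? Optional.of(all[ordinal() + 1]) : Optional.empty();
    }

    /** Looks up an event by its wire name. */
    static Optional<PipelineEvent> fromName(String name) {
        for (PipelineEvent e : values()) {
            if (e.eventName.equals(name)) return Optional.of(e);
        }
        return Optional.empty();
    }
}
```

Relying on declaration order keeps the sketch short; a real pipeline with branches (for example, text‑only sessions that skip TTS) would need an explicit transition table instead.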

Phase 4 – Cloud‑Native High‑Availability

Kubernetes deployment

Redis for session state

Kafka for event streams

Config center (Nacos, etc.)

Sentinel/Resilience4j for rate‑limiting and circuit‑breaking

OpenTelemetry + Prometheus + Grafana + Jaeger for observability

Dual‑Channel Design

Two parallel pipelines serve different needs:

Streaming path: ultra‑low latency for subtitles, live chat, conference interpretation.

Async path: high‑throughput batch processing for long recordings or bulk jobs.

Mixing both in a single synchronous service leads to resource contention and degraded performance.

AgentScope Core Principles

ReAct Decision Loop

Thought → Action → Observation → Thought → Final Answer

In translation the loop decides language pair, queries term‑bases, calls the LLM, and falls back when confidence is low.
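To make the loop concrete, here is a toy version in plain Java. It is not the AgentScope Java API: the types Step, Policy and ToolExecutor are hypothetical stand‑ins, and the "thought" is reduced to a function from past observations to the next step. The step cap is an added safeguard so a confused policy cannot loop forever.

```java
import java.util.ArrayList;
import java.util.List;

/** A toy ReAct-style loop: decide, act, observe, repeat until a final answer or a step cap. */
final class ReActLoopSketch {
    private ReActLoopSketch() {}

    /** One decision: either call a named tool with an input, or finish with an answer. */
    record Step(String tool, String input, String finalAnswer) {
        static Step call(String tool, String input) { return new Step(tool, input, null); }
        static Step finish(String answer) { return new Step(null, null, answer); }
        boolean isFinal() { return finalAnswer != null; }
    }

    /** Hypothetical policy: maps the observations so far to the next step. */
    interface Policy { Step decide(List<String> observations); }

    /** Hypothetical tool executor: runs a tool call and returns an observation string. */
    interface ToolExecutor { String run(String tool, String input); }

    static String run(Policy policy, ToolExecutor tools, int maxSteps) {
        List<String> observations = new ArrayList<>();
        for (int i = 0; i < maxSteps; i++) {
            Step step = policy.decide(observations);                    // Thought
            if (step.isFinal()) return step.finalAnswer();              // Final Answer
            String observation = tools.run(step.tool(), step.input());  // Action
            observations.add(step.tool() + "=" + observation);          // Observation
        }
        throw new IllegalStateException("no final answer within " + maxSteps + " steps");
    }
}
```

In the translation case the policy would first call a language‑detection tool, then a term‑base lookup, and only then produce the final translation.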

Message, State, Tool

Message is a unified JSON envelope carrying text, audio metadata, system commands and tool results. Example:

{
  "sessionId": "sess-1001",
  "segmentId": "seg-19",
  "role": "user",
  "sourceLang": "es",
  "targetLang": "zh",
  "text": "Necesito cambiar la direccion de entrega.",
  "finalSegment": true
}

State captures session‑wide data such as language pair, recent N segments, glossary version, tenant ID, degradation level and last tool result. It must be externalized (Redis/MySQL) to survive restarts.

Tool represents LLM‑driven, controllable operations (e.g., detectLanguage, lookupTermbase, loadConversationContext, escalateHumanSupport, auditTranslation) with timeout, auth, idempotency and degradation handling.
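Of the tool properties listed above, idempotency is the easiest to get wrong when events are redelivered. The sketch below caches a tool result per idempotency key so a retried segment does not run the tool twice. The class IdempotentToolInvoker and the key layout (sessionId, segmentId, tool name) are assumptions for illustration; a production version would more likely keep the results in Redis with a TTL rather than in process memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

/** Caches tool results per idempotency key so a redelivered segment does not re-run the tool. */
final class IdempotentToolInvoker {
    // In-memory only for the sketch; entries are never evicted here.
    private final Map<String, String> results = new ConcurrentHashMap<>();

    /** Assumption: sessionId + segmentId + tool name identify one logical call. */
    static String key(String sessionId, String segmentId, String toolName) {
        return sessionId + ":" + segmentId + ":" + toolName;
    }

    /** Runs the tool at most once per key; later calls with the same key reuse the result. */
    String invoke(String idempotencyKey, Supplier<String> tool) {
        return results.computeIfAbsent(idempotencyKey, k -> tool.get());
    }
}
```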

Memory Layers

Working memory: per‑request temporary state inside the reactor context.

Session memory: recent dialogue stored in Redis for continuity.

Long‑term memory: user preferences, brand‑specific terms, tenant glossaries stored in MySQL or vector stores.

Example: when a user mentions “Prime Day”, the term is cached in session memory so subsequent segments reuse the same translation.
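The term‑pinning behaviour in that example could look roughly like this. It is a sketch under assumptions: SessionTermMemory is a hypothetical class, it lives in process memory rather than Redis, and it is not thread‑safe; the translator function stands in for whatever engine produces the first translation.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

/** Pins the first translation chosen for a term so later segments in the session reuse it. */
final class SessionTermMemory {
    private final Map<String, String> pinned = new LinkedHashMap<>();

    /** Returns the pinned translation, computing and storing it on first use. */
    String translateTerm(String term, Function<String, String> translator) {
        // Case-insensitive key so "Prime Day" and "prime day" share one entry
        String key = term.toLowerCase(Locale.ROOT);
        return pinned.computeIfAbsent(key, k -> translator.apply(term));
    }

    /** Immutable copy of the pinned terms, e.g. for persisting with the session state. */
    Map<String, String> snapshot() { return Map.copyOf(pinned); }
}
```

Once a term is pinned, later calls ignore the translator they are given, which is what keeps the wording stable across segments.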

Engineering Details

Session State Record (Java 17)

import java.time.Instant;
import java.util.List;
import java.util.Map;

public record TranslationSessionState(
    String sessionId,
    String tenantId,
    String userId,
    String sourceLang,
    String targetLang,
    String glossaryVersion,
    int degradeLevel,
    List<String> recentSegments,
    Map<String,String> attributes,
    Instant updatedAt
) {}

Redis Repository

@Repository
public class TranslationSessionRepository {
    private final StringRedisTemplate redisTemplate;
    private final ObjectMapper objectMapper;
    // constructor injection, findBySessionId and save omitted for brevity
    private String key(String sessionId) { return "translation:session:" + sessionId; }
}

Key design: translation:session:{sessionId}. JSON serialization enables debugging and replay.

High‑Concurrency Design

Audio is split into 200‑1000 ms chunks (≈500 ms is typical for live chat).

Separate thread pools: acceptor (connections), audio processing, agent execution, callback.

Back‑pressure: pause intake when a session buffer exceeds its threshold; trigger lightweight degradation when agent queues grow; fall back to text‑only output when TTS is unavailable.
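The back‑pressure rule can be sketched with a bounded per‑session buffer that tells the caller what to do as it fills up. The class SessionAudioBuffer, the Admission values and the two thresholds are assumptions made for this example, not part of the original design.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Per-session bounded buffer: signals the caller to pause intake or degrade as it fills. */
final class SessionAudioBuffer {
    enum Admission { ACCEPTED, ACCEPTED_DEGRADE, REJECTED_PAUSE_INTAKE }

    private final BlockingQueue<byte[]> chunks;
    private final int degradeThreshold;

    SessionAudioBuffer(int capacity, int degradeThreshold) {
        if (degradeThreshold > capacity) {
            throw new IllegalArgumentException("degradeThreshold must not exceed capacity");
        }
        this.chunks = new ArrayBlockingQueue<>(capacity);
        this.degradeThreshold = degradeThreshold;
    }

    /** Non-blocking offer, so the thread receiving WebSocket frames is never parked. */
    Admission offer(byte[] chunk) {
        if (!chunks.offer(chunk)) return Admission.REJECTED_PAUSE_INTAKE;
        return chunks.size() >= degradeThreshold ? Admission.ACCEPTED_DEGRADE : Admission.ACCEPTED;
    }

    /** Next chunk for the audio worker, or null when the buffer is empty. */
    byte[] poll() { return chunks.poll(); }
}
```

The gateway would react to REJECTED_PAUSE_INTAKE by asking the client to pause, and to ACCEPTED_DEGRADE by switching the session to a cheaper path before the buffer is full.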

Kafka Topics (minimal design)

audio-chunk-topic          – received audio chunks
asr-segment-topic          – stable transcription results
translation-request-topic  – pending translation segments
translation-result-topic   – completed translations
tts-request-topic          – pending TTS synthesis
audit-event-topic          – audit, tracing, replay events

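Use sessionId as the Kafka key so that all events of a conversation land on the same partition. Kafka guarantees ordering only within a partition, so per‑conversation ordering holds as long as the topic's partition count does not change.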
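Why a stable key gives per‑conversation ordering can be seen from how a key is mapped to a partition. The sketch below is a simplified stand‑in: Kafka's default partitioner hashes the serialized key with murmur2, whereas this example uses a plain polynomial hash, and the class name SessionPartitioner is hypothetical. The property that matters is the same: equal keys always map to the same partition for a fixed partition count.

```java
import java.nio.charset.StandardCharsets;

/** Illustrates key-based partitioning: the same key always maps to the same partition. */
final class SessionPartitioner {
    private SessionPartitioner() {}

    /** Simplified key hashing; not Kafka's actual murmur2-based implementation. */
    static int partitionFor(String sessionId, int numPartitions) {
        if (numPartitions <= 0) throw new IllegalArgumentException("numPartitions must be > 0");
        int hash = 0;
        for (byte b : sessionId.getBytes(StandardCharsets.UTF_8)) {
            hash = 31 * hash + b;
        }
        // Mask the sign bit so the modulo result is never negative
        return (hash & 0x7fffffff) % numPartitions;
    }
}
```

Because the result depends on numPartitions, adding partitions to a live topic can move a session to a different partition mid‑conversation.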

Translation Agent Service

@Service
public class TranslationAgentService {
    private final TranslationSessionRepository sessionRepository;
    private final GlossaryTool glossaryTool;
    private final LlmTranslationClient llmClient;
    private final FastTranslationClient fastClient;
    private final AuditPublisher auditPublisher;

    public TranslationResult translate(TranslationCommand cmd) {
        TranslationSessionState state = sessionRepository.findBySessionId(cmd.sessionId())
            .orElseGet(() -> TranslationSessionStateFactory.initialize(cmd));
        Map<String,String> glossary = glossaryTool.lookup(
            cmd.tenantId(), cmd.sourceLang(), cmd.targetLang(), cmd.text());
        List<String> context = ContextWindowUtil.appendAndTrim(
            state.recentSegments(), cmd.text(), 8);
        TranslationRequest request = new TranslationRequest(
            cmd.tenantId(), cmd.sessionId(), cmd.sourceLang(), cmd.targetLang(),
            cmd.text(), context, glossary);
        TranslationResult result;
        try {
            result = llmClient.translate(request);
        } catch (Exception e) {
            // LLM call failed: degrade to the fast engine instead of failing the segment.
            result = fastClient.translate(request);
        }
        TranslationSessionState newState = new TranslationSessionState(
            state.sessionId(), state.tenantId(), state.userId(),
            cmd.sourceLang(), cmd.targetLang(), state.glossaryVersion(),
            state.degradeLevel(), context, state.attributes(), Instant.now());
        sessionRepository.save(newState, Duration.ofMinutes(30));
        auditPublisher.publish(cmd, result, glossary);
        return result;
    }
}

Principles: restore session state first, prefer high‑quality LLM and fall back to fast translation on error, and emit an audit event for replay.
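The service above calls ContextWindowUtil.appendAndTrim without showing it. One plausible shape for such a helper is sketched below; the class name ContextWindowSketch and the exact semantics (append the latest segment, keep the last N) are assumptions, since the original implementation is not given.

```java
import java.util.ArrayList;
import java.util.List;

/** One possible shape for an append-and-trim context window helper. */
final class ContextWindowSketch {
    private ContextWindowSketch() {}

    /** Returns a new immutable list: recent + latest, keeping only the last maxSegments entries. */
    static List<String> appendAndTrim(List<String> recent, String latest, int maxSegments) {
        if (maxSegments <= 0) throw new IllegalArgumentException("maxSegments must be > 0");
        List<String> merged = new ArrayList<>();
        if (recent != null) merged.addAll(recent); // tolerate a brand-new session with no history
        merged.add(latest);
        int from = Math.max(0, merged.size() - maxSegments);
        return List.copyOf(merged.subList(from, merged.size()));
    }
}
```

Returning a fresh immutable list keeps the stored session state from being mutated by accident while a request is in flight.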

WebSocket Handler (thin entry layer)

@Component
public class TranslationWebSocketHandler extends BinaryWebSocketHandler {
    private final AudioChunkDispatcher dispatcher;
    private final SessionRegistry registry;

    @Override
    public void afterConnectionEstablished(WebSocketSession session) {
        registry.register(session);
    }

    @Override
    protected void handleBinaryMessage(WebSocketSession session, BinaryMessage message) {
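        // Copy only the readable bytes: ByteBuffer.array() throws for read-only or direct
        // buffers and can expose bytes outside the payload.
        java.nio.ByteBuffer payload = message.getPayload();
        byte[] audio = new byte[payload.remaining()];
        payload.get(audio);
        AudioChunk chunk = AudioChunk.from(session, audio);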
        dispatcher.dispatch(chunk);
    }

    @Override
    public void afterConnectionClosed(WebSocketSession session, CloseStatus status) {
        registry.unregister(session.getId());
    }
}

The handler only manages connections, parses the protocol and forwards audio chunks; it never calls ASR, translation or TTS directly.

Kafka Event Publishing & Consumption

@Component
public class TranslationEventPublisher {
    private final KafkaTemplate<String,Object> kafka;
    public void publishTranslationRequest(TranslationCommand cmd) {
        kafka.send("translation-request-topic", cmd.sessionId(), cmd);
    }
}

@Component
public class TranslationRequestConsumer {
    private final TranslationAgentService agentService;
    private final TranslationResultPushService pushService;
    @KafkaListener(topics = "translation-request-topic", groupId = "translation-agent-group")
    public void consume(TranslationCommand cmd) {
        TranslationResult result = agentService.translate(cmd);
        pushService.push(result);
    }
}

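Key points: use sessionId as the Kafka key, keep consumer logic non‑blocking, and decouple result push from execution. The listener above is simplified and calls the agent and the push service inline; to decouple the push, publish the result to translation-result-topic and let the gateway deliver it, so a slow client connection does not stall the partition.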

Fallback Service

@Service
public class TranslationFallbackService {
    public TranslationResult fallback(TranslationCommand cmd, Throwable t) {
        return new TranslationResult(
            cmd.sessionId(), cmd.segmentId(), cmd.sourceLang(), cmd.targetLang(),
            "[System switched to fast translation] " + cmd.text(),
            "FAST_FALLBACK", Instant.now());
    }
}

Typical degradations: LLM timeout → fast MT; TTS unavailable → text‑only; term‑base timeout → cached terms.
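The first degradation in that list (LLM timeout → fast MT) can be sketched with a deadline around the primary call. This is an illustration using only the JDK; the class TimedFallback and its parameters are assumptions, and a production service would more likely rely on the timeout and circuit‑breaker features of Sentinel or Resilience4j mentioned earlier.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

/** Runs the primary translator with a deadline and switches to the fast path on timeout or error. */
final class TimedFallback {
    private TimedFallback() {}

    static String translate(Supplier<String> primary, Supplier<String> fast, long timeoutMs) {
        // Runs on the common ForkJoinPool; a real service would pass a dedicated executor.
        CompletableFuture<String> future = CompletableFuture.supplyAsync(primary);
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            // cancel() only marks the future as cancelled; it does not interrupt the running call.
            future.cancel(true);
            return fast.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return fast.get();
        }
    }
}
```

Since the abandoned primary call keeps running after the timeout, the underlying HTTP client still needs its own request timeout to release connections.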

Observability & Governance

Metrics (Prometheus)

translation_request_total

translation_latency_ms

translation_fallback_total

asr_segment_latency_ms

tts_first_audio_latency_ms

active_session_count

kafka_consumer_lag

Tracing

Use sessionId + segmentId as the correlation key (for example, as span attributes alongside the generated trace ID) and instrument each stage: WebSocket entry, audio chunking, ASR, term lookup, LLM, TTS, result push.

Logging & Audit

User input summary

Tool invocation records

Degradation decision reasons

Key state snapshots

Final output

Separate business logs (debugging) from audit logs (replay, compliance).

Performance Optimization Checklist

Prioritized Items

Validate audio segment size (≈500 ms for live chat).

Enable connection reuse for downstream model calls.

Reduce serialization overhead of session state.

Monitor and control Kafka lag during peaks.

Detect hotspot sessions and balance them across nodes.

High‑Value Techniques

Context window trimming: keep only the latest 5‑8 segments, summarize older dialogue, store structured facts separately.

Dual‑layer translation: fast baseline model for most segments, LLM enhancement for low‑confidence or critical turns.

Connection pooling & async calls: HTTP pool, clear timeouts, concurrency caps, avoid blocking the consumer pipeline.

Tenant isolation: dedicated Kafka topics, consumer groups, and resource quotas for large customers.
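Dual‑layer translation comes down to a routing decision per segment. The sketch below assumes the fast model returns a confidence score in [0, 1] and that the caller can flag a turn as critical; both assumptions, along with the names DualLayerRouter, Scored and Routed, are illustrative rather than taken from the original system.

```java
import java.util.function.Function;

/** Fast model first; escalate to the LLM only for low-confidence or critical segments. */
final class DualLayerRouter {
    /** Hypothetical fast-model output: translated text plus a confidence score in [0, 1]. */
    record Scored(String text, double confidence) {}

    /** Final result with the engine that produced it, useful for metrics and audit. */
    record Routed(String text, String engine) {}

    private final Function<String, Scored> fastModel;
    private final Function<String, String> llm;
    private final double minConfidence;

    DualLayerRouter(Function<String, Scored> fastModel, Function<String, String> llm, double minConfidence) {
        if (minConfidence < 0 || minConfidence > 1) {
            throw new IllegalArgumentException("minConfidence must be in [0, 1]");
        }
        this.fastModel = fastModel;
        this.llm = llm;
        this.minConfidence = minConfidence;
    }

    Routed translate(String text, boolean critical) {
        if (!critical) {
            Scored fast = fastModel.apply(text);
            if (fast.confidence() >= minConfidence) return new Routed(fast.text(), "FAST");
        }
        // Critical turns skip the fast model entirely; low-confidence ones are redone by the LLM.
        return new Routed(llm.apply(text), "LLM");
    }
}
```

The threshold is the cost lever: raising minConfidence sends more segments to the LLM, lowering it keeps more on the fast path.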

Common Pitfalls

Focusing on average latency while ignoring P95/P99 spikes.

Storing long‑connection state only in pod memory – leads to loss on restart.

Routing every request through a high‑cost LLM – causes cost explosion and timeouts.

Missing governance on tool calls – the agent merely shifts the bottleneck downstream.

Lack of replay capability – impossible to debug translation errors without stored events.

Recommended Rollout Path

Build a single‑machine MVP (WebSocket + ASR + Agent translation) and validate latency and term strategy.

Add engineering foundations: Redis session store, Kafka event bus, degradation & retry, metrics & tracing.

Split services, conduct load‑testing, identify P95 bottlenecks.

Migrate to Kubernetes, configure HPA and a central config service, and roll out with gray releases and rollback.

Cloud‑Native Deployment

Containerization Principles

Build separate Docker images for gateway, translation‑agent, ASR, TTS, and session‑admin services.

Avoid monolithic images to keep startup fast, enable role‑based resource limits, and simplify GPU/CPU separation.

Kubernetes Strategy

Separate node pools: CPU‑only for gateway/agent/session services; GPU nodes for ASR/TTS or heavy inference.

HPA based on CPU plus custom or external metrics (queue length, Kafka lag, active WebSocket connections); the non‑CPU signals need a metrics adapter to expose them to the HPA.

PodDisruptionBudget and preStop hooks to gracefully drain long‑lived WebSocket connections before shutdown.
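On the application side, a preStop hook is only useful if the gateway can stop admitting sessions and wait for the existing ones to finish. A minimal sketch of that bookkeeping follows; the class ConnectionDrainer is hypothetical, the polling interval is arbitrary, and there is a small race between tryOpen and drain that a production version would close with a lock.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

/** Tracks active connections and lets a shutdown hook wait for them, up to a deadline. */
final class ConnectionDrainer {
    private final AtomicInteger active = new AtomicInteger();
    private final AtomicBoolean draining = new AtomicBoolean();

    /** Returns false once draining has started, so the gateway can refuse new sessions. */
    boolean tryOpen() {
        if (draining.get()) return false;
        active.incrementAndGet();
        return true;
    }

    /** Call when a connection closes. */
    void close() { active.decrementAndGet(); }

    /** Stops admitting connections and polls until none remain or the deadline passes. */
    boolean drain(long timeoutMs) {
        draining.set(true);
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        try {
            while (active.get() > 0 && System.nanoTime() < deadline) {
                Thread.sleep(10);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // stop waiting but keep the interrupt flag
        }
        return active.get() == 0;
    }
}
```

The drain timeout should stay below the pod's terminationGracePeriodSeconds, otherwise the kubelet kills the container before the wait completes.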

Sample Deployment (YAML)

apiVersion: apps/v1
kind: Deployment
metadata:
  name: translation-agent-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: translation-agent-service
  template:
    metadata:
      labels:
        app: translation-agent-service
    spec:
      containers:
      - name: app
        image: registry.example.com/translation-agent-service:1.0.0
        ports:
        - containerPort: 8080
        env:
        - name: REDIS_HOST
          value: redis-master
        - name: KAFKA_BOOTSTRAP_SERVERS
          value: kafka:9092
        readinessProbe:
          httpGet:
            path: /actuator/health/readiness
            port: 8080
        livenessProbe:
          httpGet:
            path: /actuator/health/liveness
            port: 8080
        resources:
          requests:
            cpu: "500m"
            memory: "1Gi"
          limits:
            cpu: "2"
            memory: "4Gi"
