How Kthena Enables Production‑Grade LLM Inference on Kubernetes

This article analyzes the cloud‑native challenges of deploying large‑model inference on Kubernetes and presents Kthena’s architecture—ModelServing, Router, Autoscaler, and ModelBooster—along with Volcano integration, vLLM‑Ascend setup, and a real‑world Qwen3‑235B deployment case, highlighting performance gains and future directions.


Cloud‑Native Challenges for Distributed LLM Inference

Deploying large-model inference on Kubernetes runs into three major engineering problems: (1) the absence of multi-dimensional topology constraints, which leads to high-latency all-reduce traffic for tensor-parallel (TP) communication; (2) orchestration gaps in Prefill-Decode (PD) separation, causing imbalanced scaling between compute-intensive Prefill pods and memory-intensive Decode pods; and (3) stateless ingresses and Services that cannot observe KV-Cache distribution, resulting in low cache-hit rates and unnecessary recomputation.

Kthena: Declarative LLM Orchestration Platform

Kthena, a project from the Volcano community, transforms distributed inference workloads into topology‑aware atomic scheduling units by deeply integrating Volcano’s batch scheduling capabilities.

ModelServing – Load Modeling for LLMs

ModelServing is the execution layer that runs inference containers such as vLLM-Ascend. It is organized into three tiers (a hedged manifest sketch follows the list):

ModelServing Layer: Manages multiple ServingGroup instances, providing global topology-aware scheduling and gang scheduling.

ServingGroup Layer: The smallest unit that completes a full inference request, typically a Prefill pod plus several Decode pods. It supports seamless reconstruction during upgrades.

Role Layer: Defines concrete roles (compute or storage) with Entry/Worker pod templates to optimize intra-node communication.
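
To make the three tiers concrete, a minimal manifest might look like the sketch below. The apiVersion, kind, and every field name are assumptions made for illustration; they are not the published Kthena CRD schema, so consult the Kthena documentation for the real spec.

# Illustrative only: the API group, kind, and field names below are assumed.
apiVersion: serving.kthena.io/v1alpha1   # assumed API group/version
kind: ModelServing
metadata:
  name: qwen3-serving
spec:
  replicas: 2                            # number of ServingGroup instances
  servingGroup:                          # smallest unit serving a full request
    roles:
    - name: prefill                      # compute-intensive role
      replicas: 1
      template:                          # ordinary Pod template for this role
        spec:
          containers:
          - name: vllm
            image: quay.io/ascend/vllm-ascend:latest
    - name: decode                       # memory-intensive role
      replicas: 3
      template:
        spec:
          containers:
          - name: vllm
            image: quay.io/ascend/vllm-ascend:latest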

Kthena Router – Intelligent Traffic Hub

The Router routes requests based on model name, custom headers, or URI patterns and includes plugins such as Least Request and Random. It natively supports PD separation, directing Prefill and Decode phases to the appropriate pods, which improves hardware utilization.

KV-Cache awareness is implemented via a ScorePlugin that matches incoming token sequences against existing cache prefixes. This cache-aware routing yields up to a 2.7× increase in throughput, a roughly 73.5% reduction in first-token latency, and a more than 60% improvement in end-to-end latency.

Additional features include LoRA hot‑plug routing, token‑level rate limiting, gray‑release, and automatic failover.
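
A routing rule combining these capabilities could be declared along the following lines. This is a hedged sketch: the kind, API group, and field names are assumed for illustration and do not reproduce the actual Kthena Router API.

# Illustrative route; kind and field names are assumptions, not the real API.
apiVersion: router.kthena.io/v1alpha1    # assumed API group/version
kind: ModelRoute
metadata:
  name: qwen3-route
spec:
  rules:
  - match:
      modelName: qwen3-235b              # route by model name
      headers:
        x-canary: "true"                 # optional custom-header match
  - match:
      uriPrefix: /v1/chat/completions    # route by URI pattern
  loadBalancing:
    plugin: LeastRequest                 # one of the plugins named above
  pdSeparation:                          # send each phase to the right pods
    prefillTarget: qwen3-serving-prefill
    decodeTarget: qwen3-serving-decode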

Kthena Autoscaler – Smart Scaling for LLM Workloads

The Autoscaler monitors queue length and GPU utilization separately for Prefill and Decode pods and scales each role independently, avoiding the bottlenecks that a single, uniform Horizontal Pod Autoscaler (HPA) cannot address.
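
A per-role scaling policy might conceptually be declared as below; again, the resource kind and field names are assumptions rather than the real Kthena Autoscaler schema.

# Illustrative per-role autoscaling policy; kind and fields are assumed.
apiVersion: autoscaling.kthena.io/v1alpha1   # assumed API group/version
kind: ModelAutoscaler
metadata:
  name: qwen3-autoscaler
spec:
  targetRef:
    kind: ModelServing
    name: qwen3-serving
  policies:
  - role: prefill                        # scaled independently of decode
    minReplicas: 1
    maxReplicas: 8
    metrics:
    - type: queueLength                  # pending requests per pod
      target: 10
  - role: decode
    minReplicas: 2
    maxReplicas: 16
    metrics:
    - type: gpuUtilization               # percent
      target: 70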

ModelBooster – One‑Click Deployment

ModelBooster abstracts the myriad Kubernetes resources (Deployment, Service, ConfigMap, Secret, etc.) required for LLM inference. Users provide only model metadata; ModelBooster generates the necessary manifests and manages the lifecycle.
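
A hedged sketch of what such a metadata-only resource could look like; the kind, field names, and paths are assumptions for illustration:

# Illustrative ModelBooster-style resource; kind, fields, and paths are assumed.
apiVersion: booster.kthena.io/v1alpha1   # assumed API group/version
kind: ModelBooster
metadata:
  name: qwen3-235b
spec:
  model:
    name: Qwen3-235B
    source: /models/qwen3-235b           # hypothetical model location
  engine: vllm-ascend
  resources:
    acceleratorsPerReplica: 16
# A controller would expand this single object into the Deployment, Service,
# ConfigMap, Secret, and related resources described above.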

Deep Integration with Volcano

Multi‑Level Network Topology Awareness : Volcano’s HyperNode CRD represents racks and Top‑of‑Rack switches, allowing the scheduler to lock strongly communicating pods onto the same rack, reducing inter‑node latency to 1‑2 µs.
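
For orientation, a HyperNode describing one rack has roughly the shape below; exact field names can differ between Volcano releases, so treat this as a sketch rather than a copy-paste manifest.

# Approximate shape of a Volcano HyperNode grouping the nodes under one
# Top-of-Rack switch; verify field names against your Volcano version.
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: rack-a
spec:
  tier: 1                                # lowest tier: nodes under one ToR switch
  members:
  - type: Node
    selector:
      exactMatch:
        name: worker-node-1
  - type: Node
    selector:
      exactMatch:
        name: worker-node-2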

Atomic Gang Scheduling : The PodGroup mechanism treats all Prefill and Decode pods of a task as a single entity, guaranteeing simultaneous resource allocation and preventing partial launches or deadlocks.
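
A Volcano PodGroup expressing this all-or-nothing requirement looks roughly like the following; the member count and resource figures are illustrative.

# Gang-schedule one ServingGroup: with minMember set, Volcano will not start
# any pod of the group until all four (1 Prefill + 3 Decode) can be placed.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: qwen3-serving-group-0
spec:
  minMember: 4                           # 1 Prefill pod + 3 Decode pods
  queue: default
  minResources:                          # illustrative aggregate request
    cpu: "64"
    memory: 512Gi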

vLLM‑Ascend Installation

Install the engine with two pip commands:

pip install vllm==0.13.0
pip install vllm-ascend==0.13.0

Or pull the pre-built container image:

quay.io/ascend/vllm-ascend:latest

Source code is available at https://github.com/vllm-project/vllm-ascend.
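
As a quick smoke test of the container image inside Kubernetes, a single-pod manifest could look like the sketch below. The NPU resource name depends on which Ascend device plugin is installed in the cluster, and the model identifier is only an example; both are assumptions here.

# Minimal smoke-test Pod for the pre-built image; the NPU resource name and
# the model served are assumptions, adjust them to your cluster.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-ascend-smoke-test
spec:
  containers:
  - name: vllm
    image: quay.io/ascend/vllm-ascend:latest
    command: ["vllm", "serve", "Qwen/Qwen3-8B"]   # example model only
    ports:
    - containerPort: 8000                          # vLLM's default API port
    resources:
      limits:
        huawei.com/Ascend910: "1"                  # assumed NPU resource name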

Production Architecture: vLLM‑Ascend + Kthena

The stack consists of three layers:

Infrastructure layer – Kubernetes cluster, Volcano scheduler, compute nodes.

Kthena core component layer – controllers, ModelServing, Router, Autoscaler.

Inference engine layer – vLLM‑Ascend pods.

Case Study: Qwen3‑235B Dual‑Node Inference

Deploying the 235‑billion‑parameter Qwen3 model on two 16‑GPU machines follows three declarative steps:

1. Create a ConfigMap containing the model path and parallelism settings (a hedged sketch of this manifest appears after the steps):

kubectl apply -f config.yaml -n vllm-project

2. Deploy the ModelServing workload with the inference pod template and a headless service:

kubectl apply -f model_server.yaml -n vllm-project

3. Deploy the Router resources to expose the service:

kubectl apply -f router.yaml -n vllm-project

This declarative approach reduces dozens of hand-maintained Kubernetes objects to three declarative manifests applied with a few simple commands.
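
For reference, the step-1 ConfigMap could look roughly like the following; the key names and parallelism values are assumptions chosen to fit a two-node, 16-accelerator-per-node setup, not the exact keys expected by the deployment templates.

# Hedged sketch of config.yaml; key names and values are illustrative only.
apiVersion: v1
kind: ConfigMap
metadata:
  name: qwen3-235b-config
  namespace: vllm-project
data:
  MODEL_PATH: /models/Qwen3-235B         # hypothetical model mount path
  TENSOR_PARALLEL_SIZE: "16"             # accelerators per node
  PIPELINE_PARALLEL_SIZE: "2"            # split across the two nodes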

Conclusion and Outlook

Kthena demonstrates that production‑grade LLM services can be built on mature cloud‑native stacks by extending Kubernetes APIs and schedulers. Future work may include model‑specific scheduling policies, smarter cache replacement algorithms, and deeper service‑mesh integration.

Technical references:

Kthena website: https://kthena.volcano.sh/

Volcano GitHub organization: https://github.com/volcano-sh

Kthena repository: https://github.com/volcano-sh/kthena/

Tags: cloud-native, LLM, Kubernetes, Volcano, Kthena, vLLM-Ascend
Written by Huawei Cloud Developer Alliance

The Huawei Cloud Developer Alliance creates a tech sharing platform for developers and partners, gathering Huawei Cloud product knowledge, event updates, expert talks, and more. Together we continuously innovate to build the cloud foundation of an intelligent world.
