
Why HPA Falls Short for LLMs and How Kthena Autoscaler Redefines Elastic Scaling

The article explains why the traditional Kubernetes HPA cannot meet the unique demands of large‑language‑model inference, introduces Kthena Autoscaler’s model‑aware architecture, dual stable/panic scaling modes, cost‑aware algorithm, and flexible policy bindings, and provides practical configuration and observability guidance.

Huawei Cloud Developer Alliance

As large language models (LLMs) become the core engine of modern AI applications, the operational challenge shifts from routing and orchestration to contention for resources over time: how to determine, in real time, the right number of inference instances.

1. Why LLM inference needs dedicated autoscaling

LLM workloads differ from traditional services in several ways:

Business‑metric driven: queue length and KV‑Cache utilization expose service saturation more directly than CPU or memory usage.

Burst traffic: sudden request spikes require rapid scale‑up to keep latency within SLOs.

Prefill/Decode asymmetry: prefill is typically compute‑bound and decode memory‑bandwidth‑bound, so the two stages need to scale independently.

Heterogeneous hardware and cost: GPU/NPU instances differ in their performance‑cost trade‑offs, which calls for fine‑grained scheduling.

Neither the standard Kubernetes HPA nor KEDA is model‑aware. Kthena Autoscaler fills that gap by collecting pod‑level business metrics directly and by supporting role‑level scaling and cost‑aware optimization.

2. Architecture overview

Kthena Autoscaler runs as a sub‑controller of kthena-controller-manager inside the Kubernetes ecosystem. It fetches metrics from the pod /metrics endpoint (e.g., vllm:num_requests_waiting, vllm:kv_cache_usage_perc) and applies user‑defined policies in a closed‑loop fashion.
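To make the metric-collection step concrete, here is a minimal sketch of reading one of those signals out of a Prometheus-style /metrics payload. The function name parse_metric and the sample payload are illustrative assumptions, not part of Kthena; only the two metric names come from the text above.

```python
def parse_metric(exposition_text: str, metric_name: str) -> list[float]:
    """Return every sample value for metric_name in Prometheus text format."""
    values = []
    for raw in exposition_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "{" in line:
            # Labelled sample: name{label="x"} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Unlabelled sample: name value [timestamp]
            name, _, rest = line.partition(" ")
        if name != metric_name:
            continue
        fields = rest.split()
        if fields:
            values.append(float(fields[0]))
    return values


# Hypothetical payload, shaped like what an inference pod might expose.
SAMPLE = """\
# TYPE vllm:num_requests_waiting gauge
vllm:num_requests_waiting{model_name="demo"} 12.0
# TYPE vllm:kv_cache_usage_perc gauge
vllm:kv_cache_usage_perc{model_name="demo"} 0.83
"""

waiting = parse_metric(SAMPLE, "vllm:num_requests_waiting")
kv_usage = parse_metric(SAMPLE, "vllm:kv_cache_usage_perc")
```

A real controller would fetch this text over HTTP from each pod and aggregate the values across pods; that part is omitted here.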

3. General policy (AutoscalingPolicy)

3.1 Core metrics and tolerance

The policy reads the pod’s /metrics to obtain inference‑specific signals. Users set a targetValue and a tolerancePercent to avoid frequent scaling on minor fluctuations.
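The effect of a tolerance band can be sketched as follows. This assumes an HPA-style proportional rule (desired = ceil(current × observed / target)); the article does not state that Kthena uses exactly this formula, and the function name desired_replicas is hypothetical.

```python
import math


def desired_replicas(current: int, observed: float, target: float,
                     tolerance_percent: float) -> int:
    """Proportional scaling with a dead band around the target.

    observed is the per-pod average of the metric. Assumption: an
    HPA-style rule, not Kthena's confirmed algorithm.
    """
    ratio = observed / target
    # Inside the tolerance band: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance_percent / 100.0:
        return current
    # Outside the band: scale proportionally, never below one replica.
    return max(1, math.ceil(current * ratio))


# With targetValue=100 and tolerancePercent=50, a queue of 120 is ignored,
# while a queue of 200 doubles the replica count.
unchanged = desired_replicas(4, 120, 100, 50)
scaled_up = desired_replicas(4, 200, 100, 50)
```

A wider band trades responsiveness for fewer scaling actions, which is why the choice of tolerancePercent matters for bursty inference traffic.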

3.2 Scaling behavior: Stable vs. Panic mode

Stable Mode: uses a longer stabilization window (e.g., 1 minute) to observe sustained trends and ignore transient spikes.

Panic Mode: triggered when a metric exceeds a high threshold (e.g., more than 150 % of the target); it bypasses the stabilization window so that scaling happens within seconds.

# Queue-length policy: stable-mode scale-up, slower scale-down
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: AutoscalingPolicy
metadata:
  name: vllm-queue-policy
spec:
  metrics:
  - metricName: vllm:num_requests_waiting
    targetValue: 100
    tolerancePercent: 50
  behavior:
    scaleUp:
      stablePolicy:
        stabilizationWindow: 1m
        period: 30s
    scaleDown:
      stabilizationWindow: 5m
      period: 1m
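The difference between the two modes can be illustrated with a small sketch. It assumes that stable mode averages the metric over a sliding window of samples and that panic mode reacts to the latest sample alone once it passes 150 % of the target (the example threshold mentioned above). The class name ModeSelector and the sample-count window are assumptions for illustration, not Kthena's implementation.

```python
import math
from collections import deque


class ModeSelector:
    """Pick stable or panic mode and recommend a replica count."""

    def __init__(self, target: float, window_size: int,
                 panic_threshold_percent: float = 150.0):
        self.target = target
        self.panic_threshold_percent = panic_threshold_percent
        # Sliding window of recent samples (stands in for a time window).
        self.samples = deque(maxlen=window_size)

    def recommend(self, current_replicas: int, observed: float):
        self.samples.append(observed)
        panic_level = self.target * self.panic_threshold_percent / 100.0
        if observed > panic_level:
            # Panic: ignore the window and react to the latest sample.
            return "panic", math.ceil(current_replicas * observed / self.target)
        # Stable: use the windowed average to damp transient spikes.
        average = sum(self.samples) / len(self.samples)
        return "stable", math.ceil(current_replicas * average / self.target)


selector = ModeSelector(target=100.0, window_size=4)
for value in (100.0, 100.0, 100.0):
    selector.recommend(10, value)
# A brief rise to 140 is averaged down to 110 by the window.
damped = selector.recommend(10, 140.0)
# A jump to 300 exceeds the panic threshold and is acted on immediately.
urgent = selector.recommend(10, 300.0)
```

In this sketch the windowed average asks for 11 replicas where the raw sample alone would ask for 14, while the panic path triples the count in a single step.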

3.3 Cost‑aware optimization algorithm

When a binding spans multiple instance types, the algorithm sorts the available capacity by unit cost and expands greedily, governed by a cost expansion rate (the costExpansionRatePercent field in the binding example in section 6). This yields two benefits:

Cost efficiency: prefers lower‑cost instances for scaling up.

Cold‑start reduction: keeps the scaling sequence stable, reusing already‑running pods.
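A simplified sketch of the greedy idea: satisfy each pool's minimum first, then fill the cheapest pool before moving on to the next. It assumes every replica provides the same capacity and does not model the cost expansion rate, so it is an approximation of the behaviour described here rather than Kthena's actual algorithm. The Pool type and the allocate function are hypothetical; the pool names, costs, and limits mirror the heterogeneous binding example in section 6.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    cost: int          # relative cost per replica
    min_replicas: int
    max_replicas: int


def allocate(pools: list[Pool], total: int) -> dict[str, int]:
    """Spread `total` replicas across pools, cheapest first."""
    # Every pool keeps at least its minimum.
    allocation = {p.name: p.min_replicas for p in pools}
    remaining = total - sum(allocation.values())
    # sorted() is stable, so equal-cost pools keep their declared order
    # and the fill sequence is the same on every run.
    for pool in sorted(pools, key=lambda p: p.cost):
        if remaining <= 0:
            break
        extra = min(remaining, pool.max_replicas - pool.min_replicas)
        allocation[pool.name] += extra
        remaining -= extra
    return allocation


pools = [
    Pool("deepseek-h100", cost=100, min_replicas=1, max_replicas=10),
    Pool("deepseek-a100", cost=50, min_replicas=1, max_replicas=20),
]
small = allocate(pools, 12)   # the cheaper pool absorbs the growth
large = allocate(pools, 25)   # the cheaper pool is full, the rest spills over
```

Because the fill order is deterministic, scaling from 12 to 25 replicas only adds pods on top of those already running instead of reshuffling them, which is the cold-start benefit listed above.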

4. AutoscalingPolicyBinding (the "what" to scale)

Binding connects a policy to a concrete target, enabling different scaling shapes.

4.1 Binding to ServingGroup – fixed PD ratio scaling

The policy is attached to a ModelServing or its ServingGroup. The autoscaler treats the whole group as a single unit, preserving a predefined role ratio (e.g., prefill : decode = 1 : 2).

# Bind to ModelServing (group scaling)
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: AutoscalingPolicyBinding
metadata:
  name: vllm-group-binding
spec:
  policyRef:
    name: vllm-queue-policy
  homogeneousTarget:
    target:
      targetRef:
        kind: ModelServing
        name: deepseek-serving
    minReplicas: 1
    maxReplicas: 10
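Group-level scaling with a fixed role ratio amounts to clamping the group count and multiplying it out per role, as sketched below. The function name expand_group and the per-group role counts are illustrative assumptions; the 1 : 2 ratio and the 1 to 10 replica bounds come from the example above.

```python
def expand_group(desired_groups: int, min_replicas: int, max_replicas: int,
                 roles_per_group: dict[str, int]) -> dict[str, int]:
    """Clamp the group count, then derive per-role pod counts from it."""
    groups = max(min_replicas, min(max_replicas, desired_groups))
    # Each group carries the same role mix, so the ratio never drifts.
    return {role: count * groups for role, count in roles_per_group.items()}


ratio = {"prefill": 1, "decode": 2}
three_groups = expand_group(3, 1, 10, ratio)
capped = expand_group(14, 1, 10, ratio)   # limited by maxReplicas
```

Scaling the group as a unit keeps prefill and decode capacity in step, at the price of not being able to add decode pods alone.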

4.2 Binding to a Role – heterogeneous PD scaling

Using subTargets, the policy can be bound to a specific role (e.g., only decode), allowing independent scaling of prefill and decode replicas.

# Bind only to the decode role (assumes an AutoscalingPolicy named llm-scaling-policy is defined separately)
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: AutoscalingPolicyBinding
metadata:
  name: decode-independent-binding
spec:
  policyRef:
    name: llm-scaling-policy
  homogeneousTarget:
    target:
      targetRef:
        kind: ModelServing
        name: deepseek-serving
    subTargets:
      kind: Role
      name: decode
    minReplicas: 2
    maxReplicas: 8

5. Best practices & troubleshooting

Conservative start: use a wide tolerance band (15‑20 %) and a long stable window.

Role‑specific targets: set tighter thresholds for decode in heterogeneous scenarios.

Cost calibration: adjust the cost field according to actual cloud pricing or TCO.

Observability

Kthena Autoscaler exposes the following metrics on /metrics:

kthena_autoscaler_desired_replicas: the computed target replica count.

kthena_autoscaler_current_replicas: the observed replica count.

kthena_autoscaler_scaling_events_total: a counter of scaling actions.

6. Advanced: cost‑aware heterogeneous scaling example

In production, GPU resources vary in price and performance. Setting a different cost value for each target makes the autoscaler prefer low‑cost instances when scaling up and retain high‑efficiency instances when scaling down.

# Cross‑hardware cost‑optimized binding
apiVersion: workload.serving.volcano.sh/v1alpha1
kind: AutoscalingPolicyBinding
metadata:
  name: heterogeneous-cost-binding
spec:
  policyRef:
    name: vllm-queue-policy
  heterogeneousTarget:
    params:
    - target:
        targetRef:
          kind: ModelServing
          name: deepseek-h100
      cost: 100
      minReplicas: 1
      maxReplicas: 10
    - target:
        targetRef:
          kind: ModelServing
          name: deepseek-a100
      cost: 50
      minReplicas: 1
      maxReplicas: 20
    costExpansionRatePercent: 200

Conclusion

By decoupling scaling logic (Policy) from scaling targets (Binding), Kthena Autoscaler offers great flexibility. Binding to a ServingGroup yields stable, fixed‑ratio scaling, while binding to a Role enables fine‑grained heterogeneous scaling. Combined with its built‑in controller architecture and cost‑aware algorithm, it provides a solid foundation for building efficient, low‑cost LLM inference platforms.
