How ACK Inference Gateway Tripled Large‑Model Performance for an Insurance Giant

This article details how Guotai Insurance tackled the high latency and cost of large‑model inference by deploying Alibaba Cloud's ACK Inference Gateway, which uses load‑aware, prefix‑aware routing, intelligent queuing, and comprehensive observability to boost efficiency threefold while reducing expenses.


Introduction

At the 2025 Cloud Expo, senior operations engineer Li Chunke from Guotai Property Insurance and senior R&D engineer Yin Hang from Alibaba Cloud presented the case study "Large‑Model Inference Traffic Scheduling Optimization Practice". The talk highlighted Guotai's success as an Alibaba Cloud user and examined the core capabilities and business value of the ACK Inference Gateway.

From Guotai Property Insurance Practice

Guotai Property Insurance, founded in 2008, pursues a "technology‑driven insurance" strategy and actively applies large‑model AI to underwriting, claims, and risk control. However, deploying private inference services introduced dual pressures of efficiency and cost, especially when handling sensitive personal data that requires on‑premise de‑identification before calling commercial models.

AI Inference Service Challenges

Unlike traditional network traffic, generative AI inference has high request latency (seconds to minutes), strong resource binding (single requests can occupy 100% of a GPU), frequent queuing, and unpredictable load per request.

Traditional Network Traffic vs. GenAI/LLM Traffic

Traditional network traffic: small request/response bodies, millisecond‑level processing, parallel handling, and polling‑ or utilization‑based traffic management.

GenAI/LLM traffic: high request latency (seconds to minutes), GPU‑bound requests, frequent queuing, and highly variable backend load per request.

Specific Business Pain Points

When Guotai deployed the Qwen 2.5 7B vision‑language model behind a traditional application load balancer, they encountered:

Inference latency up to 600 seconds, severely impacting business efficiency.

Uneven backend node load: some nodes queued requests while others remained idle, driving high costs.

Black‑box inference: only GPU utilization was visible, preventing fine‑grained performance analysis.

Solution

Guotai chose Alibaba Cloud's ACK Inference Gateway, a traffic‑management component built on the standard Gateway API and designed for generative AI workloads. It provides load‑aware routing, prefix‑aware load balancing, intelligent queuing, unified observability, and security features.

ACK AI Profiling

To address the black‑box problem, ACK offers a full‑stack, multi‑dimensional AI profiling system that captures GPU metrics, CPU usage, Python calls, CUDA kernels, system calls, and Torch‑Profiler data with near‑zero overhead. This enables precise identification of why GPU utilization is low and guides traffic‑distribution improvements.

Core Capabilities of the ACK Inference Gateway

The gateway continuously collects real‑time metrics from each backend inference pod:

Request queue length.

GPU KV‑Cache utilization tracking.

LoRA model loading status.

Load‑Aware Load Balancing

Instead of distributing requests evenly, the gateway evaluates real‑time internal metrics (queue length, KV‑Cache usage, LoRA status) and routes each new request to the least‑loaded node, cutting inference latency from 600 seconds to 180 seconds, a better than threefold improvement.
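To make the idea concrete, here is a minimal sketch of metric‑based backend selection. The scoring function, weights, and names (`PodMetrics`, `pick_least_loaded`) are invented for illustration; the gateway's actual scoring logic is internal to the product.

```python
from dataclasses import dataclass

@dataclass
class PodMetrics:
    """Snapshot of the per-pod signals the gateway is described as tracking."""
    name: str
    queue_length: int     # pending requests on the inference engine
    kv_cache_util: float  # KV-Cache utilization, 0.0-1.0
    lora_loaded: bool     # whether the target LoRA adapter is already resident

def pick_least_loaded(pods, queue_weight=1.0, cache_weight=10.0, lora_penalty=5.0):
    """Route to the pod with the lowest composite load score.

    The weights are illustrative, not the gateway's real parameters.
    """
    def score(p):
        s = queue_weight * p.queue_length + cache_weight * p.kv_cache_util
        if not p.lora_loaded:
            s += lora_penalty  # loading an adapter first would add latency
        return s
    return min(pods, key=score)

pods = [
    PodMetrics("pod-a", queue_length=8, kv_cache_util=0.9, lora_loaded=True),
    PodMetrics("pod-b", queue_length=1, kv_cache_util=0.2, lora_loaded=True),
    PodMetrics("pod-c", queue_length=0, kv_cache_util=0.1, lora_loaded=False),
]
print(pick_least_loaded(pods).name)  # pod-b: short queue, cool cache, adapter loaded
```

A plain round‑robin balancer would have sent every third request to the saturated pod‑a; scoring on engine internals is what avoids the idle‑node/queued‑node imbalance described above.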

Prefix‑Aware Load Balancing

Two modes are offered:

Estimation mode: Analyzes request prefixes and routes similar‑prefix requests to the same pod, without requiring engine‑specific support.

Precise mode: Consumes the real KV‑Cache block distribution via ZeroMQ from vLLM v0.10.0+, achieving higher cache‑hit rates and greater performance gains.

Both modes enable use cases such as long‑document queries and multi‑turn conversations.
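One simple way to realize estimation‑mode behavior is to hash a fixed‑length prefix of the prompt and map it to a pod, so requests that share a long prefix (the same document, the same conversation history) land where that prefix's KV‑Cache already lives. This is a hedged sketch of the concept, not the gateway's implementation; `route_by_prefix` and the 256‑character window are assumptions.

```python
import hashlib

def route_by_prefix(prompt: str, pods: list[str], prefix_chars: int = 256) -> str:
    """Hash the prompt's leading characters and map the digest to a pod,
    giving sticky routing for requests that share a common prefix."""
    prefix = prompt[:prefix_chars]
    digest = hashlib.sha256(prefix.encode("utf-8")).digest()
    return pods[int.from_bytes(digest[:8], "big") % len(pods)]

pods = ["pod-a", "pod-b", "pod-c"]
doc = "Policy document #4711: ..." + "x" * 500  # stands in for a long document

# Two different questions about the same document share its long prefix,
# so both are routed to the same pod and can reuse its cached prefix.
q1 = route_by_prefix(doc + "\nQ: What is the deductible?", pods)
q2 = route_by_prefix(doc + "\nQ: Who is the beneficiary?", pods)
print(q1 == q2)  # True
```

Precise mode replaces the hash heuristic with the engine's actual KV‑Cache block map, so it can also route correctly when caches are evicted or replicated across pods.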

Model and Infrastructure Canary Releases

The gateway supports traffic‑splitting via HTTPRoute, allowing seamless upgrades of hardware (e.g., A10 → L20) or model versions (e.g., Qwen 2.5 → Qwen 3) without service disruption.
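In Gateway API terms, a canary is an HTTPRoute whose backendRefs carry weights. The sketch below reproduces just the weighted‑split behavior in Python; the service names (`qwen2-5-svc`, `qwen3-svc`) and the 90/10 split are illustrative, not from the case study.

```python
import random

def split_traffic(backends, rng=random):
    """Pick a backend with probability proportional to its weight,
    mirroring HTTPRoute backendRef weight semantics."""
    names = [b["name"] for b in backends]
    weights = [b["weight"] for b in backends]
    return rng.choices(names, weights=weights, k=1)[0]

# Keep 90% of traffic on the current deployment, trial 10% on the new one.
backends = [
    {"name": "qwen2-5-svc", "weight": 90},
    {"name": "qwen3-svc", "weight": 10},
]

rng = random.Random(0)  # seeded for reproducibility
counts = {"qwen2-5-svc": 0, "qwen3-svc": 0}
for _ in range(10_000):
    counts[split_traffic(backends, rng)] += 1
print(counts)  # roughly 9000 / 1000
```

Ramping the canary is then just editing the weights (90/10 → 50/50 → 0/100), which is what makes hardware or model upgrades non‑disruptive.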

Traffic Observability

ACK provides Prometheus metrics (TTFT, TPOT, KV‑Cache utilization, token rate) and enriched access logs (input/output token usage, model names, request/response times) that conform to OpenTelemetry Gen AI standards, enabling fine‑grained performance monitoring and cost accounting.
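TTFT (time to first token) and TPOT (time per output token) are the two standard streaming‑latency metrics; as a sketch, they can be derived from per‑token emission timestamps as below. The function name and timestamp format are assumptions for illustration, not the gateway's exporter code.

```python
def streaming_latency_metrics(request_start: float, token_times: list[float]) -> dict:
    """Derive TTFT and TPOT from a request's token emission timestamps (seconds).

    TTFT: delay from request arrival to the first streamed token.
    TPOT: mean inter-token gap over the remaining output tokens.
    """
    ttft = token_times[0] - request_start
    if len(token_times) > 1:
        tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    else:
        tpot = 0.0
    return {"ttft_s": ttft, "tpot_s": tpot}

# Example: 1 s until the first token, then one token every 50 ms.
times = [1.0 + 0.05 * i for i in range(5)]
m = streaming_latency_metrics(0.0, times)
print(m)  # ttft_s = 1.0, tpot_s = 0.05
```

Separating the two matters operationally: a rising TTFT usually points at queuing or prefill pressure, while a rising TPOT points at decode‑phase contention on the GPU.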

Intelligent Request Queuing

The gateway can proactively queue requests when backend nodes approach saturation, intelligently dequeue to newly ready nodes, and prioritize high‑value models, preventing overload and reducing wasted GPU cycles.
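The queuing behavior can be pictured as a priority queue in front of the pod pool: requests wait while all pods are saturated, and when capacity frees up, the highest‑priority request that arrived earliest goes first. This is a conceptual sketch; the class and the example request names are invented, and the real gateway's scheduling policy is configured declaratively.

```python
import heapq
import itertools

class PriorityRequestQueue:
    """Gateway-side holding queue: lower priority number dispatches first,
    with FIFO ordering among requests of equal priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # arrival order, breaks priority ties

    def enqueue(self, request_id: str, priority: int) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), request_id))

    def dispatch_to(self, pod: str) -> tuple[str, str]:
        """Called when a pod frees up: hand it the next waiting request."""
        priority, _, request_id = heapq.heappop(self._heap)
        return request_id, pod

q = PriorityRequestQueue()
q.enqueue("batch-report", priority=5)
q.enqueue("claims-chat", priority=1)  # high-value interactive model
q.enqueue("batch-ocr", priority=5)

print(q.dispatch_to("pod-a"))  # ('claims-chat', 'pod-a')
```

Holding requests at the gateway rather than on the pods is what keeps a saturated engine from thrashing, and what lets a latency‑sensitive model jump ahead of batch work.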

Other Features

Unified management of private model deployments (InferencePool) and external MaaS services (Backend API).

Automatic HTTP→HTTPS upgrades, model/stream configuration, and API‑KEY handling for external services.

LLM Request Security

The gateway integrates with Alibaba Cloud Content Security to inspect and reject non‑compliant requests or responses, providing a unified security layer for both private and MaaS inference services.

Business Value Delivered

By adopting the ACK Inference Gateway, Guotai achieved a three‑fold reduction in inference latency, higher GPU utilization, and significant cost savings, demonstrating that the solution can scale large‑model services efficiently in the insurance sector.

Conclusion

The ACK Inference Gateway addresses the unique challenges of generative AI workloads through load‑aware, prefix‑aware, and intelligent queuing mechanisms, unified observability, and security, providing a practical path for large‑model deployment in regulated industries.

References

Intelligent routing and traffic management with ACK Gateway Inference Extension: https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/intelligent-routing-and-traffic-management-with-ack-gateway-inference-extension

Prefix‑aware load balancing using intelligent inference routing: https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/prefix-aware-load-balancing-using-intelligent-inference-routing

Blog on KV‑Cache benefits: https://llm-d.ai/blog/kvcache-wins-you-can-see

Canary release of generative AI inference service: https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/use-ack-gateway-with-inference-extension-to-implement-canary-release-of-generated-ai-inference-service

Observing generative AI requests: https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/monitor-generative-ai-requests?spm=a2c4g.11186623.help-menu-85222.d_2_3_5_1_13.9427144cVtV3XJ

Request queueing and priority scheduling: https://help.aliyun.com/zh/ack/ack-managed-and-ack-dedicated/user-guide/intelligent-inference-routing-with-request-queueing-and-priority-scheduling

Tags: cloud-native, large language models, Traffic Scheduling, AI inference, ACK Gateway
Written by Alibaba Cloud Infrastructure
