How Knative Cuts AI Service Costs by 60% and Halves Deployment Time
This article explains how Shuhe Tech combined Knative with AI workloads to achieve 60% resource cost savings and reduce model deployment cycles from one day to half a day, detailing Knative's architecture, request‑based autoscaling, multi‑version releases, and advanced scaling features.
Background
AI services in finance require heavy compute and frequent model iteration, leading to high resource consumption. Deploying multiple model versions simultaneously increases cost.
Knative Overview
Knative is an open‑source serverless framework on Kubernetes providing request‑driven autoscaling, scaling to zero, and traffic‑based rollout.
Core modules: Serving (lightweight hosting) and Eventing (Broker/Trigger model).
A Knative Service consists of a Configuration (defines the workload; each change creates a new Revision) and a Route (traffic routing).
Traffic‑Based Gray Release
When a new Revision (e.g., V2) is created, traffic can be split between revisions (e.g., 70% to V1, 30% to V2). Adjusting the split validates the new version and enables instant rollback. Tags can be attached to revisions for canary testing via direct URLs.
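As a sketch, such a traffic split is declared in the Service spec (service, image, and revision names below are hypothetical):

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: model-service                          # hypothetical service name
spec:
  template:
    spec:
      containers:
        - image: registry.example.com/model:v2 # hypothetical image
  traffic:
    - revisionName: model-service-v1           # hypothetical revision names
      percent: 70
    - revisionName: model-service-v2
      percent: 30
      tag: canary   # exposes a dedicated URL for direct canary testing
```

Shifting the percent values rolls traffic forward; setting V1 back to 100 is an instant rollback. The tag gives the new revision its own URL so it can be tested directly before it receives live traffic.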
Request‑Based Autoscaling (KPA)
Knative injects a queue‑proxy sidecar into each pod to collect request‑level metrics (concurrency or RPS). The autoscaler reads these metrics and computes the required pod count.
Pod count formula:
pods = total concurrent requests / (per‑pod max concurrency × target utilization)
Two scaling modes:
Stable mode: uses a 60‑second window to calculate average concurrency.
Panic mode: uses a shorter window (default 6 seconds). If the panic‑mode pod count exceeds twice the current ready pods, panic scaling is applied.
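The decision above can be sketched in a few lines of Python. This is an illustration of the formula and the panic threshold as described here, not Knative's actual implementation; the per‑pod concurrency and utilization defaults are made up for the example:

```python
import math


def desired_pods(total_concurrency: float,
                 per_pod_concurrency: float,
                 target_utilization: float) -> int:
    """pods = total concurrent requests / (per-pod max concurrency * target utilization)."""
    return math.ceil(total_concurrency / (per_pod_concurrency * target_utilization))


def autoscale(stable_concurrency: float, panic_concurrency: float,
              ready_pods: int, per_pod: float = 10.0,
              utilization: float = 0.7) -> int:
    stable = desired_pods(stable_concurrency, per_pod, utilization)  # 60 s window average
    panic = desired_pods(panic_concurrency, per_pod, utilization)    # 6 s window average
    # Panic mode: if the short-window estimate exceeds 2x the ready pods, act on it.
    if panic > 2 * ready_pods:
        return panic
    return stable
```

For example, with 100 concurrent requests averaged over the stable window, 10 requests per pod, and 70% target utilization, the autoscaler asks for 15 pods; a sudden 6‑second spike to 350 concurrent requests against 10 ready pods trips panic mode and jumps straight to 50.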
When traffic drops to zero, KPA switches the pod access mode to Proxy, allowing the deployment to scale down to zero.
Cold‑Start Mitigation
Configure a scale‑to‑zero grace period and a pod‑retention period so pods linger for a while after traffic stops, instead of being removed immediately.
Lower the target utilization percentage to keep extra pods warm.
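Both mitigations map to per‑revision autoscaling annotations; a minimal sketch, with illustrative names and values:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: model-service                            # hypothetical name
spec:
  template:
    metadata:
      annotations:
        # Keep the last pod around after traffic stops to absorb a quick return of load.
        autoscaling.knative.dev/scale-to-zero-pod-retention-period: "10m"
        # Lower target utilization so extra warm pods are kept ahead of demand.
        autoscaling.knative.dev/target-utilization-percentage: "60"
    spec:
      containers:
        - image: registry.example.com/model:v2   # hypothetical image
```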
Advanced Scaling Features
Reserved Resource Pool: combines steady‑state ECS nodes with burst‑capacity ECI nodes, enabling cost‑effective scaling during traffic spikes.
Precise Concurrency Control: limits per‑pod request concurrency, essential for GPU‑intensive AIGC models.
Advanced Horizontal Pod Autoscaler (AHPA): predicts future load from historical metrics to pre‑scale resources.
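Precise concurrency control, for example, combines a hard per‑pod limit (containerConcurrency) with a soft autoscaling target; a sketch with illustrative values:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: aigc-model                 # hypothetical name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "1"  # soft target the autoscaler aims for
    spec:
      containerConcurrency: 1      # hard limit: one in-flight request per GPU pod
      containers:
        - image: registry.example.com/aigc:latest  # hypothetical image
          resources:
            limits:
              nvidia.com/gpu: "1"
```

With containerConcurrency set to 1, the queue‑proxy buffers any excess requests rather than letting them contend for the GPU, and the autoscaler adds pods to absorb the backlog.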
Event‑Driven Integration
Knative Eventing is integrated with Alibaba Cloud EventBridge, providing a reliable event bus for production‑grade event delivery.
End‑to‑End Deployment Workflow
Model artifacts are stored in BetterCDS, which creates a versioned package.
A CI pipeline builds a Docker image from the model and pushes it to a registry.
The pipeline updates a Knative Service, creating a new Revision and Deployment.
After all pods become ready, the pipeline tags the Revision and adds a Route entry for traffic splitting.
This Configuration‑Revision mapping enables simultaneous serving of multiple model versions and rapid traffic switching.
Cluster Scaling Architecture
Workloads run on Alibaba Cloud Container Service for Kubernetes (ACK) with a hybrid node pool: long‑running ECS nodes for baseline traffic and virtual nodes (elastic instances) for burst capacity. The architecture provides high elasticity while minimizing idle resource costs.
Observed Results
Deploying more than 500 AI model services on Knative reduced resource costs by approximately 60% and halved average deployment time (from one day to half a day). Peak pod count reached roughly 2,000; the system automatically scaled to zero during idle periods, achieving serverless efficiency.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.