Best Practices for Deploying AI Model Inference on Knative
This guide collects best practices for deploying AI model inference services on Knative efficiently: externalizing model data, accelerating model loading with Fluid, configuring access secrets, ImageCache, graceful shutdown, probes, and autoscaling parameters, mixing ECS and ECI resources, sharing GPUs across containers, and enabling observability, with the goal of fast scaling, low cost, and high elasticity.
Knative combined with AI provides rapid deployment, high elasticity, and low‑cost advantages, especially for workloads that frequently change compute resources such as model inference. To maximize these benefits, avoid packaging the AI model inside the container image; instead store the model in external storage (OSS, NAS) and mount it via a PVC.
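When the model lives on NAS rather than OSS, a statically provisioned PV/PVC can also serve as the external store. The sketch below is illustrative only: the resource names are made up, and the driver name and mount-target address assume the Alibaba Cloud NAS CSI plugin is installed; verify both against your cluster.

```yaml
# Hedged sketch: statically provisioned NAS volume for model data.
# Resource names, server address, and path are placeholders.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-nas-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadOnlyMany
  csi:
    driver: nasplugin.csi.alibabacloud.com # assumes the NAS CSI plugin
    volumeHandle: model-nas-pv
    volumeAttributes:
      server: "xxx.cn-beijing.nas.aliyuncs.com" # replace with your NAS mount target
      path: "/models"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-nas-pvc
spec:
  accessModes:
    - ReadOnlyMany
  storageClassName: "" # bind statically, bypass the default StorageClass
  volumeName: model-nas-pv
  resources:
    requests:
      storage: 100Gi
```

The PVC can then be referenced from the Knative Service's volumes just like the Fluid-backed PVC shown later.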
Use Fluid (JindoRuntime) to accelerate model loading. Example YAML for a secret and a Fluid Dataset:
apiVersion: v1
kind: Secret
metadata:
  name: access-key
stringData:
  fs.oss.accessKeyId: your_ak_id # replace with your OSS access key ID
  fs.oss.accessKeySecret: your_ak_skrt # replace with your OSS access key secret
---
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: oss-data
spec:
  mounts:
    - mountPoint: "oss://{Bucket}/{path-to-model}" # replace with your bucket and path
      name: xxx
      path: "{path-to-model}" # path used by the application
      options:
        fs.oss.endpoint: "oss-cn-beijing.aliyuncs.com" # replace with the actual endpoint
      encryptOptions:
        - name: fs.oss.accessKeyId
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeyId
        - name: fs.oss.accessKeySecret
          valueFrom:
            secretKeyRef:
              name: access-key
              key: fs.oss.accessKeySecret
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: oss-data
spec:
  replicas: 2
  tieredstore:
    levels:
      - mediumtype: SSD
        volumeType: emptyDir
        path: /mnt/ssd0/cache
        quota: 100Gi
        high: "0.95"
        low: "0.7"
  fuse:
    properties:
      fs.jindofsx.data.cache.enable: "true"
    args:
      - -okernel_cache
      - -oro
      - -oattr_timeout=7200
      - -oentry_timeout=7200
      - -ometrics_port=9089
    cleanPolicy: OnDemand

After declaring the Dataset and JindoRuntime, a PVC with the same name is created and can be mounted into a Knative Service:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sd-backend
spec:
  template:
    spec:
      containers:
        - image: # replace with your image
          name: image-name
          ports:
            - containerPort: xxx # replace with your service port
              protocol: TCP
          volumeMounts:
            - mountPath: /data/models # path used by the program
              name: data-volume
      volumes:
        - name: data-volume
          persistentVolumeClaim:
            claimName: oss-data

Because AI container images are large (CUDA, PyTorch GPU, and similar dependencies), use ImageCache to speed up image pulling for ECI Pods:
apiVersion: eci.alibabacloud.com/v1
kind: ImageCache
metadata:
  name: imagecache-ai-model
  annotations:
    k8s.aliyun.com/eci-image-cache: "true" # enable cache reuse
spec:
  images:
    - # replace with your image
  imageCacheSize: 25 # GiB
  retentionDays: 7

Graceful shutdown is essential: on SIGTERM, mark the container not-ready so no new requests are routed to it, and set timeoutSeconds to roughly 1.2× the longest expected request time (e.g., 6 seconds for a 5-second request).
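Beyond per-request draining, scale-down itself can be delayed so Pods are not terminated the moment load drops, which also spares bursty workloads repeated cold starts. A hedged sketch using Knative's scale-down-delay annotation (the 15-minute value is an assumption; tune it to your traffic pattern):

```yaml
# Hedged sketch: keep replicas alive for a while after load drops.
# "15m" is an assumed delay, not a recommendation for every workload.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sd-backend
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/scale-down-delay: "15m"
```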
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      timeoutSeconds: 6

Knative probes differ from Kubernetes probes: Knative probes more aggressively to reduce cold-start latency. Define readiness and liveness probes in the Service spec as needed:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: runtime
  namespace: default
spec:
  template:
    spec:
      containers:
        - name: first-container
          image: # replace with your image
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              port: 8080
              path: "/health"
          livenessProbe:
            tcpSocket:
              port: 8080

Autoscaling defaults to the concurrency metric; configure the stable window, panic window, min/max/initial scale, and target utilization to match workload characteristics. Example:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/window: "50s"
        autoscaling.knative.dev/panic-window-percentage: "10.0"
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/max-scale: "3"
        autoscaling.knative.dev/initial-scale: "1" # must not exceed max-scale
        autoscaling.knative.dev/target: "1"
        autoscaling.knative.dev/target-utilization-percentage: "90"

For bursty traffic, mix ECS and ECI resources with a ResourcePolicy so that normal load runs on ECS and spikes spill over to ECI.
apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: xxx
  namespace: xxx
spec:
  selector:
    serving.knative.dev/service: helloworld-go # match Pods of this Knative Service
  strategy: prefer
  units:
    - resource: ecs
      max: 10
      nodeSelector:
        key2: value2
    - resource: ecs
      nodeSelector:
        key3: value3
    - resource: eci

Enable shared GPU scheduling by setting the aliyun.com/gpu-mem limit in the container spec:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      containerConcurrency: 1
      containers:
        - image: registry-vpc.cn-hangzhou.aliyuncs.com/demo-test/test:helloworld-go
          name: user-container
          ports:
            - containerPort: 6666
              name: http1
              protocol: TCP
          resources:
            limits:
              aliyun.com/gpu-mem: "3" # GiB of GPU memory

Observability is provided via Queue-Proxy logs and a built-in Prometheus dashboard. Enable request logging with a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  logging.enable-request-log: "true"
  logging.request-log-template: '{"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "revisionName": "{{.Revision.Name}}", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header \"X-B3-Traceid\"}}"}'

The dashboard shows request count, success rate, scaling trends, latency, concurrency, and resource usage, allowing you to fine-tune autoscaling and concurrency settings against real-world behavior.
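If a Prometheus Operator stack is available, the same queue-proxy metrics can also feed an external Prometheus. The sketch below is an assumption-laden starting point: the label selector must match your own Service, and the metrics port name (`http-metrics`, commonly port 9090 on queue-proxy) should be verified against the Pods in your Knative version.

```yaml
# Hedged sketch: scrape queue-proxy metrics with the Prometheus Operator.
# Verify the label selector and the metrics port name on your Pods first.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: knative-queue-proxy
spec:
  selector:
    matchLabels:
      serving.knative.dev/service: sd-backend # replace with your Service name
  podMetricsEndpoints:
    - port: http-metrics # assumed queue-proxy metrics port name (9090)
```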
In summary, combining Knative’s serverless capabilities with AI workloads yields fast, elastic, and cost‑effective model serving; following the above best practices ensures optimal resource utilization, smooth rollouts, and robust observability.