
Best Practices for Deploying AI Model Inference on Knative

This guide explains how to efficiently deploy AI model inference services on Knative by externalizing model data, using Fluid for accelerated loading, configuring secrets, ImageCache, graceful shutdown, probes, autoscaling parameters, mixed ECS/ECI resources, shared GPU scheduling, and observability features to achieve fast scaling, low cost, and high elasticity.

Alibaba Cloud Infrastructure

Knative combined with AI provides rapid deployment, high elasticity, and low‑cost advantages, especially for workloads that frequently change compute resources such as model inference. To maximize these benefits, avoid packaging the AI model inside the container image; instead store the model in external storage (OSS, NAS) and mount it via a PVC.
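For a model stored on OSS, one option is to mount the bucket directly with the ACK OSS CSI plugin. The sketch below shows a static PV/PVC pair; the resource names, bucket, path, and secret name are placeholders, and credentials are assumed to live in a pre-created secret:

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: model-oss-pv
spec:
  capacity:
    storage: 50Gi
  accessModes:
  - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: ossplugin.csi.alibabacloud.com
    volumeHandle: model-oss-pv # must match the PV name
    nodePublishSecretRef:
      name: oss-secret # secret holding the OSS AccessKey pair
      namespace: default
    volumeAttributes:
      bucket: "xxx" # replace with your bucket
      url: "oss-cn-beijing.aliyuncs.com" # replace with your endpoint
      path: "/models"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-oss-pvc
spec:
  accessModes:
  - ReadOnlyMany
  resources:
    requests:
      storage: 50Gi
  volumeName: model-oss-pv # bind to the PV above
```

A plain OSS mount like this works, but read throughput is limited; the Fluid approach below adds a distributed cache in front of OSS for much faster model loading.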

Use Fluid (JindoRuntime) to accelerate model loading. Example YAML for a secret and a Fluid Dataset:

apiVersion: v1
kind: Secret
metadata:
  name: access-key
stringData:
  fs.oss.accessKeyId: your_ak_id # replace with your OSS access key ID
  fs.oss.accessKeySecret: your_ak_skrt # replace with your OSS access key secret
---
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: oss-data
spec:
  mounts:
  - mountPoint: "oss://{Bucket}/{path-to-model}" # replace with your bucket and path
    name: xxx
    path: "{path-to-model}" # path used by the application
    options:
      fs.oss.endpoint: "oss-cn-beijing.aliyuncs.com" # replace with actual endpoint
    encryptOptions:
    - name: fs.oss.accessKeyId
      valueFrom:
        secretKeyRef:
          name: access-key
          key: fs.oss.accessKeyId
    - name: fs.oss.accessKeySecret
      valueFrom:
        secretKeyRef:
          name: access-key
          key: fs.oss.accessKeySecret
  accessModes:
  - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: oss-data
spec:
  replicas: 2
  tieredstore:
    levels:
    - mediumtype: SSD
      volumeType: emptyDir
      path: /mnt/ssd0/cache
      quota: 100Gi
      high: "0.95"
      low: "0.7"
  fuse:
    properties:
      fs.jindofsx.data.cache.enable: "true"
    args:
    - -okernel_cache
    - -oro
    - -oattr_timeout=7200
    - -oentry_timeout=7200
    - -ometrics_port=9089
    cleanPolicy: OnDemand
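To avoid paying the first-read penalty when the first inference Pod starts, the cache can be prewarmed with a Fluid DataLoad. A minimal sketch, assuming the Dataset above lives in the default namespace:

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: oss-data-warmup
spec:
  dataset:
    name: oss-data # the Dataset defined above
    namespace: default
  loadMetadata: true # also warm file metadata
  target:
  - path: "/" # prewarm the entire dataset
    replicas: 2 # load into both cache worker replicas
```

Once the DataLoad completes, model files are served from the JindoRuntime cache tier instead of OSS, so new Pods read the model at local-SSD speed.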

After declaring the Dataset and JindoRuntime, a PVC with the same name is created and can be mounted into a Knative Service:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sd-backend
spec:
  template:
    spec:
      containers:
      - image: image-name # replace with your image
        ports:
        - containerPort: xxx
          protocol: TCP
        volumeMounts:
        - mountPath: /data/models # path used by the program
          name: data-volume
      volumes:
      - name: data-volume
        persistentVolumeClaim:
          claimName: oss-data

Because AI containers are large (CUDA, PyTorch‑GPU, etc.), use ImageCache to speed up image pulling for ECI Pods:

apiVersion: eci.alibabacloud.com/v1
kind: ImageCache
metadata:
  name: imagecache-ai-model
  annotations:
    k8s.aliyun.com/eci-image-cache: "true" # enable cache reuse
spec:
  images:
  - xxx # replace with your container image
  imageCacheSize: 25 # GiB
  retentionDays: 7
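To have the ECI Pods created for a Knative revision pick up the cache, the same annotation can be set on the revision template. This is a sketch under the assumption that the annotation enables automatic matching of the newest usable image cache:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sd-backend
spec:
  template:
    metadata:
      annotations:
        k8s.aliyun.com/eci-image-cache: "true" # auto-match an existing image cache
```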

Graceful shutdown is essential: when a Pod receives SIGTERM, the container should mark itself not-ready and stop accepting new requests while in-flight requests complete. Configure timeoutSeconds to roughly 1.2× the longest expected request time (e.g., 6 seconds for a 5-second request):

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      timeoutSeconds: 6
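If the model server cannot flip its readiness on SIGTERM by itself, a preStop hook that sleeps briefly gives the endpoint time to be removed from load balancing before the process shuts down. A sketch; the image name and sleep duration are placeholders:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
spec:
  template:
    spec:
      timeoutSeconds: 6
      containers:
      - image: xxx # replace with your image
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5"] # let in-flight requests drain
```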

Knative probes differ from Kubernetes probes; they are more aggressive to reduce cold‑start latency. Define readiness and liveness probes in the Service spec as needed.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: runtime
  namespace: default
spec:
  template:
    spec:
      containers:
      - name: first-container
        image: xxx # replace with your image
        ports:
        - containerPort: 8080
        readinessProbe:
          httpGet:
            port: 8080
            path: "/health"
        livenessProbe:
          tcpSocket:
            port: 8080

Autoscaling defaults to the concurrency metric; configure the stable window, panic window, min/max/initial scale, and target utilization to match workload characteristics. The following example combines several of these annotations:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/metric: "concurrency"
        autoscaling.knative.dev/window: "50s"
        autoscaling.knative.dev/panic-window-percentage: "10.0"
        autoscaling.knative.dev/min-scale: "1"
        autoscaling.knative.dev/max-scale: "3"
        autoscaling.knative.dev/initial-scale: "5"
        autoscaling.knative.dev/target: "1" # annotation values must be quoted strings
        autoscaling.knative.dev/target-utilization-percentage: "90"
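Note that target and target-utilization-percentage interact: the autoscaler adds Pods when per-Pod load exceeds target × utilization. A small worked example with hypothetical values (not the ones above):

```yaml
metadata:
  annotations:
    # The autoscaler aims for 10 concurrent requests per Pod,
    # but begins adding Pods once per-Pod concurrency exceeds
    # 10 x 90% = 9, leaving headroom while new Pods start.
    autoscaling.knative.dev/target: "10"
    autoscaling.knative.dev/target-utilization-percentage: "90"
```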

For bursty traffic, mix ECS and ECI resources using a ResourcePolicy so that normal load runs on ECS and spikes are handled by ECI.

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: xxx
  namespace: xxx
spec:
  selector:
    serving.knative.dev/service: helloworld-go
  strategy: prefer
  units:
  - resource: ecs
    max: 10
    nodeSelector:
      key2: value2
  - resource: ecs
    nodeSelector:
      key3: value3
  - resource: eci

Enable shared GPU scheduling by setting the aliyun.com/gpu-mem limit in the container spec.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: helloworld-go
  namespace: default
spec:
  template:
    spec:
      containerConcurrency: 1
      containers:
      - image: registry-vpc.cn-hangzhou.aliyuncs.com/demo-test/test:helloworld-go
        name: user-container
        ports:
        - containerPort: 6666
          name: http1
          protocol: TCP
        resources:
          limits:
            aliyun.com/gpu-mem: "3"

Observability is provided via Queue‑Proxy logs and a built‑in Prometheus dashboard. Enable request logging with a ConfigMap:

apiVersion: v1
kind: ConfigMap
metadata:
  name: config-observability
  namespace: knative-serving
data:
  logging.enable-request-log: "true"
  logging.request-log-template: '{"httpRequest": {"requestMethod": "{{.Request.Method}}", "requestUrl": "{{js .Request.RequestURI}}", "requestSize": "{{.Request.ContentLength}}", "status": {{.Response.Code}}, "responseSize": "{{.Response.Size}}", "userAgent": "{{js .Request.UserAgent}}", "remoteIp": "{{js .Request.RemoteAddr}}", "serverIp": "{{.Revision.PodIP}}", "referer": "{{js .Request.Referer}}", "latency": "{{.Response.Latency}}s", "revisionName": "{{.Revision.Name}}", "protocol": "{{.Request.Proto}}"}, "traceId": "{{index .Request.Header \"X-B3-Traceid\"}}"}'

The dashboard shows request count, success rate, scaling trends, latency, concurrency, and resource usage, allowing fine‑tuning of autoscaling and concurrency settings based on real‑world performance.

In summary, combining Knative’s serverless capabilities with AI workloads yields fast, elastic, and cost‑effective model serving; following the above best practices ensures optimal resource utilization, smooth rollouts, and robust observability.
