
Deploy Stable Diffusion on Serverless Kubernetes (ASK) with Knative – A Complete Guide

This article explains how Shuhe Technology used Alibaba Cloud's Serverless Kubernetes (ASK) and Knative to deploy, scale, and monitor more than 500 AI model services, including Stable Diffusion, achieving up to 60% cost savings, scale‑to‑zero elasticity, and shorter deployment cycles.

Alibaba Cloud Native

Background and Challenges

Shuhe Technology provides AI model services on a cloud‑native platform. Rapid business growth exposed two major problems: the underlying compute resources could not auto‑scale with request volume, leading to waste and high maintenance costs, and the growing number of model services became difficult to manage.

Solution with Serverless Kubernetes (ASK)

To address these issues, Shuhe adopted Alibaba Cloud's ASK (Serverless Kubernetes) combined with Knative. ASK eliminates the need to manage K8s nodes, automatically creates Pods based on real‑time traffic, and scales them down to zero when idle, resulting in up to 60% cost reduction. ASK’s gray‑release and multi‑version capabilities also simplify model upgrades.

Benefits Achieved

More than 500 AI model services deployed, handling billions of queries daily.

Automatic capacity adjustment matches varying business loads, ensuring stable operation.

Deployment cycle shortened from one day to half a day, boosting development efficiency.

What is Serverless Kubernetes?

Serverless Kubernetes (ASK) provides three core advantages: no‑ops management, automatic elasticity, and pay‑as‑you‑go pricing. It abstracts away K8s complexities such as node lifecycle, networking, and version compatibility, allowing developers to focus on business logic.

Use Case: Deploying Stable Diffusion

Stable Diffusion, a popular generative‑AI model, faces two operational challenges on traditional K8s:

Limited per‑Pod throughput; concurrent requests can overload a single Pod.

Expensive GPU resources that need to be released during low‑traffic periods.

ASK, Knative, and MSE together address both challenges by providing precise concurrency‑based scaling, scale‑to‑zero capability, and multi‑version management.
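To make concurrency‑based scaling concrete, here is a minimal Python sketch of how a desired Pod count could be derived from the number of in‑flight requests. It borrows the values used in the manifest later in this guide (containerConcurrency 1, target utilization 100, maxScale 10). The function name desired_pods is hypothetical, and the formula is a simplified model: a real autoscaler averages metrics over time windows, and the exact algorithm behind the mpa autoscaling class is not described in this article.

```python
import math


def desired_pods(in_flight: int,
                 container_concurrency: int = 1,
                 target_utilization_pct: int = 100,
                 max_scale: int = 10) -> int:
    """Simplified sizing rule: enough Pods so that each one stays at or
    below its concurrency target, capped at max_scale.

    Illustrative assumption only, not the autoscaler's actual code.
    """
    if in_flight <= 0:
        return 0  # no traffic: scale to zero and release the GPU instances
    # Requests each Pod is expected to carry at the target utilization.
    per_pod_target = container_concurrency * target_utilization_pct / 100
    wanted = math.ceil(in_flight / per_pod_target)
    return min(wanted, max_scale)


if __name__ == "__main__":
    for load in (0, 1, 5, 25):
        print(f"in-flight={load:>2} -> pods={desired_pods(load)}")
```

With one request per Pod, five concurrent requests map to five Pods, while a burst of 25 is capped at the configured maximum of 10.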

Step‑by‑Step Deployment

1. Create the Knative service

In the cluster list, select the target cluster knative-sd-demo, navigate to Application > Knative, and use the template to create a service.

2. Apply the following YAML

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: knative-sd-demo
  annotations:
    serving.knative.dev.alibabacloud/affinity: "cookie"
    serving.knative.dev.alibabacloud/cookie-name: "sd"
    serving.knative.dev.alibabacloud/cookie-timeout: "1800"
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: mpa.autoscaling.knative.dev
        autoscaling.knative.dev/maxScale: '10'
        autoscaling.knative.dev/targetUtilizationPercentage: "100"
        k8s.aliyun.com/eci-use-specs: ecs.gn5-c4g1.xlarge,ecs.gn5i-c8g1.2xlarge,ecs.gn5-c8g1.2xlarge
    spec:
      containerConcurrency: 1
      containers:
      - args:
        - --listen
        - --skip-torch-cuda-test
        - --api
        command:
        - python3
        - launch.py
        image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion@sha256:64999ff1aba706f65a2234d861d46318f7d58e2790b31ace0d567a96e65b617c
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 7860
          name: http1
          protocol: TCP
        name: stable-diffusion
        readinessProbe:
          tcpSocket:
            port: 7860
          initialDelaySeconds: 5
          periodSeconds: 1
          failureThreshold: 3

Key parameters:

Cookie session affinity via serving.knative.dev.alibabacloud/affinity

GPU spec selection via k8s.aliyun.com/eci-use-specs

Concurrency control via containerConcurrency

3. Verify deployment

When the service status shows Success, Stable Diffusion is ready.
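The affinity annotations in the manifest above (cookie name sd, timeout 1800) suggest that requests carrying the same cookie keep reaching the same Pod for a limited time. The Python sketch below is a toy model of that idea, not the gateway's implementation: the StickyRouter class, the round‑robin choice for new sessions, and the decision to refresh the timeout on every request are all assumptions made for illustration, and the timeout is treated as seconds.

```python
import itertools
import time
from typing import Dict, List, Optional, Tuple


class StickyRouter:
    """Toy cookie-affinity router (hypothetical, for illustration only)."""

    def __init__(self, pods: List[str], timeout_s: float = 1800.0) -> None:
        self._next_pod = itertools.cycle(pods)  # round-robin for new sessions
        self._timeout_s = timeout_s
        # cookie value -> (assigned pod, time of last request)
        self._sessions: Dict[str, Tuple[str, float]] = {}

    def route(self, cookie: str, now: Optional[float] = None) -> str:
        """Return the Pod that should serve the request with this cookie."""
        now = time.monotonic() if now is None else now
        entry = self._sessions.get(cookie)
        if entry is not None and now - entry[1] < self._timeout_s:
            pod = entry[0]               # session still valid: stay on the same Pod
        else:
            pod = next(self._next_pod)   # new or expired session: pick a Pod
        self._sessions[cookie] = (pod, now)  # assumption: sliding timeout
        return pod


if __name__ == "__main__":
    router = StickyRouter(["pod-a", "pod-b", "pod-c"])
    print(router.route("user-1", now=0))     # first request gets a Pod
    print(router.route("user-1", now=100))   # same Pod while the session is valid
```

Stickiness matters for a UI such as Stable Diffusion because follow‑up requests from one browser session may depend on state held by the Pod that served the first request.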

Deploy Load‑Testing Service (portal‑server)

Use the following YAML to create a deployment that generates traffic against the Stable Diffusion service.

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: portal-server
  name: portal-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: portal-server
  template:
    metadata:
      labels:
        app: portal-server
    spec:
      serviceAccountName: portal-server
      containers:
        - name: portal-server
          image: registry-vpc.cn-beijing.aliyuncs.com/acs/sd-yunqi-server:v1.0.2
          env:
            - name: MAX_CONCURRENT_REQUESTS
              value: "5"
            - name: POD_NAMESPACE
              value: "default"
          readinessProbe:
            failureThreshold: 3
            periodSeconds: 1
            tcpSocket:
              port: 8080
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: internet
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: PayByCLCU
  name: portal-server
spec:
  externalTrafficPolicy: Local
  ports:
    - name: http-80
      port: 80
      targetPort: 8080
    - name: http-8888
      port: 8888
      targetPort: 8888
  selector:
    app: portal-server
  type: LoadBalancer

After creation, retrieve the external IP (e.g., 123.56.xx.xx) and access http://123.56.xx.xx to open the Stable Diffusion UI.

Set the concurrency to 5 and total requests to 20, then start the load test. The dashboard shows five Pods handling requests, each generating an image displayed on the page.
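The load pattern described above (20 requests in total, at most 5 in flight) can be sketched in Python as a bounded worker pool. This is a minimal illustration, not the portal‑server's code: run_load_test and fake_generate are hypothetical names, and fake_generate only sleeps in place of a real image‑generation request.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def run_load_test(send: Callable[[int], None],
                  total: int = 20,
                  concurrency: int = 5) -> Tuple[int, int]:
    """Issue `total` requests with at most `concurrency` in flight.

    Returns (finished_requests, peak_in_flight).
    """
    lock = threading.Lock()
    in_flight = 0
    peak = 0
    finished = 0

    def worker(i: int) -> None:
        nonlocal in_flight, peak, finished
        with lock:
            in_flight += 1
            peak = max(peak, in_flight)
        try:
            send(i)
        finally:
            with lock:
                in_flight -= 1
                finished += 1

    # The pool size is what bounds the number of concurrent requests.
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(worker, range(total)))
    return finished, peak


def fake_generate(i: int) -> None:
    """Stand-in for one image-generation request (assumption: sleep only)."""
    time.sleep(0.05)


if __name__ == "__main__":
    finished, peak = run_load_test(fake_generate)
    print(f"finished={finished}, peak in-flight={peak}")
```

Because containerConcurrency is 1 in the Knative service, a peak of five in‑flight requests is consistent with the five Pods shown on the dashboard.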

Observability

ASK’s built‑in monitoring provides a dashboard showing request volume, success rate, error codes, and auto‑scaling trends. The response‑time panel displays P50, P90, P95, and P99 latency metrics.
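For readers unfamiliar with percentile latency, the following Python sketch shows one common way to compute P50, P90, P95, and P99 from raw samples, using the nearest‑rank method. How the ASK dashboard actually aggregates its data is not stated in this article, so the method, the function names, and the synthetic samples are assumptions.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], p: int) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range 1..100")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples: Sequence[float]) -> Dict[str, float]:
    """Return the four percentiles named in the response-time panel."""
    return {f"P{p}": percentile(samples, p) for p in (50, 90, 95, 99)}


if __name__ == "__main__":
    # Synthetic latencies in milliseconds: 1, 2, ..., 100.
    latencies_ms = [float(v) for v in range(1, 101)]
    print(latency_summary(latencies_ms))
```

P99 is usually the most informative figure for GPU inference, since a few slow cold starts can dominate the tail while leaving P50 unchanged.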

Conclusion

By leveraging ASK’s Knative‑based precise concurrency scaling, scale‑to‑zero, and multi‑version management, enterprises can efficiently deploy AI services like Stable Diffusion. The hands‑on scenario is available on Alibaba Cloud’s developer portal for further experimentation.
