Deploy Stable Diffusion on Serverless Kubernetes (ASK) with Knative – A Complete Guide
This article explains how Shuhe Technology used Alibaba Cloud's Serverless Kubernetes (ASK) and Knative to deploy, scale, and monitor over 500 AI model services, including Stable Diffusion, achieving up to 60% cost savings, scale-to-zero elasticity, and faster deployment cycles.
Background and Challenges
Shuhe Technology provides AI model services on a cloud‑native platform. Rapid business growth exposed two major problems: the underlying compute resources could not auto‑scale with request volume, leading to waste and high maintenance costs, and the growing number of model services became difficult to manage.
Solution with Serverless Kubernetes (ASK)
To address these issues, Shuhe adopted Alibaba Cloud's ASK (Serverless Kubernetes) combined with Knative. ASK eliminates the need to manage K8s nodes, automatically creates Pods based on real‑time traffic, and scales them down to zero when idle, resulting in up to 60% cost reduction. ASK’s gray‑release and multi‑version capabilities also simplify model upgrades.
Benefits Achieved
More than 500 AI model services deployed, handling billions of queries daily.
Automatic capacity adjustment matches varying business loads, ensuring stable operation.
Deployment cycle shortened from one day to half a day, boosting development efficiency.
What is Serverless Kubernetes?
Serverless Kubernetes (ASK) provides three core advantages: no-ops management, automatic elasticity, and pay-as-you-go pricing. It abstracts away K8s complexities such as node lifecycle, networking, and version compatibility, allowing developers to focus on business logic.
Use Case: Deploying Stable Diffusion
Stable Diffusion, a popular generative‑AI model, faces two operational challenges on traditional K8s:
Limited per‑Pod throughput; concurrent requests can overload a single Pod.
Expensive GPU resources that need to be released during low‑traffic periods.
The combination of ASK, Knative, and MSE addresses both challenges by providing precise concurrency-based scaling, scale-to-zero capability, and multi-version management.
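To make concurrency-based scaling concrete, the sketch below estimates how many Pods a given number of in-flight requests would call for. It is a simplified model under stated assumptions, not the actual autoscaler logic used by ASK or Knative: the function name `desired_pods` is hypothetical, and its parameters only mirror the Knative settings `containerConcurrency`, `targetUtilizationPercentage`, and `maxScale`.

```python
def desired_pods(in_flight, container_concurrency=1,
                 target_utilization_pct=100, max_scale=10):
    """Estimate the Pod count that keeps each Pod at or below its target.

    Simplified illustration only (assumption): real autoscalers also
    average over time windows and apply scale-down delays.
    """
    if in_flight <= 0:
        return 0  # no traffic: scale to zero and release the GPUs

    # Each Pod should carry container_concurrency * pct / 100 requests.
    # Integer ceiling division avoids floating-point rounding surprises.
    denominator = container_concurrency * target_utilization_pct
    wanted = (in_flight * 100 + denominator - 1) // denominator

    return min(wanted, max_scale)  # never exceed the configured ceiling


if __name__ == "__main__":
    for load in (0, 1, 5, 25):
        print(f"{load} in-flight requests -> {desired_pods(load)} Pods")
```

With one request per Pod and a 100% utilization target, the Pod count simply tracks the number of concurrent requests until it reaches the cap, and drops to zero when traffic stops.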
Step‑by‑Step Deployment
1. Create the Knative service
In the cluster list, select the target cluster knative-sd-demo, navigate to Application > Knative, and use the template to create a service.
2. Apply the following YAML
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: knative-sd-demo
  annotations:
    serving.knative.dev.alibabacloud/affinity: "cookie"
    serving.knative.dev.alibabacloud/cookie-name: "sd"
    serving.knative.dev.alibabacloud/cookie-timeout: "1800"
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: mpa.autoscaling.knative.dev
        autoscaling.knative.dev/maxScale: '10'
        autoscaling.knative.dev/targetUtilizationPercentage: "100"
        k8s.aliyun.com/eci-use-specs: ecs.gn5-c4g1.xlarge,ecs.gn5i-c8g1.2xlarge,ecs.gn5-c8g1.2xlarge
    spec:
      containerConcurrency: 1
      containers:
      - args:
        - --listen
        - --skip-torch-cuda-test
        - --api
        command:
        - python3
        - launch.py
        image: yunqi-registry.cn-shanghai.cr.aliyuncs.com/lab/stable-diffusion@sha256:64999ff1aba706f65a2234d861d46318f7d58e2790b31ace0d567a96e65b617c
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 7860
          name: http1
          protocol: TCP
        name: stable-diffusion
        readinessProbe:
          tcpSocket:
            port: 7860
          initialDelaySeconds: 5
          periodSeconds: 1
          failureThreshold: 3

Key parameters:
Cookie session affinity via serving.knative.dev.alibabacloud/affinity
GPU spec selection via k8s.aliyun.com/eci-use-specs
Concurrency control via containerConcurrency
3. Verify deployment
When the service status shows Success, Stable Diffusion is ready.
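The cookie annotations in the manifest ask the gateway to keep a client on the same Pod for the lifetime of its session cookie. The toy model below illustrates that idea only. It does not reproduce the gateway's real behavior: the class `CookieAffinityRouter`, the round-robin choice for new sessions, and the fixed (non-sliding) expiry are all assumptions made for the example, with the 1800-second default taken from the `cookie-timeout` annotation.

```python
import itertools
import time


class CookieAffinityRouter:
    """Toy model of cookie-based session affinity (hypothetical, for illustration)."""

    def __init__(self, pods, timeout_s=1800, clock=time.monotonic):
        self._pods = itertools.cycle(pods)  # round-robin for new sessions (assumption)
        self._timeout_s = timeout_s
        self._clock = clock                 # injectable so expiry can be simulated
        self._sessions = {}                 # cookie value -> (pod, expiry time)

    def route(self, cookie):
        """Return the Pod that should serve the request carrying this cookie."""
        now = self._clock()
        entry = self._sessions.get(cookie)
        if entry is not None and entry[1] > now:
            return entry[0]                 # live session: stay on the same Pod
        pod = next(self._pods)              # new or expired session: pick a Pod
        self._sessions[cookie] = (pod, now + self._timeout_s)
        return pod


if __name__ == "__main__":
    router = CookieAffinityRouter(["pod-a", "pod-b", "pod-c"])
    print(router.route("user-1"), router.route("user-2"), router.route("user-1"))
```

Affinity of this kind can matter for a stateful UI such as Stable Diffusion's, where follow-up requests may depend on state held by the Pod that served the first one.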
Deploy Load‑Testing Service (portal‑server)
Use the following YAML to create a deployment that generates traffic against the Stable Diffusion service.
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: portal-server
  name: portal-server
spec:
  replicas: 1
  selector:
    matchLabels:
      app: portal-server
  template:
    metadata:
      labels:
        app: portal-server
    spec:
      serviceAccountName: portal-server
      containers:
      - name: portal-server
        image: registry-vpc.cn-beijing.aliyuncs.com/acs/sd-yunqi-server:v1.0.2
        env:
        - name: MAX_CONCURRENT_REQUESTS
          value: "5"
        - name: POD_NAMESPACE
          value: "default"
        readinessProbe:
          failureThreshold: 3
          periodSeconds: 1
          tcpSocket:
            port: 8080
---
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-address-type: internet
    service.beta.kubernetes.io/alibaba-cloud-loadbalancer-instance-charge-type: PayByCLCU
  name: portal-server
spec:
  externalTrafficPolicy: Local
  ports:
  - name: http-80
    port: 80
    targetPort: 8080
  - name: http-8888
    port: 8888
    targetPort: 8888
  selector:
    app: portal-server
  type: LoadBalancer

After creation, retrieve the external IP (e.g., 123.56.xx.xx) and access http://123.56.xx.xx to open the Stable Diffusion UI.
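One way to retrieve the external IP outside the console is to read it from the Service object, for example from the output of `kubectl get service portal-server -o json`. The sketch below parses such JSON and returns the first address under `status.loadBalancer.ingress`. The helper name `external_ip` is hypothetical, and the sample document uses a placeholder address from a documentation range, not a real endpoint.

```python
import json


def external_ip(service_json):
    """Return the first load-balancer IP of a Service, or None if not yet assigned."""
    service = json.loads(service_json)
    ingress = service.get("status", {}).get("loadBalancer", {}).get("ingress") or []
    for entry in ingress:
        if entry.get("ip"):
            return entry["ip"]
    return None  # the load balancer may still be provisioning


if __name__ == "__main__":
    # Placeholder Service document (assumption): only the fields read above.
    sample = json.dumps({
        "kind": "Service",
        "metadata": {"name": "portal-server"},
        "status": {"loadBalancer": {"ingress": [{"ip": "203.0.113.10"}]}},
    })
    print(external_ip(sample))
```

A `None` result usually means the load balancer has not finished provisioning, so retrying after a short wait is a reasonable choice.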
Set the concurrency to 5 and total requests to 20, then start the load test. The dashboard shows five Pods handling requests, each generating an image displayed on the page.
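The load pattern described here, 20 requests with at most 5 in flight, can be sketched with a bounded thread pool. This is not the portal-server implementation, which the article does not show. The functions `run_load_test` and `make_fake_sender` are hypothetical, and the sender is a stand-in that sleeps instead of calling the Stable Diffusion service, so the example runs without a cluster.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_load_test(send_request, total_requests=20, concurrency=5):
    """Issue total_requests calls with at most `concurrency` running at once."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        # pool.map returns results in submission order.
        return list(pool.map(send_request, range(total_requests)))


def make_fake_sender(delay_s=0.01):
    """Build a stand-in for an HTTP call that records peak in-flight requests."""
    lock = threading.Lock()
    state = {"in_flight": 0, "peak": 0}

    def send(request_id):
        with lock:
            state["in_flight"] += 1
            state["peak"] = max(state["peak"], state["in_flight"])
        time.sleep(delay_s)  # simulated image-generation latency
        with lock:
            state["in_flight"] -= 1
        return request_id

    return send, state


if __name__ == "__main__":
    send, state = make_fake_sender()
    results = run_load_test(send, total_requests=20, concurrency=5)
    print(len(results), "requests completed; peak in flight:", state["peak"])
```

Because the service is configured with containerConcurrency: 1, a sustained client-side concurrency of 5 is what leads the autoscaler to run five Pods.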
Observability
ASK’s built‑in monitoring provides a dashboard showing request volume, success rate, error codes, and auto‑scaling trends. The response‑time panel displays P50, P90, P95, and P99 latency metrics.
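For readers less familiar with the latency panel, the sketch below shows one common way to compute P50, P90, P95, and P99 from raw response times, the nearest-rank method. How the ASK dashboard aggregates its metrics is not described in the source, so this is only an illustration. The function name `percentile` and the sample latencies are made up for the example.

```python
def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% at or below it.

    `p` must be an integer from 1 to 100.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    if not isinstance(p, int) or not 1 <= p <= 100:
        raise ValueError("p must be an integer between 1 and 100")
    ordered = sorted(samples)
    rank = (p * len(ordered) + 99) // 100  # integer ceiling of p/100 * n
    return ordered[rank - 1]


if __name__ == "__main__":
    # Illustrative response times in milliseconds (made-up data).
    latencies_ms = [820, 790, 1500, 910, 860, 4000, 880, 930, 870, 845]
    for p in (50, 90, 95, 99):
        print(f"P{p}: {percentile(latencies_ms, p)} ms")
```

High percentiles such as P99 expose slow outliers, for example a request that waited for a new GPU Pod to start, which an average would hide.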
Conclusion
By using ASK's Knative-based precise concurrency scaling, scale-to-zero capability, and multi-version management, enterprises can efficiently deploy AI services like Stable Diffusion. The hands-on scenario is available on Alibaba Cloud's developer portal for further experimentation.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
