How Kubernetes Powers Scalable AI: Building an End‑to‑End Machine Learning Platform
This article explores how Kubernetes, enhanced by KubeSphere and serverless technologies, enables efficient AI workloads through GPU virtualization, multi-cluster management, secure data sandboxes, automated testing, and scalable storage. It illustrates a complete lifecycle from data ingestion to model inference.
Artificial Intelligence and Kubernetes
Industry predictions for 2021 consistently listed tighter integration of AI with Kubernetes as a top trend. With its scalability, distributed architecture, and powerful scheduling, Kubernetes is a natural platform for machine-learning and deep-learning workloads.
Prophecis Architecture
The Prophecis platform from WeBank runs on top of Kubernetes, with a container-management layer (KubeSphere) providing storage, networking, service governance, CI/CD, and observability.
Missing Native Capabilities for AI Workloads
User management and multi‑tenant permissions
Multi‑cluster management
Graphical GPU workload scheduling
GPU monitoring
Training and inference log management
Kubernetes events and audit
Alerting and notifications
Kubernetes alone does not provide these enterprise‑grade features, which are essential for a production‑ready ML platform.
KubeSphere as an Enterprise‑Grade Extension
KubeSphere sits on top of Kubernetes and adds user management, multi‑cluster control, observability, application management, micro‑service governance, and CI/CD, effectively turning Kubernetes into a modern distributed operating system.
Building the Jizhan AI Platform
The platform offers end‑to‑end AI lifecycle management: data processing, model training, testing, and inference, with low‑code development, automated testing, intelligent scheduling, and resource monitoring to boost efficiency and reduce costs.
Challenges Before Refactoring
Low GPU utilization during development (average 50% waste).
High storage operational cost with Ceph.
Data‑set security for confidential data.
High manual effort for algorithm testing.
Solutions Implemented
Adopted KubeSphere to abstract Kubernetes complexities.
Replaced Ceph with QingStor NeonSAN (NVMe SSD + 25 GbE RDMA) achieving 5‑6× IOPS improvement.
Implemented a data‑security sandbox to isolate datasets while allowing algorithm training.
Developed EVSdk for unified algorithm packaging, input standardization, and automated testing.
GPU Virtualization
Used Tencent’s open‑source GPUManager to virtualize GPUs, limiting each container’s GPU usage with only ~5% performance overhead and enabling multiple containers to share a single GPU safely.
```yaml
resources:
  requests:
    nvidia.com/gpu: 2
    cpu: 8
    memory: 16Gi
  limits:
    nvidia.com/gpu: 2
    cpu: 8
    memory: 16Gi
```

Training Cluster Scheduling
Jobs are created with explicit GPU requests; combined with a message queue, the training cluster achieves near‑full GPU utilization.
Resource Monitoring
KubeSphere’s custom monitoring panels track CPU, GPU, and project‑level usage, allowing administrators to set quotas per project and per user.
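Per-project and per-user quotas of the kind described here map onto a standard Kubernetes ResourceQuota; the namespace and numbers below are illustrative, not taken from the platform:

```yaml
# Illustrative per-project quota: caps GPU, CPU, and memory requests
# for one team's namespace.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-team-quota
  namespace: ml-team-a             # hypothetical project namespace
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # at most 8 GPUs requested in this namespace
    requests.cpu: "64"
    requests.memory: 256Gi
```

Once the quota is in place, pods that would push the namespace past these limits are rejected at admission time, which keeps one project from starving the others.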
Secure Data Sandbox
The sandbox isolates external clusters from the internet, preventing data leakage while permitting controlled data transfer to developer environments via network policies.
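The network-policy side of such a sandbox can be sketched with a default-deny egress rule that whitelists only an approved transfer service; the namespace and label names below are hypothetical:

```yaml
# Sketch: default-deny egress for all sandbox pods, allowing only the
# in-cluster transfer gateway, so datasets cannot leave for the internet.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: sandbox-egress-lockdown
  namespace: data-sandbox          # hypothetical sandbox namespace
spec:
  podSelector: {}                  # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: transfer-gateway    # hypothetical approved transfer service
```

In practice a second egress rule permitting DNS (UDP/53 to kube-dns) is usually needed as well, and the cluster must run a CNI plugin that enforces NetworkPolicy.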
Automated Testing Framework
EVSdk defines a unified algorithm interface, standardizes inputs, and supports multiple model formats. Templates and routing paths extract specific fields (e.g., age) from JSON/XML outputs for comparison against ground‑truth annotations.
```yaml
route_path: $.people[0].age.value
```

Serverless for AI
AI workloads benefit from serverless by reducing data‑processing costs, triggering training jobs on events, serving models as functions, and handling inference results via event‑driven functions.
OpenFunction Overview
OpenFunction is an open‑source cloud‑native FaaS platform built on top of Kubernetes. It consists of Build (converts code to container images), Serving (scalable function execution), and Events (connects external event sources).
OpenFunction leverages Cloud Native Buildpacks, Dapr, Knative Serving, and KEDA to provide both synchronous and asynchronous function runtimes, with extensible event sources (Kafka, NATS, PubSub, S3, GitHub) and customizable event buses.
EventSource: integrates external event producers.
EventBus: pluggable message‑queue backbone.
Trigger: filters events and invokes functions.
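Putting the Build and Serving pieces together, a function is declared as a single custom resource. The sketch below follows the shape of OpenFunction's `Function` CRD (v1beta1); the name, image registry, and repository URL are placeholders, and field details may differ across OpenFunction releases:

```yaml
# Sketch of an OpenFunction Function: Build turns the source repo into a
# container image with Cloud Native Buildpacks; Serving runs it as a
# synchronous, Knative-backed function.
apiVersion: core.openfunction.io/v1beta1
kind: Function
metadata:
  name: inference-handler                       # hypothetical function name
spec:
  version: "v1.0.0"
  image: example-registry/inference-handler:v1  # placeholder image reference
  build:
    builder: openfunction/builder-go:latest     # Buildpacks builder image
    srcRepo:
      url: "https://github.com/example/inference-handler"  # placeholder repo
  serving:
    runtime: knative    # synchronous runtime; async functions use KEDA + Dapr
```

For an inference endpoint this pattern gives scale-to-zero between requests, which is exactly the cost profile the serverless section above argues for.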
Future Outlook
The roadmap includes tighter GPU scheduling support in KubeSphere v3.2, a plug‑in architecture in v4.0, and industry‑specific low‑code suites that let end users adapt algorithms to their own data without writing code.
Qingyun Technology Community
Official account of the Qingyun Technology Community, focusing on tech innovation, supporting developers, and sharing knowledge. Born to Learn and Share!
