How to Build a GPU Spot‑Pool Operator on Kubernetes with Kubebuilder
This guide walks through building a Kubernetes Operator with Kubebuilder that manages a pool of GPU spot instances on Tencent Cloud. It covers CRD design, controller logic, code generation, and deployment, and illustrates core Cloud-Native concepts along the way: combining Kubernetes, AI workloads, and cloud APIs into a custom Operator that automatically scales a GPU resource pool.
Operator Overview
A Kubernetes Operator extends the platform with declarative APIs, allowing users to manage complex applications through custom resources. The Operator watches events and reconciles the actual state to match the desired state defined in a CustomResourceDefinition (CRD).
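The observe/diff/act cycle behind every Operator can be illustrated with a toy example (illustrative only; a real controller uses controller-runtime, and `state` here is a made-up type):

```go
package main

import "fmt"

// state is a toy stand-in for a resource's replica count.
type state struct{ replicas int }

// reconcile drives observed state toward desired state, returning
// the actions a real controller would perform against the cloud API.
func reconcile(desired, observed state) (actions []string) {
	for observed.replicas < desired.replicas {
		actions = append(actions, "create instance")
		observed.replicas++
	}
	for observed.replicas > desired.replicas {
		actions = append(actions, "delete instance")
		observed.replicas--
	}
	return actions
}

func main() {
	// Desired 3 replicas, observed 1: the loop emits two create actions.
	fmt.Println(reconcile(state{replicas: 3}, state{replicas: 1}))
}
```

The key property is convergence: the loop is re-run on every event (and on a timer), so transient failures are retried until actual state matches desired state.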
CRD Design
A CRD declares a new resource type to the API server. Before designing our own, here is a well-known example, the PostgreSQL CRD from the Zalando postgres-operator (note that apiextensions.k8s.io/v1beta1 was removed in Kubernetes 1.22; current clusters use apiextensions.k8s.io/v1, where printer columns use a lowercase jsonPath and live under each version):

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: postgresqls.acid.zalan.do
spec:
  group: acid.zalan.do
  names:
    kind: postgresql
    listKind: postgresqlList
    plural: postgresqls
    singular: postgresql
  scope: Namespaced
  versions:
  - name: v1
    served: true
    storage: true
  additionalPrinterColumns:
  - name: Team
    type: string
    description: Team responsible for Postgres Cluster
    JSONPath: .spec.teamId
  - name: Version
    type: string
    description: PostgreSQL version
    JSONPath: .spec.postgresql.version
  - name: Pods
    type: integer
    description: Number of Pods per Postgres cluster
    JSONPath: .spec.numberOfInstances
  - name: Volume
    type: string
    description: Size of the bound volume
    JSONPath: .spec.volume.size

A CRD consists of apiVersion, kind, metadata, and spec: apiVersion identifies the API group and version, while kind specifies the resource type. Our own CRD will describe the desired configuration of the GPU spot-pool, including minimum/maximum instance counts and cloud provider parameters.
Architecture
The GPU resource pool uses Tencent Cloud spot instances. The Operator runs inside the Kubernetes cluster, monitors the pool size, and automatically adds or removes instances to match the desired count.
Prerequisites
A functional Kubernetes cluster (kind, kubeadm, etc.)
kubebuilder installed
Tencent Cloud Access Key (AK) and Secret Key
Quick Start
1. Design CRD
The CRD for the spot-pool includes fields such as region, instance type, subnet, VPC, security groups, image ID, and charge type. (For simplicity the cloud credentials are placed directly in the spec; in production they belong in a Kubernetes Secret.)
apiVersion: devops.jokerbai.com/v1
kind: Spotpool
metadata:
  name: spotpool-sample
spec:
  secretId: <your-secret-id>
  secretKey: <your-secret-key>
  region: ap-singapore
  availabilityZone: ap-singapore-2
  instanceType: GN7.2XLARGE32
  minimum: 2
  maximum: 2
  subnetId: DEFAULT
  vpcId: DEFAULT
  securityGroupIds:
  - sg-xxx
  imageId: img-xxx
  instanceChargeType: SPOTPAID

2. Initialize Project
mkdir spotpool && cd spotpool
kubebuilder init \
--domain jokerbai.com \
--repo github.com/joker-bai/spotpool \
--project-name spotpool \
--plugins go/v4 \
--owner "Joker Bai"
kubebuilder create api --group devops --version v1 --kind Spotpool

(Pass the group without the domain; kubebuilder appends the domain given at init time, yielding the full group devops.jokerbai.com.) The generated directory structure includes api/v1, config, the controller scaffolding, and other supporting files.
3. CRD Development
(1) Define API
package v1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

type SpotpoolSpec struct {
	SecretId           string   `json:"secretId,omitempty"`
	SecretKey          string   `json:"secretKey,omitempty"`
	Region             string   `json:"region,omitempty"`
	AvailabilityZone   string   `json:"availabilityZone,omitempty"`
	InstanceType       string   `json:"instanceType,omitempty"`
	Minimum            int32    `json:"minimum,omitempty"`
	Maximum            int32    `json:"maximum,omitempty"`
	SubnetId           string   `json:"subnetId,omitempty"`
	VpcId              string   `json:"vpcId,omitempty"`
	SecurityGroupIds   []string `json:"securityGroupIds,omitempty"`
	ImageId            string   `json:"imageId,omitempty"`
	InstanceChargeType string   `json:"instanceChargeType,omitempty"`
}

type SpotpoolStatus struct {
	Size       int32              `json:"size,omitempty"`
	Conditions []metav1.Condition `json:"conditions,omitempty"`
	Instances  []Instances        `json:"instances,omitempty"`
}

type Instances struct {
	InstanceId string `json:"instanceId,omitempty"`
	PublicIp   string `json:"publicIp,omitempty"`
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status

type Spotpool struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              SpotpoolSpec   `json:"spec,omitempty"`
	Status            SpotpoolStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

type SpotpoolList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []Spotpool `json:"items"`
}

func init() {
	SchemeBuilder.Register(&Spotpool{}, &SpotpoolList{})
}

After editing the types, run make generate to refresh the generated deepcopy code and make manifests to regenerate the CRD YAML under config/crd.

4. Controller Development
(1) Reconcile Logic
The controller fetches the desired state, obtains the current running instances, and decides whether to scale up or down.
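Stripped of the cloud calls, the scaling decision reduces to a pure function of the running count and the pool bounds (a sketch for clarity; `scaleDelta` is not part of the generated project):

```go
package main

import "fmt"

// scaleDelta returns how many instances to create (positive) or
// terminate (negative) for a pool bounded by [min, max].
// It mirrors the switch in the Reconcile method.
func scaleDelta(running, min, max int32) int32 {
	switch {
	case running < min:
		return min - running // scale up to the minimum
	case running > max:
		return -(running - max) // scale down to the maximum
	default:
		return 0 // within bounds, nothing to do
	}
}

func main() {
	fmt.Println(scaleDelta(0, 2, 2)) // 2: create two instances
	fmt.Println(scaleDelta(3, 2, 2)) // -1: terminate one instance
	fmt.Println(scaleDelta(2, 2, 4)) // 0: within bounds
}
```

Keeping this decision logic pure makes it trivially unit-testable, independent of the Tencent Cloud SDK.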
func (r *SpotpoolReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	log := logf.FromContext(ctx)
	spotpool := &devopsjokerbaicomv1.Spotpool{}
	if err := r.Get(ctx, req.NamespacedName, spotpool); err != nil {
		// The object may already have been deleted; ignore not-found
		// errors and return anything else so the request is retried.
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	runningVmList, err := r.getRunningInstanceIds(spotpool)
	if err != nil {
		log.Error(err, "get running vm instances failed")
		return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
	}
	runningCount := len(runningVmList)
	switch {
	case runningCount < int(spotpool.Spec.Minimum):
		delta := spotpool.Spec.Minimum - int32(runningCount)
		log.Info("creating instances", "delta", delta)
		if err = r.runInstances(spotpool, delta); err != nil {
			log.Error(err, "unable to create instances")
			return ctrl.Result{RequeueAfter: 40 * time.Second}, nil
		}
	case runningCount > int(spotpool.Spec.Maximum):
		delta := int32(runningCount) - spotpool.Spec.Maximum
		log.Info("terminating instances", "delta", delta)
		if err = r.terminateInstances(spotpool, delta); err != nil {
			log.Error(err, "unable to terminate instances")
			return ctrl.Result{RequeueAfter: 40 * time.Second}, nil
		}
	}
	return ctrl.Result{RequeueAfter: 40 * time.Second}, nil
}

(2) Helper Methods
getRunningInstanceIds creates a Tencent Cloud SDK client, lists instances, filters by state, updates the CR status, and returns the IDs of running instances.
func (r *SpotpoolReconciler) getRunningInstanceIds(spotpool *devopsjokerbaicomv1.Spotpool) ([]string, error) {
	client, err := r.createCVMClient(spotpool.Spec)
	if err != nil {
		return nil, err
	}
	request := cvm.NewDescribeInstancesRequest()
	response, err := client.DescribeInstances(request)
	if err != nil {
		return nil, err
	}
	var runningIDs []string
	var instances []devopsjokerbaicomv1.Instances
	for _, instance := range response.Response.InstanceSet {
		if *instance.InstanceState == "RUNNING" || *instance.InstanceState == "PENDING" || *instance.InstanceState == "STARTING" {
			// Guard before dereferencing the first public IP.
			if len(instance.PublicIpAddresses) == 0 {
				return nil, fmt.Errorf("instance %s does not have a public ip", *instance.InstanceId)
			}
			runningIDs = append(runningIDs, *instance.InstanceId)
			instances = append(instances, devopsjokerbaicomv1.Instances{
				InstanceId: *instance.InstanceId,
				PublicIp:   *instance.PublicIpAddresses[0],
			})
		}
	}
	spotpool.Status.Instances = instances
	if err = r.Status().Update(context.Background(), spotpool); err != nil {
		return nil, err
	}
	return runningIDs, nil
}

runInstances builds a request to create the required number of spot instances and updates the status after creation.
func (r *SpotpoolReconciler) runInstances(spotpool *devopsjokerbaicomv1.Spotpool, count int32) error {
	client, err := r.createCVMClient(spotpool.Spec)
	if err != nil {
		return err
	}
	request := cvm.NewRunInstancesRequest()
	request.ImageId = common.StringPtr(spotpool.Spec.ImageId)
	request.Placement = &cvm.Placement{Zone: common.StringPtr(spotpool.Spec.AvailabilityZone)}
	request.InstanceChargeType = common.StringPtr(spotpool.Spec.InstanceChargeType)
	request.InstanceCount = common.Int64Ptr(int64(count))
	request.InstanceName = common.StringPtr("spotpool" + time.Now().Format("20060102150405"))
	request.InstanceType = common.StringPtr(spotpool.Spec.InstanceType)
	request.InternetAccessible = &cvm.InternetAccessible{
		InternetChargeType:      common.StringPtr("BANDWIDTH_POSTPAID_BY_HOUR"),
		InternetMaxBandwidthOut: common.Int64Ptr(1),
		PublicIpAssigned:        common.BoolPtr(true),
	}
	// Hard-coded for the demo; generate a password or use key pairs in production.
	request.LoginSettings = &cvm.LoginSettings{Password: common.StringPtr("Password123")}
	request.SecurityGroupIds = common.StringPtrs(spotpool.Spec.SecurityGroupIds)
	request.SystemDisk = &cvm.SystemDisk{DiskType: common.StringPtr("CLOUD_BSSD"), DiskSize: common.Int64Ptr(100)}
	request.VirtualPrivateCloud = &cvm.VirtualPrivateCloud{
		SubnetId: common.StringPtr(spotpool.Spec.SubnetId),
		VpcId:    common.StringPtr(spotpool.Spec.VpcId),
	}
	response, err := client.RunInstances(request)
	if err != nil {
		return err
	}
	fmt.Println("run instances success", response.Response.InstanceIdSet)
	_, err = r.getRunningInstanceIds(spotpool)
	return err
}

terminateInstances selects the excess instances and calls the Tencent Cloud API to delete them, then refreshes the status.
func (r *SpotpoolReconciler) terminateInstances(spotpool *devopsjokerbaicomv1.Spotpool, count int32) error {
	client, err := r.createCVMClient(spotpool.Spec)
	if err != nil {
		return err
	}
	running, err := r.getRunningInstanceIds(spotpool)
	if err != nil {
		return err
	}
	instanceIds := running[:count]
	request := cvm.NewTerminateInstancesRequest()
	request.InstanceIds = common.StringPtrs(instanceIds)
	if _, err = client.TerminateInstances(request); err != nil {
		return err
	}
	_, err = r.getRunningInstanceIds(spotpool)
	return err
}

Deploy and Test
Install the CRD: make install

Run the controller locally: make run

Create a Spotpool manifest (as shown above) and apply it:

kubectl apply -f config/samples/devops.jokerbai.com_v1_spotpool.yaml

Check the resource status: kubectl get spotpool

Build and push the Docker image:

make docker-build docker-push IMG=<registry>/spotpool:v1

Deploy to the cluster: make deploy IMG=<registry>/spotpool:v1

Cleanup: make undeploy and make uninstall

The tutorial demonstrates how to implement a declarative, Cloud-Native solution that automatically scales GPU spot instances, improving the reliability and elasticity of AI training platforms.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.