
How to Build a GPU Spot‑Pool Operator on Kubernetes with Kubebuilder

This guide walks through creating a Kubernetes Operator using Kubebuilder to manage a GPU spot‑pool on Tencent Cloud, covering CRD design, controller logic, code generation, and deployment steps, enabling automated scaling of GPU resources for AI workloads while illustrating core Cloud‑Native concepts.

Ops Development Stories

This article brings together Kubernetes, AI, and cloud infrastructure: it builds a custom Kubernetes Operator that manages a pool of cloud GPU instances for AI workloads.

Operator Overview

A Kubernetes Operator extends the platform with declarative APIs, allowing users to manage complex applications through custom resources. The Operator watches for changes and reconciles the actual state of the system toward the desired state declared in a custom resource, whose type is defined by a CustomResourceDefinition (CRD).
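To make the reconcile idea concrete, here is a minimal sketch in plain Go, independent of any Kubernetes library. The diff function and the instance names are made up for illustration; the only point is that a reconciler compares a desired set with an observed set and derives the actions that close the gap.

```go
package main

import (
	"fmt"
	"sort"
)

// diff compares the desired and actual sets of names and returns what a
// reconciler would need to create and delete to make them match.
// Both result slices are sorted so the output is deterministic.
func diff(desired, actual []string) (toCreate, toDelete []string) {
	want := make(map[string]bool, len(desired))
	for _, name := range desired {
		want[name] = true
	}
	have := make(map[string]bool, len(actual))
	for _, name := range actual {
		have[name] = true
	}
	for name := range want {
		if !have[name] {
			toCreate = append(toCreate, name)
		}
	}
	for name := range have {
		if !want[name] {
			toDelete = append(toDelete, name)
		}
	}
	sort.Strings(toCreate)
	sort.Strings(toDelete)
	return toCreate, toDelete
}

func main() {
	desired := []string{"gpu-a", "gpu-b", "gpu-c"}
	actual := []string{"gpu-b", "gpu-d"}
	toCreate, toDelete := diff(desired, actual)
	fmt.Println("create:", toCreate)
	fmt.Println("delete:", toDelete)
}
```

Running the comparison again once the two sets match yields no work, which is what makes a reconcile loop safe to repeat.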

CRD Design

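A CRD registers a new resource type with the API server. For the GPU spot‑pool, that type will describe the desired configuration of the pool, including minimum/maximum instance counts and cloud provider parameters. To show the general shape of a CRD first, here is the one used by the Zalando Postgres operator: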

# apiextensions.k8s.io/v1beta1 was removed in Kubernetes 1.22; v1 requires a
# schema and moves printer columns under each version (jsonPath, not JSONPath).
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: postgresqls.acid.zalan.do
spec:
  group: acid.zalan.do
  names:
    kind: postgresql
    listKind: postgresqlList
    plural: postgresqls
    singular: postgresql
  scope: Namespaced
  versions:
    - name: v1
      served: true
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          x-kubernetes-preserve-unknown-fields: true
      additionalPrinterColumns:
        - name: Team
          type: string
          description: Team responsible for Postgres Cluster
          jsonPath: .spec.teamId
        - name: Version
          type: string
          description: PostgreSQL version
          jsonPath: .spec.postgresql.version
        - name: Pods
          type: integer
          description: Number of Pods per Postgres cluster
          jsonPath: .spec.numberOfInstances
        - name: Volume
          type: string
          description: Size of the bound volume
          jsonPath: .spec.volume.size

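The CRD manifest consists of apiVersion, kind, metadata, and spec. apiVersion identifies the API group and version, kind specifies the resource type, metadata names the object, and spec holds the definition of the new type: its group, resource names, scope, and served versions.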
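As a small illustration of how apiVersion encodes both group and version, the sketch below splits the string at its last slash. splitAPIVersion is a hypothetical helper written for this article, not a function from the Kubernetes client libraries.

```go
package main

import (
	"fmt"
	"strings"
)

// splitAPIVersion splits an apiVersion string into its API group and version.
// Resources in the core group (for example Pods) use a bare version such as
// "v1", in which case the group is empty.
func splitAPIVersion(apiVersion string) (group, version string) {
	i := strings.LastIndex(apiVersion, "/")
	if i < 0 {
		return "", apiVersion
	}
	return apiVersion[:i], apiVersion[i+1:]
}

func main() {
	for _, av := range []string{"apiextensions.k8s.io/v1", "devops.jokerbai.com/v1", "v1"} {
		group, version := splitAPIVersion(av)
		fmt.Printf("apiVersion=%q group=%q version=%q\n", av, group, version)
	}
}
```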

Architecture

Operator Architecture Diagram

The GPU resource pool uses Tencent Cloud spot instances. The Operator runs inside the Kubernetes cluster, monitors the pool size, and automatically adds or removes instances to keep it between the configured minimum and maximum.

Prerequisites

A functional Kubernetes cluster (kind, kubeadm, etc.)

kubebuilder installed

Tencent Cloud API credentials (SecretId and SecretKey)

Quick Start

1. Design CRD

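A Spotpool resource carries the Tencent Cloud credentials and the pool bounds, along with fields such as region, instance type, subnet, VPC, security groups, image ID, and charge type. The sample keeps the credentials directly in the spec for simplicity, which means anyone who can read the resource can also read them.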

apiVersion: devops.jokerbai.com/v1
kind: Spotpool
metadata:
  name: spotpool-sample
spec:
  secretId: <your-secret-id>
  secretKey: <your-secret-key>
  region: ap-singapore
  availabilityZone: ap-singapore-2
  instanceType: GN7.2XLARGE32
  minimum: 2
  maximum: 2
  subnetId: DEFAULT
  vpcId: DEFAULT
  securityGroupIds:
    - sg-xxx
  imageId: img-xxx
  instanceChargeType: SPOTPAID
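The sample sets minimum and maximum to the same value, which pins the pool at two instances. With the reconcile logic shown later in this article, a minimum above the maximum would make the pool alternate between scaling up and scaling down, so a sanity check on these two fields is worth having. The following is a minimal sketch of such a check; poolBounds and validateBounds are hypothetical names, not part of the generated project.

```go
package main

import (
	"errors"
	"fmt"
)

// poolBounds is a hypothetical stand-in for the minimum/maximum spec fields.
type poolBounds struct {
	Minimum int32
	Maximum int32
}

// validateBounds rejects negative values and a minimum above the maximum.
func validateBounds(b poolBounds) error {
	if b.Minimum < 0 || b.Maximum < 0 {
		return errors.New("minimum and maximum must not be negative")
	}
	if b.Minimum > b.Maximum {
		return fmt.Errorf("minimum (%d) must not exceed maximum (%d)", b.Minimum, b.Maximum)
	}
	return nil
}

func main() {
	// Values from the sample manifest.
	fmt.Println(validateBounds(poolBounds{Minimum: 2, Maximum: 2}))
	// An inverted range that the check rejects.
	fmt.Println(validateBounds(poolBounds{Minimum: 3, Maximum: 1}))
}
```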

2. Initialize Project

mkdir spotpool && cd spotpool
kubebuilder init \
  --domain jokerbai.com \
  --repo github.com/joker-bai/spotpool \
  --project-name spotpool \
  --plugins go/v4 \
  --owner "Joker Bai"
kubebuilder create api --group devops.jokerbai.com --version v1 --kind Spotpool

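The generated directory structure includes api/v1, config, internal/controller, and other scaffolding files (the go/v4 layout places controllers under internal/controller rather than controllers). Kubebuilder builds the full API group as <group>.<domain>, so check the group in the generated CRD under config/crd/bases and make sure the apiVersion in your manifests matches it.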

3. CRD Development

(1) Define API

package v1

import (
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SpotpoolSpec defines the desired state of a Spotpool.
type SpotpoolSpec struct {
    // SecretId and SecretKey are the Tencent Cloud API credentials.
    SecretId  string `json:"secretId,omitempty"`
    SecretKey string `json:"secretKey,omitempty"`

    // Region and AvailabilityZone select where instances are created.
    Region           string `json:"region,omitempty"`
    AvailabilityZone string `json:"availabilityZone,omitempty"`
    InstanceType     string `json:"instanceType,omitempty"`

    // Minimum and Maximum bound the number of instances in the pool.
    Minimum int32 `json:"minimum,omitempty"`
    Maximum int32 `json:"maximum,omitempty"`

    // Network, image, and billing settings passed to the cloud API.
    SubnetId           string   `json:"subnetId,omitempty"`
    VpcId              string   `json:"vpcId,omitempty"`
    SecurityGroupIds   []string `json:"securityGroupIds,omitempty"`
    ImageId            string   `json:"imageId,omitempty"`
    InstanceChargeType string   `json:"instanceChargeType,omitempty"`
}

// SpotpoolStatus defines the observed state of a Spotpool.
type SpotpoolStatus struct {
    // Size reports the number of instances in the pool.
    Size       int32              `json:"size,omitempty"`
    Conditions []metav1.Condition `json:"conditions,omitempty"`
    Instances  []Instances        `json:"instances,omitempty"`
}

// Instances describes one cloud instance that belongs to the pool.
type Instances struct {
    InstanceId string `json:"instanceId,omitempty"`
    PublicIp   string `json:"publicIp,omitempty"`
}

//+kubebuilder:object:root=true
//+kubebuilder:subresource:status

// Spotpool is the Schema for the spotpools API.
type Spotpool struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   SpotpoolSpec   `json:"spec,omitempty"`
    Status SpotpoolStatus `json:"status,omitempty"`
}

//+kubebuilder:object:root=true

// SpotpoolList contains a list of Spotpool.
type SpotpoolList struct {
    metav1.TypeMeta `json:",inline"`
    metav1.ListMeta `json:"metadata,omitempty"`
    Items           []Spotpool `json:"items"`
}

func init() {
    // SchemeBuilder is declared in the generated groupversion_info.go.
    SchemeBuilder.Register(&Spotpool{}, &SpotpoolList{})
}
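The json tags decide how these Go fields appear in a manifest, and omitempty has a side effect worth knowing: a field holding its zero value is left out when the object is serialized. The sketch below shows this with the standard library only; miniSpec is a trimmed, hypothetical copy of SpotpoolSpec that reuses three of its tags.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// miniSpec is a trimmed, hypothetical copy of SpotpoolSpec with the same tags.
type miniSpec struct {
	Region  string `json:"region,omitempty"`
	Minimum int32  `json:"minimum,omitempty"`
	Maximum int32  `json:"maximum,omitempty"`
}

// encode marshals the spec to JSON and returns it as a string.
func encode(s miniSpec) string {
	out, err := json.Marshal(s)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	fmt.Println(encode(miniSpec{Region: "ap-singapore", Minimum: 2, Maximum: 2}))
	// Minimum is zero here, so omitempty drops the field from the output.
	fmt.Println(encode(miniSpec{Region: "ap-singapore", Minimum: 0, Maximum: 3}))
}
```

For this operator the effect is harmless, since a missing minimum decodes back to zero, but it explains why a field set to 0 may not show up when the resource is printed.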

4. Controller Development

(1) Reconcile Logic

The controller fetches the desired state, obtains the current running instances, and decides whether to scale up or down.

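// Imports used by the controller file. The API path assumes the --repo value
// passed to kubebuilder init above.
import (
    "context"
    "fmt"
    "time"

    "github.com/tencentcloud/tencentcloud-sdk-go/tencentcloud/common"
    cvm "github.com/tencentcloud/tencentcloud-sdk-go/tencentcloud/cvm/v20170312"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"
    logf "sigs.k8s.io/controller-runtime/pkg/log"

    devopsjokerbaicomv1 "github.com/joker-bai/spotpool/api/v1"
)

func (r *SpotpoolReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    log := logf.FromContext(ctx)

    // Fetch the desired state. Not found means the object was deleted, so stop
    // here instead of continuing with an empty spec.
    spotpool := &devopsjokerbaicomv1.Spotpool{}
    if err := r.Get(ctx, req.NamespacedName, spotpool); err != nil {
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }

    // Observe the current state in the cloud.
    runningVmList, err := r.getRunningInstanceIds(spotpool)
    if err != nil {
        log.Error(err, "get running vm instance failed")
        return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
    }
    runningCount := len(runningVmList)

    switch {
    case runningCount < int(spotpool.Spec.Minimum):
        // Below the minimum: create the missing instances.
        delta := spotpool.Spec.Minimum - int32(runningCount)
        log.Info("creating instances", "delta", delta)
        if err = r.runInstances(spotpool, delta); err != nil {
            log.Error(err, "unable to create instances")
            return ctrl.Result{RequeueAfter: 40 * time.Second}, nil
        }
    case runningCount > int(spotpool.Spec.Maximum):
        // Above the maximum: terminate the excess instances.
        delta := int32(runningCount) - spotpool.Spec.Maximum
        log.Info("terminating instances", "delta", delta)
        if err = r.terminateInstances(spotpool, delta); err != nil {
            log.Error(err, "unable to terminate instances")
            return ctrl.Result{RequeueAfter: 40 * time.Second}, nil
        }
    }

    // Requeue periodically so reclaimed spot instances are noticed and replaced.
    return ctrl.Result{RequeueAfter: 40 * time.Second}, nil
}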
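The scaling decision inside Reconcile can be isolated as a pure function, which makes it easy to test without a cluster or a cloud account. The sketch below is one way to write it; scaleDelta is a hypothetical helper, and the sign convention (positive to add, negative to remove) is an assumption made for this example.

```go
package main

import "fmt"

// scaleDelta returns how many instances to add (positive) or remove
// (negative) so that running ends up within [minimum, maximum].
// It returns 0 when the pool is already inside the bounds.
func scaleDelta(running, minimum, maximum int32) int32 {
	switch {
	case running < minimum:
		return minimum - running
	case running > maximum:
		return maximum - running
	default:
		return 0
	}
}

func main() {
	// Pool bounded to between 2 and 3 instances.
	for _, running := range []int32{0, 2, 3, 5} {
		fmt.Printf("running=%d delta=%+d\n", running, scaleDelta(running, 2, 3))
	}
}
```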

(2) Helper Methods

getRunningInstanceIds creates a Tencent Cloud SDK client, lists instances, filters by state, updates the CR status, and returns the IDs of running instances. The createCVMClient helper it calls is not listed in this article; it is assumed to build a CVM client from the credentials and region in the spec.

func (r *SpotpoolReconciler) getRunningInstanceIds(spotpool *devopsjokerbaicomv1.Spotpool) ([]string, error) {
    client, err := r.createCVMClient(spotpool.Spec)
    if err != nil {
        return nil, err
    }

    // No filters are set, so this returns instances in the region regardless
    // of who created them, and only the first page of results.
    request := cvm.NewDescribeInstancesRequest()
    response, err := client.DescribeInstances(request)
    if err != nil {
        return nil, err
    }

    var runningIDs []string
    var instances []devopsjokerbaicomv1.Instances
    for _, instance := range response.Response.InstanceSet {
        if instance.InstanceId == nil || instance.InstanceState == nil {
            continue
        }
        // Count instances that are running or on their way to running.
        switch *instance.InstanceState {
        case "RUNNING", "PENDING", "STARTING":
        default:
            continue
        }
        // Check the slice before indexing: an instance may have no public IP yet.
        publicIP := ""
        if len(instance.PublicIpAddresses) > 0 && instance.PublicIpAddresses[0] != nil {
            publicIP = *instance.PublicIpAddresses[0]
        }
        runningIDs = append(runningIDs, *instance.InstanceId)
        instances = append(instances, devopsjokerbaicomv1.Instances{InstanceId: *instance.InstanceId, PublicIp: publicIP})
    }

    // Record the observed pool in the status subresource.
    spotpool.Status.Size = int32(len(runningIDs))
    spotpool.Status.Instances = instances
    if err = r.Status().Update(context.Background(), spotpool); err != nil {
        return nil, err
    }
    return runningIDs, nil
}
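The filtering step is easy to get wrong because SDK records expose their fields as pointers and slices that may be nil or empty. The following stand-alone sketch shows the same filter with defensive checks; cloudInstance, poolMember, and activeMembers are hypothetical stand-ins so that the example runs without the Tencent Cloud SDK.

```go
package main

import "fmt"

// cloudInstance is a hypothetical stand-in for an SDK instance record whose
// fields are pointers, so any of them may be nil.
type cloudInstance struct {
	ID        *string
	State     *string
	PublicIPs []*string
}

// poolMember is the flattened form that would be stored in the status.
type poolMember struct {
	ID       string
	PublicIP string
}

// activeStates lists the states counted as part of the pool.
var activeStates = map[string]bool{"PENDING": true, "STARTING": true, "RUNNING": true}

// activeMembers keeps instances in an active state and reads the first
// public IP only when one exists, leaving it empty otherwise.
func activeMembers(instances []cloudInstance) []poolMember {
	var members []poolMember
	for _, inst := range instances {
		if inst.ID == nil || inst.State == nil || !activeStates[*inst.State] {
			continue
		}
		m := poolMember{ID: *inst.ID}
		if len(inst.PublicIPs) > 0 && inst.PublicIPs[0] != nil {
			m.PublicIP = *inst.PublicIPs[0]
		}
		members = append(members, m)
	}
	return members
}

// strPtr returns a pointer to s, mirroring the pointer helpers SDKs provide.
func strPtr(s string) *string { return &s }

func main() {
	instances := []cloudInstance{
		{ID: strPtr("ins-1"), State: strPtr("RUNNING"), PublicIPs: []*string{strPtr("203.0.113.10")}},
		{ID: strPtr("ins-2"), State: strPtr("PENDING")}, // no public IP yet
		{ID: strPtr("ins-3"), State: strPtr("STOPPED")}, // filtered out
	}
	for _, m := range activeMembers(instances) {
		fmt.Printf("%s %q\n", m.ID, m.PublicIP)
	}
}
```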

runInstances builds a request to create the required number of spot instances and updates the status after creation.

func (r *SpotpoolReconciler) runInstances(spotpool *devopsjokerbaicomv1.Spotpool, count int32) error {
    client, err := r.createCVMClient(spotpool.Spec)
    if err != nil {
        return err
    }

    // Build the request from the spec.
    request := cvm.NewRunInstancesRequest()
    request.ImageId = common.StringPtr(spotpool.Spec.ImageId)
    request.Placement = &cvm.Placement{Zone: common.StringPtr(spotpool.Spec.AvailabilityZone)}
    request.InstanceChargeType = common.StringPtr(spotpool.Spec.InstanceChargeType)
    request.InstanceCount = common.Int64Ptr(int64(count))
    // Name instances with a timestamp suffix, for example spotpool20060102150405.
    request.InstanceName = common.StringPtr("spotpool" + time.Now().Format("20060102150405"))
    request.InstanceType = common.StringPtr(spotpool.Spec.InstanceType)
    request.InternetAccessible = &cvm.InternetAccessible{InternetChargeType: common.StringPtr("BANDWIDTH_POSTPAID_BY_HOUR"), InternetMaxBandwidthOut: common.Int64Ptr(1), PublicIpAssigned: common.BoolPtr(true)}
    // Hardcoded for the demo only; do not ship a fixed password.
    request.LoginSettings = &cvm.LoginSettings{Password: common.StringPtr("Password123")}
    request.SecurityGroupIds = common.StringPtrs(spotpool.Spec.SecurityGroupIds)
    request.SystemDisk = &cvm.SystemDisk{DiskType: common.StringPtr("CLOUD_BSSD"), DiskSize: common.Int64Ptr(100)}
    request.VirtualPrivateCloud = &cvm.VirtualPrivateCloud{SubnetId: common.StringPtr(spotpool.Spec.SubnetId), VpcId: common.StringPtr(spotpool.Spec.VpcId)}

    response, err := client.RunInstances(request)
    if err != nil {
        return err
    }
    // InstanceIdSet holds pointers, so dereference each ID before printing.
    for _, id := range response.Response.InstanceIdSet {
        if id != nil {
            fmt.Println("run instances success", *id)
        }
    }

    // Refresh the status so it includes the new instances.
    _, err = r.getRunningInstanceIds(spotpool)
    return err
}
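The instance name relies on Go's reference-time layout, which can look like a magic number at first sight. The sketch below isolates that step; instanceName and sampleTime are hypothetical helpers added for illustration, and the fixed date is arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// instanceName builds a name from a prefix and a compact timestamp.
// "20060102150405" is Go's reference time (Jan 2 2006, 15:04:05) written as
// year, month, day, hour, minute, second with no separators.
func instanceName(prefix string, t time.Time) string {
	return prefix + t.Format("20060102150405")
}

// sampleTime returns a fixed, arbitrary timestamp so the output is repeatable.
func sampleTime() time.Time {
	return time.Date(2024, time.March, 9, 8, 5, 30, 0, time.UTC)
}

func main() {
	fmt.Println(instanceName("spotpool", sampleTime()))
	fmt.Println(instanceName("spotpool", time.Now()))
}
```

Because the timestamp has one-second resolution, every instance created by a single request shares the same name; the instance ID remains the unique identifier.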

terminateInstances selects the excess instances and calls the Tencent Cloud API to delete them, then refreshes the status.

func (r *SpotpoolReconciler) terminateInstances(spotpool *devopsjokerbaicomv1.Spotpool, count int32) error {
    client, err := r.createCVMClient(spotpool.Spec)
    if err != nil {
        return err
    }
    running, err := r.getRunningInstanceIds(spotpool)
    if err != nil {
        return err
    }

    // Clamp the count: the list may have shrunk since Reconcile computed it,
    // and slicing past the end would panic.
    n := int(count)
    if n > len(running) {
        n = len(running)
    }
    if n <= 0 {
        return nil
    }
    instancesIds := running[:n]

    request := cvm.NewTerminateInstancesRequest()
    request.InstanceIds = common.StringPtrs(instancesIds)
    _, err = client.TerminateInstances(request)
    if err != nil {
        return err
    }

    // Refresh the status so it no longer lists the terminated instances.
    _, err = r.getRunningInstanceIds(spotpool)
    return err
}
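Selecting which instances to terminate comes down to taking a prefix of the running list, and that slice expression is only safe when the count is within range. The sketch below shows the selection as a stand-alone function; pickForTermination is a hypothetical name, and taking instances from the front of the list is simply the policy this article's code uses, not a recommendation.

```go
package main

import "fmt"

// pickForTermination returns up to count IDs from the front of running.
// The count is clamped to [0, len(running)] so slicing never goes out of range.
func pickForTermination(running []string, count int) []string {
	if count < 0 {
		count = 0
	}
	if count > len(running) {
		count = len(running)
	}
	return running[:count]
}

func main() {
	running := []string{"ins-1", "ins-2", "ins-3"}
	fmt.Println(pickForTermination(running, 2))
	// Asking for more than exist returns the whole list instead of panicking.
	fmt.Println(pickForTermination(running, 5))
}
```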

Deploy and Test

Install the CRD:

make install

Run the controller locally:

make run

Create a Spotpool manifest (as shown above) and apply it:

kubectl apply -f config/samples/devops.jokerbai.com_v1_spotpool.yaml

Check the resource status:

kubectl get spotpool

Build and push the Docker image:

make docker-build docker-push IMG=<registry>/spotpool:v1

Deploy to the cluster:

make deploy IMG=<registry>/spotpool:v1

Clean up:

make undeploy
make uninstall

The tutorial demonstrates how to implement a declarative, Cloud‑Native solution that automatically scales GPU spot instances, improving the reliability and elasticity of AI training platforms.

Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
