
Design and Implementation of the kjob Asynchronous Task Scheduling Platform on Kubernetes

The 37Game team built the cloud‑native kjob platform to replace VM‑based schedulers, providing a unified, highly available Kubernetes solution that manages both CronJob‑style scheduled tasks and long‑running Deployments through a backend‑agent architecture, offering CRUD operations, rich configuration, real‑time monitoring, alerting, and seamless migration.

37 Interactive Technology Team

The 37Game technical team migrated all ToC services from cloud VMs to a Kubernetes cluster and needed to bring backend asynchronous tasks into the same environment. The main challenge was how to migrate and manage these asynchronous jobs (both long‑running and scheduled) on Kubernetes.

Requirements for the task management platform:

Schedule tasks on Kubernetes with high stability and availability; the platform should keep tasks running even if the platform itself fails.

Provide basic CRUD operations (create, start, stop, update, delete) with custom configuration capabilities.

Manage business attributes of tasks such as description, owner, and permission control.

Offer observation data, monitoring and alerting for task execution status.

After clarifying the requirements, the team evaluated two existing products: a public cloud provider's task scheduler and an internal product that only runs on VMs. Neither satisfied the need for a Kubernetes-native solution, so the team chose to build its own platform, kjob.

Benefits of the self‑built kjob platform:

Provides a unified asynchronous task management platform for the technology center, reducing operational overhead.

Unifies the architecture of web services and asynchronous tasks, simplifying management and maintenance.

Leverages Kubernetes capabilities for resource allocation and supports multiple scheduling strategies.

Supports both scheduled (CronJob) and long‑running (Deployment) tasks, with real‑time execution monitoring and alerting.

Overall structure: The platform consists of a management backend and an agent deployed in the Kubernetes cluster (named kjob). The backend handles task distribution, operation management, and permission control, while the agent translates task definitions into Kubernetes objects (CronJob for scheduled tasks, Deployment for long‑running tasks) and provides status retrieval and alerting.

System architecture: The backend is built with the TCF framework. Users interact with the backend to create or modify tasks, which are then sent via HTTP to the appropriate Kubernetes clusters. The kjob agent receives the request, creates or updates the corresponding Kubernetes resources, and reports status back to the backend.

Scheduler interface definition (Go):

// IScheduler is the scheduler interface covering both scheduled and long‑running tasks
type IScheduler interface {
    CreateJob(ctx context.Context, entry JobEntry) error
    UpdateJob(ctx context.Context, entry JobEntry, all ...bool) error
    DeleteJob(ctx context.Context, entry JobEntry) error
    GetJobs(ctx context.Context, queries []QueryEntry) (map[string]ReportResult, error)
    GetExecutionLog(ctx context.Context, query QueryEntry) ([]ExecutionLog, error)

    createCronJob(ctx context.Context, entry JobEntry) error
    createDaemonJob(ctx context.Context, entry JobEntry) error
    updateCronJob(ctx context.Context, entry JobEntry, all ...bool) error
    updateDaemonJob(ctx context.Context, entry JobEntry, all ...bool) error
    deleteCronJob(ctx context.Context, entry JobEntry) error
    deleteDaemonJob(ctx context.Context, entry JobEntry) error
    getCronJobs(ctx context.Context, query QueryEntry) (ReportResult, error)
    getDaemonJobs(ctx context.Context, query QueryEntry) (ReportResult, error)
    getCronJobExecutionLog(ctx context.Context, query QueryEntry) ([]ExecutionLog, error)
    getDaemonJobExecutionLog(ctx context.Context, query QueryEntry) ([]ExecutionLog, error)
}

When updating resources, the platform uses client-go's RetryOnConflict helper to avoid lost updates when concurrent modifications cause optimistic-concurrency conflicts:

err := retry.RetryOnConflict(retry.DefaultRetry, func() error {
    // re-fetch the latest object, apply the changes, then call Update;
    // returning a conflict error makes RetryOnConflict try again
    return nil // placeholder for the actual update call
})

Updating a Kubernetes object requires preserving the resourceVersion. The recommended approach is to fetch the existing object, copy its ObjectMeta into the new object, and then call Update for a full replacement.

Update(ctx context.Context, cronJob *v1.CronJob, opts metav1.UpdateOptions) (*v1.CronJob, error)

Task status definition: The platform normalizes task states into four categories – Pending, Running, Failed, Stopped – regardless of whether the underlying resource is a Deployment or a CronJob. Determining the status of a CronJob involves traversing from CronJob → Job → Pod and aggregating their conditions.
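The aggregation can be sketched as a pure function over the pod phases gathered along the CronJob → Job → Pod chain. The phase names below follow Kubernetes pod phases, but the precedence order (failure wins, then running, then pending) is an illustrative assumption rather than the platform's published logic:

```go
package main

import "fmt"

// TaskStatus is one of the four normalized states exposed by the platform.
type TaskStatus string

const (
	StatusPending TaskStatus = "Pending"
	StatusRunning TaskStatus = "Running"
	StatusFailed  TaskStatus = "Failed"
	StatusStopped TaskStatus = "Stopped"
)

// aggregateStatus folds the pod phases collected from CronJob → Job → Pod
// into one task status. Any failed pod marks the whole task failed; any
// running or completed pod marks it running; no pods at all means stopped.
func aggregateStatus(podPhases []string) TaskStatus {
	if len(podPhases) == 0 {
		return StatusStopped
	}
	status := StatusPending
	for _, phase := range podPhases {
		switch phase {
		case "Failed":
			return StatusFailed // one failed pod fails the task
		case "Running", "Succeeded":
			status = StatusRunning
		}
	}
	return status
}

func main() {
	fmt.Println(aggregateStatus([]string{"Running", "Pending"}))
	fmt.Println(aggregateStatus([]string{"Pending", "Failed"}))
	fmt.Println(aggregateStatus(nil))
}
```

Normalizing to four states keeps the UI and alerting rules identical for Deployments and CronJobs, even though the underlying condition sets differ.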

Monitoring module design: Because the platform operates asynchronously (Kubernetes creates the actual workload), a dedicated monitoring component polls the cluster, checks task health, and sends alerts. The design uses a pipeline fan‑out pattern: a single goroutine reads task entries from the database (protected by a distributed lock), pushes them into a channel, and a configurable pool of worker goroutines consumes the channel to query Kubernetes and evaluate health.

func (s *StatusMonitor) CheckJobs() {
    // Fan-out: a single producer reads task entries from the database
    // (walkSourceData holds the distributed lock), and s.parallelism
    // workers query Kubernetes concurrently.
    jobEntryChan := s.walkSourceData()
    errResultChan := make(chan scheduler.ReportResult)
    var wg sync.WaitGroup
    wg.Add(s.parallelism)
    for i := 0; i < s.parallelism; i++ {
        go func() {
            s.checkStatus(jobEntryChan, errResultChan)
            wg.Done()
        }()
    }
    // Close the result channel once every worker has drained the input.
    go func() {
        wg.Wait()
        close(errResultChan)
    }()
    // Fan-in: log each unhealthy result once, then alert on every channel.
    for r := range errResultChan {
        if s.needAlert(r) {
            s.log(r)
            for _, c := range channel.GetAllChannel() {
                c.Send(r)
            }
        }
    }
}

func (s *StatusMonitor) checkStatus(jobEntryChan <-chan scheduler.JobEntry, errResultChan chan<- scheduler.ReportResult) {
    for job := range jobEntryChan {
        // Stopped and not-yet-started tasks have nothing to check.
        if job.Status == scheduler.StatusStop || job.Status == scheduler.StatusNew {
            continue
        }
        reports, err := s.jobScheduler.GetJobs(context.Background(), []scheduler.QueryEntry{{JobName: job.Name, JobType: job.JobType, Cluster: job.Cluster, Namespace: job.Namespace}})
        if err != nil {
            s.logger.Errorf(context.Background(), "status monitor check job[%s] error: %v", job.Name, err)
            continue
        }
        result, ok := reports[job.Name]
        if !ok {
            s.logger.Errorf(context.Background(), "status monitor missing result for job[%s]", job.Name)
            continue
        }
        // Persist the new status when it diverges from the stored record.
        if job.Status != result.JobStatus {
            // update DB status
        }
        // Unhealthy tasks go to the alerting loop; s.done aborts on shutdown.
        if result.JobStatus == scheduler.StatusBad {
            select {
            case errResultChan <- result:
            case <-s.done:
                return
            }
        }
    }
}

Core highlights:

Simple mode: only the essential fields are required for creating a CronJob or Deployment, reducing user input.

Advanced mode: exposes full Kubernetes configuration (resource limits, affinity, env vars, lifecycle, etc.).

Task import: users can upload a JSON manifest exported from kubectl to let kjob adopt existing CronJobs or Deployments without service interruption.

Support for multiple scheduling types, concurrent task limits, and per‑task timeout settings.

Image repository search to avoid manually typing long image URLs.

Conclusion: The kjob platform provides a unified, cloud‑native solution for managing asynchronous tasks on Kubernetes, offering high availability, rich configuration, real‑time monitoring, and seamless migration from legacy VM‑based schedulers.

Tags: Kubernetes, Go, Task Scheduling, Cloud-native, Asynchronous Jobs