
Measuring Ops Automation Rate and Building a Coding Platform with Taishan‑Qilin

This article explains how to measure the operations automation rate, outlines the challenges of manual ops, and provides a step‑by‑step guide to creating a coding‑based automation platform on Taishan‑Qilin, including formulas, code examples, deployment, and real‑world results.

Efficient Ops

This article introduces how to measure the operations automation rate and presents a platform that enables ops engineers to implement automation through code, ushering ops into a new era.

Introduction

As business systems and middleware become increasingly complex, traditional manual operations face many challenges and limitations. Hand‑written scripts are inefficient and risky, making automation essential for operations teams.

Enabling ops engineers to autonomously develop automation tasks through coding improves efficiency, reduces risk, and opens new possibilities for the field.

Challenges and Limitations of Operations

Complex Script Management and Manual Operations

Script maintenance and version control become increasingly difficult over time, especially in multi‑person teams, leading to accidental use of wrong scripts.

Manual operations are prone to human error, which can cause system failures or data loss.

Human Errors and Dependency on Individual Skills

Over‑reliance on a few individuals makes the team vulnerable when key personnel are absent.

Lack of standardized processes increases the risk of mistakes.

Limited Personal Growth

Ops engineers are often on call around the clock for incident response, which leaves little time for learning and development.

With the rise of cloud computing, many traditional ops tasks are being replaced by automated solutions, requiring engineers to adopt new skills to stay competitive.

Why Operations Automation Matters

Cost

Server resources cost billions annually; as scale grows, cost control becomes critical, making automated cost‑allocation mechanisms essential.

Efficiency

Routine tasks such as resource allocation, scaling, health checks, and service restarts are repetitive; automation frees engineers to focus on higher‑value work.

Stability

Automation reduces human error, ensuring more stable system operation and enabling rapid detection, response, and recovery.

Defining the Ops Automation Rate

The automation rate for the technical support department is calculated as:

<code>Ops Automation Rate = (Number of automated operations via Taishan‑Qilin) / ((Number of manual operations via bastion host) + (Number of automated operations via Taishan‑Qilin))</code>

The numerator counts automated commands, functions, or orchestration tasks; the denominator counts manual operations performed after logging into the bastion host, plus the numerator.
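As a quick check of the arithmetic, the definition can be sketched as a small Go function. The function name and the counts in `main` are illustrative assumptions and are not taken from the platform; the zero-total guard is likewise an added convention.

```go
package main

import "fmt"

// AutomationRate returns automated / (manual + automated), following the
// definition in the text. It returns 0 when no operations were recorded,
// which avoids a division by zero (an assumed convention, not from the source).
func AutomationRate(automated, manual int) float64 {
	total := automated + manual
	if total == 0 {
		return 0
	}
	return float64(automated) / float64(total)
}

func main() {
	// Hypothetical counts, chosen only to illustrate the calculation.
	fmt.Printf("%.0f%%\n", AutomationRate(630, 370)*100) // prints "63%"
}
```

Because the numerator also appears in the denominator, the rate always falls between 0 and 1.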

Since April, the automation rate has risen from 3% in Q2 to 63% at the time of writing.

Why Ops Engineers Should Code Their Own Automation

Reduce Communication Overhead: Engineers understand their own needs and can develop tailored tools, cutting down on back‑and‑forth with platform teams.

Rapid Response to Requirements: Direct development allows quick adaptation to business changes without waiting for platform roadmaps.

Save Maintenance Costs: Custom code avoids duplicated effort across teams and focuses on business logic.

Promote Professional Growth: Coding enhances programming, system understanding, and problem‑solving skills.

Case Study: ChubaoFS Ops Automation

ChubaoFS has built 43 atomic ops functions and 18 orchestration tasks, executing about 500 automated tasks weekly.

Step‑by‑Step Implementation on Taishan‑Qilin

1. Request Ops System Menu

Contact the platform administrator to create a menu and generate an auth file for later controller development.

<code>apiVersion: v1
clusters:
- cluster:
    certificate-authority: ca.pem
    server: https://xxx.jd.com:80
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubecfg
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: kubecfg
  user:
    client-certificate-data: xxxxx  # has full permissions on the namespace bound to the menu
    client-key-data: xxxxx  # has full permissions on the namespace bound to the menu</code>

2. Create Ops Function

Choose either an HTTP‑based service or a custom Kubernetes CRD controller. The guide focuses on the custom controller approach.

3. Write Controller Code

Download the provided template and add the auth file. The template includes two examples: a single‑function controller and a multi‑function controller.

<code>package main

import (
    "flag"
    "os"

    "k8s.io/apimachinery/pkg/runtime"
    clientgoscheme "k8s.io/client-go/kubernetes/scheme"
    _ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/log/zap"

    examplev1 "controllers/example/api/v1"
    "controllers/example/api/web/service"
    "controllers/example/controllers"
    // +kubebuilder:scaffold:imports
)

var (
    scheme   = runtime.NewScheme()
    setupLog = ctrl.Log.WithName("setup")
)

// init registers the built-in Kubernetes types and the ExampleKind API types.
func init() {
    _ = clientgoscheme.AddToScheme(scheme)
    _ = examplev1.AddToScheme(scheme)
    // +kubebuilder:scaffold:scheme
}

func main() {
    var metricsAddr string
    var enableLeaderElection bool
    flag.StringVar(&metricsAddr, "metrics-addr", ":8090", "The address the metric endpoint binds to.")
    flag.BoolVar(&enableLeaderElection, "enable-leader-election", false, "Enable leader election for controller manager.")
    flag.Parse()

    ctrl.SetLogger(zap.New(func(o *zap.Options) { o.Development = true }))

    // The manager owns the shared client, cache, and metrics endpoint.
    mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
        Scheme:             scheme,
        MetricsBindAddress: metricsAddr,
        LeaderElection:     enableLeaderElection,
        Port:               9443,
    })
    if err != nil {
        setupLog.Error(err, "unable to start manager")
        os.Exit(1)
    }

    // Run the template's web service in the background.
    go service.RunServer(mgr)

    // Register the ExampleKind reconciler with the manager.
    if err = (&controllers.ExampleKindReconciler{
        Client: mgr.GetClient(),
        Log:    ctrl.Log.WithName("controllers").WithName("ExampleKind"),
        Scheme: mgr.GetScheme(),
    }).SetupWithManager(mgr); err != nil {
        setupLog.Error(err, "unable to create controller", "controller", "ExampleKind")
        os.Exit(1)
    }

    setupLog.Info("starting manager")
    if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
        setupLog.Error(err, "problem running manager")
        os.Exit(1)
    }
}
</code>
<code>package controllers

import (
    "context"
    "strconv"

    "github.com/go-logr/logr"
    "k8s.io/apimachinery/pkg/runtime"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    examplev1 "controllers/example/api/v1"
)

// ExampleKindReconciler reconciles ExampleKind objects.
type ExampleKindReconciler struct {
    client.Client
    Log    logr.Logger
    Scheme *runtime.Scheme
}

// num counts Reconcile calls. It is demo state only and is not safe if
// reconciles run concurrently.
var num = 0

func (r *ExampleKindReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
    num++
    ctx := context.Background()
    log := r.Log.WithValues("examplekind", req.NamespacedName)

    example := &examplev1.ExampleKind{}
    if err := r.Get(ctx, req.NamespacedName, example); err != nil {
        log.V(1).Info("couldn't find module: " + req.String())
        // Nothing to do if the object was deleted; other errors are retried.
        return ctrl.Result{}, client.IgnoreNotFound(err)
    }
    log.V(1).Info("received change to Moduler resource", "Resource.spec", example.Spec)
    log.V(1).Info("received change to Moduler resource", "Status", example.Status)

    // Simplified handling logic
    if example.Status.Event == "created" {
        example.Status.Event = "created_done"
        example.Spec.Ba += strconv.Itoa(num)
        if err := r.Update(ctx, example); err != nil {
            return ctrl.Result{}, err
        }
    }
    // other event handling omitted for brevity
    return ctrl.Result{}, nil
}

// SetupWithManager registers the reconciler to watch ExampleKind objects.
func (r *ExampleKindReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).For(&examplev1.ExampleKind{}).Complete(r)
}
</code>

4. Deploy the Function

Package the compiled controller into a container image and deploy it through the Taishan‑Qilin platform, adjusting for any environment‑specific requirements.

5. Publish the Function

After deployment, click “Publish” on the platform and grant authorization so other ops engineers can use the function.

6. Execute the Function

Functions can be run directly or orchestrated into complex automation scenarios using the platform’s workflow editor.

7. View Execution Records

The platform provides detailed logs, parameters, and results for each execution.

Taishan‑Qilin Platform Overview

The platform extends Kubernetes with Custom Resource Definitions (CRDs) to offer a programmable, unified ops environment, embodying Infrastructure‑as‑Code principles and declarative APIs.

Key capabilities include:

Command execution with concurrency, timeout, and kill controls.

Scheduled tasks for inspections, backups, log cleanup, etc.

Resource visualization for assets, databases, middleware, applications.

Resource operations that attach custom ops functions for quick actions.

Ops orchestration: a graphical workflow editor that chains atomic functions and approval steps, enabling complex automated jobs with safety checks.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
