Measuring Ops Automation Rate and Building a Coding Platform with Taishan‑Qilin
This article explains how to measure the operations automation rate, outlines the challenges of manual ops, and provides a step‑by‑step guide to creating a coding‑based automation platform on Taishan‑Qilin, including formulas, code examples, deployment, and real‑world results.
The focus is twofold: a metric for tracking how much ops work is automated, and a platform on which ops engineers implement that automation themselves, in code.
Introduction
As business systems and middleware become increasingly complex, traditional manual operations face many challenges and limitations. Hand‑written scripts are inefficient and risky, making automation essential for operations teams.
Enabling ops engineers to autonomously develop automation tasks through coding improves efficiency, reduces risk, and opens new possibilities for the field.
Challenges and Limitations of Operations
Complex Script Management and Manual Operations
Script maintenance and version control become increasingly difficult over time, especially in multi‑person teams, leading to accidental use of wrong scripts.
Manual operations are prone to human error, which can cause system failures or data loss.
Human Errors and Dependency on Individual Skills
Over‑reliance on a few individuals makes the team vulnerable when key personnel are absent.
Lack of standardized processes increases the risk of mistakes.
Limited Personal Growth
Ops engineers are often on call around the clock for incident response, which leaves little time for learning and development.
With the rise of cloud computing, many traditional ops tasks are being replaced by automated solutions, requiring engineers to adopt new skills to stay competitive.
Why Operations Automation Matters
Cost
Server resources cost billions annually; as scale grows, cost control becomes critical, making automated cost‑allocation mechanisms essential.
Efficiency
Routine tasks such as resource allocation, scaling, health checks, and service restarts are repetitive; automation frees engineers to focus on higher‑value work.
Stability
Automation reduces human error, ensuring more stable system operation and enabling rapid detection, response, and recovery.
Defining the Ops Automation Rate
The automation rate for the technical support department is calculated as:
<code>Ops Automation Rate = (Number of automated operations via Taishan‑Qilin) / ((Number of manual operations via bastion host) + (Number of automated operations via Taishan‑Qilin))</code>
The numerator counts automated commands, functions, or orchestration tasks run through Taishan‑Qilin; the denominator adds the manual operations performed after logging into the bastion host to that same count.
Since April, the automation rate has risen from 3% in Q2 to 63% today.
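The formula can be sketched in a few lines of Go. This is only an illustration of the arithmetic: the function name `AutomationRate` and the sample counts are assumptions, not figures or code from the platform.

```go
package main

import "fmt"

// AutomationRate returns automated / (manual + automated).
// It returns 0 when no operations were recorded, to avoid dividing by zero.
func AutomationRate(automated, manual int) float64 {
	total := automated + manual
	if total == 0 {
		return 0
	}
	return float64(automated) / float64(total)
}

func main() {
	// Hypothetical counts chosen only to illustrate the arithmetic.
	fmt.Printf("%.0f%%\n", AutomationRate(630, 370)*100) // prints "63%"
}
```

Because the automated count appears in both the numerator and the denominator, the rate always stays between 0 and 1, and it reaches 1 only when no manual bastion-host operations remain.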
Why Ops Engineers Should Code Their Own Automation
Reduce Communication Overhead: Engineers understand their own needs and can develop tailored tools, cutting down on back‑and‑forth with platform teams.
Rapid Response to Requirements: Direct development allows quick adaptation to business changes without waiting for platform roadmaps.
Save Maintenance Costs: Custom code avoids duplicated effort across teams and focuses on business logic.
Promote Professional Growth: Coding enhances programming, system understanding, and problem‑solving skills.
Case Study: ChubaoFS Ops Automation
ChubaoFS has built 43 atomic ops functions and 18 orchestration tasks, executing about 500 automated tasks weekly.
Step‑by‑Step Implementation on Taishan‑Qilin
1. Request Ops System Menu
Contact the platform administrator to create a menu and generate an auth file for later controller development.
<code>apiVersion: v1
clusters:
- cluster:
    certificate-authority: ca.pem
    server: https://xxx.jd.com:80
  name: kubernetes
contexts:
- context:
    cluster: kubernetes
    user: kubecfg
  name: default
current-context: default
kind: Config
preferences: {}
users:
- name: kubecfg
  user:
    client-certificate-data: xxxxx  # has full permissions on the namespace that corresponds to the menu
    client-key-data: xxxxx  # has full permissions on the namespace that corresponds to the menu
</code>
2. Create Ops Function
Choose either an HTTP‑based service or a custom Kubernetes CRD controller. The guide focuses on the custom controller approach.
3. Write Controller Code
Download the provided template and add the auth file. The template includes two examples: a single‑function controller and a multi‑function controller.
<code>package main

import (
	"flag"
	"os"

	"k8s.io/apimachinery/pkg/runtime"
	clientgoscheme "k8s.io/client-go/kubernetes/scheme"
	_ "k8s.io/client-go/plugin/pkg/client/auth/gcp"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/log/zap"

	examplev1 "controllers/example/api/v1"
	"controllers/example/api/web/service"
	"controllers/example/controllers"
	// +kubebuilder:scaffold:imports
)

var (
	scheme   = runtime.NewScheme()
	setupLog = ctrl.Log.WithName("setup")
)

func init() {
	// Register the built-in Kubernetes types and the custom ExampleKind type.
	_ = clientgoscheme.AddToScheme(scheme)
	_ = examplev1.AddToScheme(scheme)
	// +kubebuilder:scaffold:scheme
}

func main() {
	var metricsAddr string
	var enableLeaderElection bool
	flag.StringVar(&metricsAddr, "metrics-addr", ":8090", "The address the metric endpoint binds to.")
	flag.BoolVar(&enableLeaderElection, "enable-leader-election", false, "Enable leader election for controller manager.")
	flag.Parse()

	ctrl.SetLogger(zap.New(func(o *zap.Options) { o.Development = true }))

	// The manager loads the auth file (kubeconfig) and hosts all controllers.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		Scheme:             scheme,
		MetricsBindAddress: metricsAddr,
		LeaderElection:     enableLeaderElection,
		Port:               9443,
	})
	if err != nil {
		setupLog.Error(err, "unable to start manager")
		os.Exit(1)
	}

	// Start the template's web service alongside the controller manager.
	go service.RunServer(mgr)

	if err = (&controllers.ExampleKindReconciler{
		Client: mgr.GetClient(),
		Log:    ctrl.Log.WithName("controllers").WithName("ExampleKind"),
		Scheme: mgr.GetScheme(),
	}).SetupWithManager(mgr); err != nil {
		setupLog.Error(err, "unable to create controller", "controller", "ExampleKind")
		os.Exit(1)
	}

	setupLog.Info("starting manager")
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		setupLog.Error(err, "problem running manager")
		os.Exit(1)
	}
}
</code>
<code>package controllers

import (
	"context"
	"strconv"

	"github.com/go-logr/logr"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	examplev1 "controllers/example/api/v1"
)

// ExampleKindReconciler reconciles ExampleKind objects.
type ExampleKindReconciler struct {
	client.Client
	Log    logr.Logger
	Scheme *runtime.Scheme
}

// num counts Reconcile calls; it is demo state only.
var num = 0

func (r *ExampleKindReconciler) Reconcile(req ctrl.Request) (ctrl.Result, error) {
	num++
	ctx := context.Background()
	log := r.Log.WithValues("examplekind", req.NamespacedName)

	example := &examplev1.ExampleKind{}
	if err := r.Get(ctx, req.NamespacedName, example); err != nil {
		// Stop here instead of continuing with an empty object:
		// not-found is ignored, any other error triggers a retry.
		log.V(1).Info("couldn't find module: " + req.String())
		return ctrl.Result{}, client.IgnoreNotFound(err)
	}
	log.V(1).Info("received Moduler resource change", "Resource.spec", example.Spec)
	log.V(1).Info("received Moduler resource change", "Status", example.Status)

	// Simplified handling logic
	if example.Status.Event == "created" {
		example.Status.Event = "created_done"
		example.Spec.Ba += strconv.Itoa(num)
		if err := r.Update(ctx, example); err != nil {
			return ctrl.Result{}, err
		}
	}
	// other event handling omitted for brevity
	return ctrl.Result{}, nil
}

// SetupWithManager registers the reconciler to watch ExampleKind resources.
func (r *ExampleKindReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).For(&examplev1.ExampleKind{}).Complete(r)
}
</code>
4. Deploy the Function
Package the compiled controller into a container image and deploy it through the Taishan‑Qilin platform, adjusting for any environment‑specific requirements.
5. Publish the Function
After deployment, click “Publish” on the platform and grant authorization so other ops engineers can use the function.
6. Execute the Function
Functions can be run directly or orchestrated into complex automation scenarios using the platform’s workflow editor.
7. View Execution Records
The platform provides detailed logs, parameters, and results for each execution.
Taishan‑Qilin Platform Overview
The platform extends Kubernetes with Custom Resource Definitions (CRDs) to offer a programmable, unified ops environment, embodying Infrastructure‑as‑Code principles and declarative APIs.
Key capabilities include:
Command execution with concurrency, timeout, and kill controls.
Scheduled tasks for inspections, backups, log cleanup, etc.
Resource visualization for assets, databases, middleware, applications.
Resource operations that attach custom ops functions for quick actions.
Ops orchestration: a graphical workflow editor that chains atomic functions and approval steps, enabling complex automated jobs with safety checks.