
How a Unified White‑Screen Ops Platform Transformed Multi‑Cloud Middleware Management

This article describes the pain points of traditional middleware operations, explains how Kubernetes and Operators were used to build a unified platform that standardizes, automates, and visualizes multi-cloud resource management, and reports the efficiency, cost, and safety gains achieved across dozens of clusters.

dbaplus Community

Project Background

Traditional middleware operations suffered from scattered management tools, high operational costs, and reliance on opaque command‑line scripts ("black‑screen" operations). The team identified Kubernetes and its Operator framework as a way to provide a unified, declarative, and automated management layer.

Why Kubernetes & Operator?

Standardization: Operations can be expressed as Custom Resources (CR) and handled uniformly.

Automation: Reduces manual steps and human error.

Visualization: A UI can drive the underlying Kubernetes actions, lowering operational complexity.

Core Goals of the Platform

Standardization: Consolidate middleware operational procedures into reusable best‑practice workflows.

Automation: Eliminate dependence on manual scripts and enable end‑to‑end automated actions.

Visualization: Provide a white‑screen UI that makes complex tasks intuitive.

Architecture Overview

The platform consists of several layers:

Multi‑cloud Management Service: Unified hosting of Kubernetes clusters from different cloud providers, offering resource visualization and cross‑cloud scheduling.

Middleware Operations Service: Centralized deployment, scaling, and management of Kafka and Elasticsearch, with a visual interface to reduce SRE effort.

K8s Generic Resource Service: Unified handling of Nodes (labeling, taint management), PersistentVolumes, PVCs, Services, Pods, and CPU Burst, all via CRs.

YAML Management Service: Versioned YAML storage, change audit, and visual diff/rollback capabilities.

Operation Audit Service: Detailed logging of every platform action, integrated with DCheck for compliance checks.
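
As a rough illustration of the kind of record an audit service might keep, the sketch below appends one tab-separated line per action. The field set (timestamp, user, action, target) and the function name `audit_log` are assumptions made for this example; the article does not describe the platform's actual log schema or how DCheck consumes it.

```bash
#!/usr/bin/env bash
# Hypothetical audit helper: the field layout is an assumption, not the
# platform's real schema.

# Append one tab-separated record: UTC timestamp, user, action, target.
# Usage: audit_log <logfile> <user> <action> <target>
audit_log() {
  local logfile="$1" user="$2" action="$3" target="$4"
  printf '%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$user" "$action" "$target" >> "$logfile"
}
```

A call such as `audit_log /tmp/ops-audit.tsv alice scale-kafka kafka/my-cluster` would add one line; an append-only file keeps earlier records untouched, which is the property an audit trail needs.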

Multi‑Cloud Management

Operators abstract away the need to switch kubeconfig files. Users can manage dozens of clusters from a single UI, avoiding the "kubeconfig switching hell".
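
To make the contrast with manual kubeconfig switching concrete, here is a minimal sketch of fanning one command out to several clusters with kubectl's `--context` flag. The function name `for_each_context` and the context names are hypothetical, and this is not the platform's implementation, only an illustration of the idea of addressing many clusters from one place.

```bash
#!/usr/bin/env bash
# Sketch: run the same kubectl arguments against several contexts without
# re-exporting KUBECONFIG. Context names are hypothetical examples.

# Usage: for_each_context <dry-run|run> <ctx1,ctx2,...> <kubectl args...>
# In dry-run mode the commands are only printed, which makes review easy.
for_each_context() {
  local mode="$1" contexts="$2" ctx
  shift 2
  # Split the comma-separated list into words (names must not contain spaces).
  for ctx in $(printf '%s' "$contexts" | tr ',' ' '); do
    if [ "$mode" = "dry-run" ]; then
      echo "kubectl --context ${ctx} $*"
    else
      kubectl --context "$ctx" "$@"
    fi
  done
}
```

For example, `for_each_context dry-run prod-a,prod-b get nodes` prints one kubectl command per context, and replacing `dry-run` with `run` executes them.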

Kafka Expansion – From Black‑Screen Script to White‑Screen UI

Traditional script example (simplified):

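#!/bin/bash
# Point kubectl at the target cluster (switched by hand for every cluster)
export KUBECONFIG=/path/to/kubeconfig
# Inspect the current Kafka custom resource
kubectl get kafka -n kafka-namespace
# Raise the broker count to 5 with a JSON merge patch
kubectl patch kafka my-cluster -n kafka-namespace --type='merge' -p '{"spec":{"kafka":{"replicas":5}}}'
# loop to check pod status …
# Trigger the Cruise Control rebalance (URL elided in the original)
curl -X POST "http://cruise-control…/rebalance" -d "dryrun=false"
# wait for migration …
echo "Kafka expansion completed!"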

The platform replaces this with a one-click UI: the user sets the desired replica count, and the system automatically patches the CR, monitors pod readiness, triggers the Cruise Control data migration, and records the whole process for audit.
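
The first two automated steps (patch the CR, then wait for readiness) could be sketched as follows. This is an assumption-laden illustration, not the platform's code: the function names are hypothetical, and waiting on a `Ready` condition assumes the Kafka custom resource exposes one in its status. The rebalance and audit steps are left out because the Cruise Control endpoint is elided in the original script.

```bash
#!/usr/bin/env bash
# Sketch of the automated scale-out steps. Function names are hypothetical.

# Build the JSON merge patch that sets spec.kafka.replicas.
# Rejects anything that is not a plain non-negative integer.
build_replica_patch() {
  local replicas="$1"
  case "$replicas" in
    ''|*[!0-9]*|0?*)
      echo "replicas must be a non-negative integer" >&2
      return 1
      ;;
  esac
  printf '{"spec":{"kafka":{"replicas":%s}}}\n' "$replicas"
}

# Patch the CR, then block until the resource reports Ready.
# Assumes the Kafka CR publishes a "Ready" condition.
# Usage: scale_kafka <cluster> <namespace> <replicas>
scale_kafka() {
  local cluster="$1" namespace="$2" replicas="$3" patch
  patch="$(build_replica_patch "$replicas")" || return 1
  kubectl patch kafka "$cluster" -n "$namespace" --type=merge -p "$patch" &&
    kubectl wait "kafka/${cluster}" -n "$namespace" \
      --for=condition=Ready --timeout=600s
}
```

Validating the input before building the patch keeps a typo in the UI or CLI from reaching the cluster. A real workflow would likely also confirm that the new broker pods exist before rebalancing, since a resource can still report `Ready` from its previous state right after the patch.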

Node Management – From Manual Scripts to Visual Dashboard

A legacy Java-based script scanned each node, parsed its CPU, memory, and disk type, and applied labels one node at a time, which was error-prone and slow (often more than 1 hour). The new service provides:

Real‑time visualization of node metrics (CPU, memory, disk, labels, taints).

Multi‑dimensional filtering (labels, taints, resources, zones).

Batch labeling and taint management via UI, reducing a 1‑hour task to ~3 minutes.
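
A batch-labeling step of this kind can be approximated by generating a reviewable plan of `kubectl label` commands from node facts. In the sketch below, the input format, the label keys under `example.com/`, and the 32-core threshold are all hypothetical; the article does not say which labels or rules the platform applies.

```bash
#!/usr/bin/env bash
# Sketch: turn node facts into a plan of kubectl label commands.
# Label keys and the 32-core threshold are hypothetical.

# Reads "<node> <cpu_cores> <disk_type>" lines from stdin and prints one
# kubectl command per well-formed line; malformed lines are skipped.
plan_node_labels() {
  awk '
    NF == 3 {
      size = ($2 >= 32) ? "large" : "standard"
      printf "kubectl label node %s example.com/disk-type=%s example.com/size-class=%s --overwrite\n", $1, $3, size
    }
  '
}
```

Printing the plan instead of executing it lets someone review the commands before piping them to a shell. When the target nodes already share a label, `kubectl label nodes -l <selector> key=value` and `kubectl taint nodes -l <selector> key=value:NoSchedule` can apply a change to the whole group in one call.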

PV & Cloud Disk Management

When a middleware cluster is deleted, its PersistentVolumes remain, leaving orphaned cloud disks that cannot be traced back to owners. The platform introduces:

Visualization of PV‑to‑cloud‑disk mappings.

Automated detection of idle disks.

One-click release of cloud disks, cutting release time from more than 15 minutes to about 1 minute and saving more than CNY 150,000 per month.
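
One plausible signal for an idle disk is a PersistentVolume in the `Released` phase, which is where a volume with a `Retain` reclaim policy ends up after its claim is deleted. The sketch below lists such volumes together with their CSI volume handle. Whether the platform uses this signal is an assumption, and the function names are hypothetical.

```bash
#!/usr/bin/env bash
# Sketch: find PVs whose claim is gone but whose backing disk still exists.
# Treating "Released" as "idle" is an assumption made for this example.

# Emit "<pv-name> <phase> <volume-handle>" rows for every PV in the cluster.
# Non-CSI volumes show "<none>" in the handle column.
fetch_pv_table() {
  kubectl get pv --no-headers \
    -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,HANDLE:.spec.csi.volumeHandle
}

# Keep only Released PVs and print "<pv-name> <volume-handle>".
list_released_pvs() {
  awk '$2 == "Released" { print $1, $3 }'
}
```

Running `fetch_pv_table | list_released_pvs` yields the PV-to-disk pairs that are candidates for release. Keeping the filter separate from the kubectl call makes it easy to check the logic against sample data before touching a cluster.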

CPU Burst Management

During traffic spikes, CPU usage can hit 100% and cause pod eviction. The platform's CPU Burst feature temporarily lifts CPU limits for critical pods, acting as an emergency reserve that keeps services alive during high-load events. It is already enabled in more than 10 Kubernetes clusters and more than 30 Elasticsearch clusters.
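
The article does not say how CPU Burst is implemented. One mechanism available on Linux 5.14 and later is CFS burst (`cpu.cfs_burst_us` on cgroup v1, `cpu.max.burst` on cgroup v2), which lets a container spend previously unused quota during a spike. Assuming that mechanism, the arithmetic looks roughly like the sketch below; the burst percentage is a hypothetical tuning knob, not a documented platform setting.

```bash
#!/usr/bin/env bash
# Sketch of CFS quota and burst arithmetic. Assumes the burst feature is
# backed by Linux CFS burst, which the article does not confirm.

# CFS quota in microseconds per period implied by a CPU limit in millicores.
# The default period of 100000 us matches the kubelet default.
cfs_quota_us() {
  local millicores="$1" period_us="${2:-100000}"
  echo $(( millicores * period_us / 1000 ))
}

# Extra budget a container may accumulate, as a percentage of its quota.
# The percentage parameter is hypothetical.
cfs_burst_us() {
  local millicores="$1" percent="$2" quota
  quota="$(cfs_quota_us "$millicores")"
  echo $(( quota * percent / 100 ))
}
```

For a pod limited to 2 CPUs (2000 millicores), `cfs_quota_us 2000` gives 200000, and `cfs_burst_us 2000 50` gives 100000, meaning the container could run up to 100 ms beyond its quota in a single period if it had been idle earlier.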

YAML Management Service

YAML files are the source of truth for Kubernetes resources, but manual edits are risky. The service offers:

Version control with add/modify/rollback and diff capabilities.

Full audit trails for every change.

Visual editor to reduce syntax errors.
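
The versioning, diff, and rollback capabilities listed above can be sketched with plain files. The layout below (one `v<N>.yaml` per saved version in a store directory) and the function names are assumptions for illustration; the platform's real storage backend is not described in the article.

```bash
#!/usr/bin/env bash
# Sketch: file-based manifest versioning. Layout and names are hypothetical.

# Count the versions already saved in the store directory.
count_versions() {
  local store="$1" n=0 f
  for f in "$store"/v*.yaml; do
    if [ -e "$f" ]; then n=$(( n + 1 )); fi
  done
  echo "$n"
}

# Save a manifest as the next version and print its version number.
save_version() {
  local store="$1" file="$2" next
  mkdir -p "$store"
  next=$(( $(count_versions "$store") + 1 ))
  cp "$file" "${store}/v${next}.yaml"
  echo "$next"
}

# Show a unified diff between two saved versions (exit status 1 = differ).
diff_versions() {
  local store="$1" from="$2" to="$3"
  diff -u "${store}/v${from}.yaml" "${store}/v${to}.yaml"
}

# Roll back by re-saving an old version as the newest one, so the history
# is never rewritten.
rollback_to() {
  local store="$1" version="$2"
  save_version "$store" "${store}/v${version}.yaml"
}
```

Modeling rollback as a new version rather than a deletion keeps the audit trail complete: every state the manifest has ever been in remains retrievable. This sketch is not safe for concurrent writers; a real service would need locking or a database.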

Project Outcomes

After three development phases, the platform supports:

Standardized operations for Kafka, Elasticsearch, Node, PV, PVC, Service, and Pod.

Automation of >430 white‑screen operations across 300+ middleware clusters.

Node labeling time reduced from >1 hour to 3 minutes; PV release time reduced from >15 minutes to 1 minute.

Release of 675+ idle cloud disks, saving more than CNY 150,000 per month.

More than 1,020 audit log entries recorded, with compliance checks via DCheck.

Scalable architecture that can incorporate new resources (Deployments, StatefulSets, Ingress, ConfigMaps, Secrets, custom resources like DMQ, Pulsar, ZK).

Experience & Reflections

Key lessons include the importance of standardization, tightly coupling tooling with processes, and embedding audit/compliance into every operation. Challenges remain in integrating with other platforms (e.g., KubeOne) and expanding test coverage for new scenarios.

Future Outlook

The team plans to extend white‑screen support to more Kubernetes resources, introduce AI‑driven fault‑auto‑healing, improve multi‑cloud integration, and continuously refine the user experience based on feedback.
