How to Build an Automated Kubernetes Inspection Platform with Bash and Prometheus
This article explains how to design and implement a Kubernetes platform inspection system that combines Bash scripts and Prometheus queries to monitor cluster health, core component status, and node resources, providing actionable alerts and a flexible automation framework.
What Is Platform Inspection
Platform inspection is a monitoring tool that evaluates the health of underlying systems, quickly identifying potential risks and offering remediation suggestions.
The tool scans various aspects of a cluster, including performance bottlenecks, component statuses, resource usage, and configuration issues, to improve stability and availability.
Why Inspection Matters
Even with metrics, logs, traces, Grafana, and alerts, inspection adds value by:
Supplementing monitoring with items such as certificate expiration, Pod CIDR usage, and etcd and Velero backup status, which are easier to check with scripts than with exporters.
Checking the health of Prometheus, VictoriaMetrics, and other components to ensure metrics are being collected.
Providing proactive problem discovery through centralized checks instead of inspecting each Grafana panel individually.
Kubernetes Inspection Key Metrics
The metrics are divided into three categories:
Cluster Overview
Core Component Status
Node Status
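The PromQL expressions and Bash scripts below come from one specific environment. Label values such as cluster="core" and zone="shanghai", file paths, and thresholds must be adapted to your own cluster.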
Cluster Overview
Inspection Item: Node Usage
Description: Checks whether the cluster has spare resources.
Source: bash
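#!/bin/bash
set -o errexit
set -o nounset
# Count worker nodes (every node whose line does not mention "master").
node_sum=$(kubectl get nodes | awk 'NR>1' | grep -v master -c)
# Count worker nodes that are schedulable (not marked SchedulingDisabled).
node_ready=$(kubectl get nodes | awk 'NR>1' | grep -v master | grep -v SchedulingDisabled -c)
echo "| ${node_ready}/${node_sum}"
# success when at least one worker is still cordoned, which this check treats as spare capacity.
if [[ $node_sum -gt $node_ready ]]; then
  echo "success"
else
  echo "warning"
fi
Inspection Item: Pod Remaining Capacity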
Description: Determines if there are Pods available for allocation.
Source: prometheus
sum(kube_node_status_capacity{resource='pods'} * on(node) group_left(label_env) kube_node_labels{label_env=~"prod",cluster="core",zone=~"shanghai"} unless on(node) kube_node_role) -
sum(kube_pod_info * on(node) group_left(label_env) kube_node_labels{label_env=~"prod",cluster="core",zone=~"shanghai"} unless on(node) kube_node_role)
Threshold: ["<",90]
Inspection Item: Pod CIDR Usage
Description: Shows the number of IPs left for Pods.
Source: bash
#!/bin/bash
set -o errexit
set -o nounset
# Read the free-IP column from the Calico IPAM summary; the column index depends on the calicoctl output format.
pod_ip_free=$(calicoctl ipam show | grep '%' | awk '{print $12}')
echo "| Remaining IPs: ${pod_ip_free}"
if [[ $pod_ip_free -gt 500 ]]; then
  echo "success"
elif [[ $pod_ip_free -gt 100 ]]; then
  echo "warning"
else
  echo "error"
fi
Inspection Item: Cluster CPU Usage
Source: prometheus
(1 - avg(label_replace(rate(node_cpu_seconds_total{mode="idle",cluster="core",zone=~"shanghai"}[60s]),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai"} * on(node) group_left(internal_ip) kube_node_info)) * 100
Threshold: [">",50]
Inspection Item: Cluster Memory Usage
Source: prometheus
(1 - sum(label_replace(node_memory_MemAvailable_bytes{cluster="core",zone=~"shanghai"},"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai"} * on(node) group_left(internal_ip) kube_node_info) / sum(label_replace(node_memory_MemTotal_bytes{cluster="core",zone=~"shanghai"},"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai"} * on(node) group_left(internal_ip) kube_node_info)) * 100
Threshold: [">",85]
Inspection Item: Certificate Expiration
Source: bash
#!/bin/bash
set -o errexit
set -o nounset
# Expiry time of the apiserver certificate (notAfter) as a Unix timestamp.
ct=$(date -d "$(openssl x509 -in /etc/kubernetes/pki/apiserver.crt -noout -dates | awk -F '=' '/notAfter/{print $2}' | awk '{print $1,$2,$3,$4}')" +%s)
# Current time as a Unix timestamp.
dt=$(date +%s)
# Whole days remaining until the certificate expires.
expired=$(( (ct-dt)/(60*60*24) ))
echo "| expires in ${expired} days"
if [[ $expired -gt 60 ]]; then
  echo "success"
elif [[ $expired -gt 15 ]]; then
  echo "warning"
else
  echo "error"
fi
Inspection Item: Etcd Backup Status
Source: bash
#!/bin/bash
set -o nounset
# List anything under the backup directory modified within the last 120 minutes.
result=$(find /var/lib/docker/etcd_backup/ -mmin -120)
if [[ -n ${result} ]]; then
  echo "OK"
  echo "success"
else
  echo "abnormal"
  echo "error"
fi
Inspection Item: Velero Backup Status
Source: bash
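#!/bin/bash
set -o nounset
current_date=$(date +%F)
# Date of the most recent backup whose line mentions core-shanghai.
backup_date=$(velero backup get | grep core-shanghai | awk '{print $5}' | sort -nr | head -1)
backup_date_2d=$(date -d "${backup_date} +2 days" +%F)
# ISO dates (YYYY-MM-DD) compare correctly as strings: healthy when the latest backup is less than two days old.
if [[ $backup_date_2d > $current_date && $backup_date != "" ]]; then
  echo "OK"
  echo "success"
else
  echo "abnormal"
  echo "error"
fi
Core Component Status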
etcd
Inspection Item: Insufficient etcd Nodes
Source: prometheusOr
sum by(job) (up{job=~".*etcd.*",cluster="core",zone="shanghai"} == bool 1) < ((count by(job) (up{job=~".*etcd.*",cluster="core",zone="shanghai"}) + 1) / 2)
Threshold: yes
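The comparison in this query, healthy members < (total members + 1) / 2, can be checked by hand with a small Bash sketch. The function name etcd_members_sufficient is hypothetical and exists only for illustration; it uses floating-point division, as PromQL does.

```bash
#!/bin/bash
# Hypothetical helper mirroring the PromQL comparison above.
# Exit status 0 means the healthy-member count is NOT below (total + 1) / 2.
etcd_members_sufficient() {
  local up=$1 total=$2
  # awk does the floating-point arithmetic that bash integer math cannot.
  awk -v u="$up" -v t="$total" 'BEGIN { exit !(u >= (t + 1) / 2) }'
}

# Walk through a few "healthy total" combinations.
for pair in "3 3" "2 3" "1 3" "2 4" "3 5"; do
  read -r up total <<< "$pair"
  if etcd_members_sufficient "$up" "$total"; then
    echo "${up}/${total} healthy: sufficient"
  else
    echo "${up}/${total} healthy: insufficient"
  fi
done
```

With three members, losing one is tolerated and losing two is flagged. With four members, two healthy members are already flagged, which matches the etcd quorum size of floor(n/2) + 1.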
Inspection Item: etcd Leader Presence
Source: prometheusOr
etcd_server_has_leader{job=~".*etcd.*",cluster="core",zone="shanghai"} == 1
Threshold: no
Inspection Item: Frequent etcd Leader Switches
Source: prometheusOr
increase(etcd_server_leader_changes_seen_total{job=~".*etcd.*",cluster="core",zone="shanghai"}[15m]) > 3
Threshold: yes
Inspection Item: etcd Request Success Rate
Source: prometheus
100 - max(sum(rate(grpc_server_handled_total{grpc_type="unary",grpc_code!="OK",cluster="core",zone="shanghai"}[1m])) by (grpc_service) / sum(rate(grpc_server_started_total{grpc_type="unary",cluster="core",zone="shanghai"}[1m])) by (grpc_service) * 100.0)
Threshold: ["<",99]
Inspection Item: etcd Disk WAL Latency
Source: prometheus
max(histogram_quantile(0.99, sum(rate(etcd_disk_wal_fsync_duration_seconds_bucket{cluster="core",zone="shanghai"}[1m])) by (instance,le))) * 1000
Threshold: [">",10]
kube-apiserver
Inspection Item: apiserver Health
Source: prometheus
sum(up{job="apiserver",cluster="core",zone="shanghai"}) / count(up{job="apiserver",cluster="core",zone="shanghai"}) * 100
Threshold: ["<",90]
Inspection Item: apiserver QPS
Source: prometheus
sum(rate(apiserver_request_total{cluster="core",zone="shanghai"}[1m]))
Threshold: [">",3000]
Inspection Item: apiserver Request Success Rate
Source: prometheus
apiserver_request:availability30d{verb="all",cluster="core",zone="shanghai"} * 100
Threshold: ["<",99]
Inspection Item: apiserver Request Latency
Source: prometheus
max(cluster_quantile:apiserver_request_duration_seconds:histogram_quantile{cluster="core",zone="shanghai"})
Threshold: [">",1]
Node Status
kubelet
Inspection Item: Unready Nodes
Source: prometheusList
sum by(node) (kube_node_status_condition{condition="Ready",job="kube-state-metrics",status="true",cluster="core",zone="shanghai"}) == 0
Inspection Item: High PLEG Relist Duration
Source: prometheusList
histogram_quantile(0.99, sum(rate(kubelet_pleg_relist_duration_seconds_bucket{job="kubelet",metrics_path="/metrics",cluster="core",zone="shanghai"}[1m])) by (node,le)) * 1000 > 1000
Resource Usage
Inspection Item: Nodes with CPU > 50%
Source: prometheusList
(1 - avg by(internal_ip) (label_replace(rate(node_cpu_seconds_total{mode="idle",cluster="core",zone=~"shanghai"}[60s]),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai",label_env=~"prod"} * on(node) group_left(internal_ip) kube_node_info)) * 100 > 50
Inspection Item: Nodes with Memory > 80%
Source: prometheusList
sum by(internal_ip) (label_replace(1 - (node_memory_MemAvailable_bytes{cluster="core",zone=~"shanghai"} / node_memory_MemTotal_bytes{cluster="core",zone=~"shanghai"}),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai",label_env=~"prod"} * on(node) group_left(internal_ip) kube_node_info) * 100 > 80
Inspection Item: Disk / Usage > 80%
Source: prometheusList
sum by(internal_ip) (label_replace(100 - ((node_filesystem_avail_bytes{job="node-exporter",mountpoint="/",fstype!="rootfs",cluster="core",zone="shanghai"} * 100) / node_filesystem_size_bytes{job="node-exporter",mountpoint="/",fstype!="rootfs",cluster="core",zone="shanghai"}),"internal_ip","$1","instance","(.+):(\\d+)") and on(internal_ip) kube_node_labels{cluster="core",zone=~"shanghai",label_env=~"prod"} * on(node) group_left(internal_ip) kube_node_info) > 80
Inspection Item: PID Usage > 80%
Source: prometheusList
label_replace(node_processes_threads{cluster="core",zone="shanghai"} / on(instance) min by(instance) (node_processes_max_processes or node_processes_max_threads{cluster="core",zone="shanghai"}),"internal_ip","$1","instance","(.+):(\\d+)") * 100 > 80
Inspection Item: FD Usage > 70%
Source: prometheusList
sum by(internal_ip) (label_replace(node_filefd_allocated{job="node-exporter",cluster="core",zone="shanghai"} * 100 / node_filefd_maximum{job="node-exporter",cluster="core",zone="shanghai"},"internal_ip","$1","instance","(.+):(\\d+)") ) > 70
Inspection Item: Time Sync Issues
Source: prometheusList
min_over_time(node_timex_sync_status{cluster="core",zone="shanghai"}[5m]) == 0 and node_timex_maxerror_seconds{cluster="core",zone="shanghai"} >= 16
Inspection Item: DockerHung Pods
Source: prometheusList
sum by(node) (rate(problem_counter{reason="DockerHung",cluster="core",zone="shanghai"}[1m])) > 0
Automated Inspection Platform
Each item's Source field (its action type) is one of bash, prometheus, prometheusOr, or prometheusList. Bash scripts reside on the K8s master node and print two lines: a result line and a status line (success, warning, or error). Prometheus-based items run a query and compare the result with the item's threshold.
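A minimal Bash sketch of both paths follows. It is not the platform's code: the function names parse_bash_output and check_threshold are hypothetical, and reporting "error" whenever a threshold such as [">",50] is breached is an assumption, since the article does not say which status a breached threshold maps to.

```bash
#!/bin/bash
# Sketch only: names and the "error" status for a breached threshold are assumptions.

# parse_bash_output reads an inspection script's output on stdin and prints
# "<status><TAB><result>". The last line is taken as the status line and
# everything before it as the result.
parse_bash_output() {
  local -a lines
  mapfile -t lines
  local n=${#lines[@]}
  if (( n == 0 )); then
    printf 'error\tno output\n'
    return
  fi
  local status=${lines[n-1]}
  local result=""
  if (( n > 1 )); then
    result="${lines[*]:0:n-1}"
  fi
  case "$status" in
    success|warning|error) printf '%s\t%s\n' "$status" "$result" ;;
    *) printf 'error\tunexpected status: %s\n' "$status" ;;
  esac
}

# check_threshold compares a metric value with a threshold pair such as [">",50].
# It prints "error" when "value op limit" holds, otherwise "success".
check_threshold() {
  local value=$1 op=$2 limit=$3
  case "$op" in
    '>'|'<'|'>='|'<='|'==') ;;
    *) echo "unsupported operator: $op" >&2; return 2 ;;
  esac
  # awk handles floating-point values, which [[ -gt ]] cannot.
  if awk -v v="$value" -v l="$limit" "BEGIN { exit !(v $op l) }"; then
    echo "error"
  else
    echo "success"
  fi
}
```

For example, `printf '| 3/5\nsuccess\n' | parse_bash_output` yields the status and the result separated by a tab, and `check_threshold 63.2 '>' 50` reports a breach of the Cluster CPU Usage threshold.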
All execution commands and script names are stored in a MySQL table; adding a new inspection item only requires inserting a rule into the table.
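Note: the PromQL stored in the table must be URL-encoded, which is why the cluster and zone placeholders in the pseudo-code below appear as %22core%22 and %22shanghai%22.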
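One way to produce the encoded form is a small pure-Bash percent-encoder. This is a sketch under stated assumptions: the function name urlencode is hypothetical, and it is intended for ASCII PromQL such as the expressions in this article.

```bash
#!/bin/bash
# Hypothetical helper: percent-encode a string for use as a URL query value.
# Unreserved characters (letters, digits, '-', '_', '.', '~') are kept as-is;
# every other character is emitted as %XX. Intended for ASCII input.
urlencode() {
  local LC_ALL=C
  local s=$1 out="" c hex i
  for (( i = 0; i < ${#s}; i++ )); do
    c=${s:i:1}
    case "$c" in
      [a-zA-Z0-9.~_-]) out+=$c ;;
      *)
        # "'$c" makes printf print the character's numeric code.
        printf -v hex '%%%02X' "'$c"
        out+=$hex
        ;;
    esac
  done
  printf '%s\n' "$out"
}

urlencode 'up{job="apiserver",cluster="core",zone="shanghai"}'
```

The double quotes around core become %22, so "core" is stored as %22core%22. When a query is sent ad hoc rather than stored, `curl -G --data-urlencode "query=..."` against the Prometheus `/api/v1/query` endpoint performs the same encoding.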
Core pseudo‑code (simplified):
// mu ensures that only one ScannerStart request is handled at a time.
var mu sync.Mutex
// ScannerRequest is the JSON body of a start-inspection request.
type ScannerRequest struct {
	CheckKeys       []int `json:"check_keys"`
	SelectedCluster int   `json:"selected_cluster"`
}
// ScannerStart validates the request and starts the selected inspection items.
// v2api and s.store are defined elsewhere in the project and are not shown here.
func (s *ScannerController) ScannerStart(g *gin.Context) {
	mu.Lock()
	defer mu.Unlock()
	s.store.UpdateAllStatus()
	var r ScannerRequest
	if err := g.ShouldBindJSON(&r); err != nil {
		v2api.AbnormalJsonResponse(g, "", "body parse error: "+err.Error())
		return
	}
	// Load cluster info from JSON strings into a map
	// Determine which scanner items to run based on CheckKeys
	// For each item launch a goroutine that:
	// - Retrieves action_type, action_detail, threshold from DB
	// - Replaces placeholders (%22core%22, %22shanghai%22) with actual cluster name/zone
	// - Executes the appropriate logic (prometheus, prometheusOr, prometheusList, bash)
	// - Updates DB with value and status (success, warning, error)
	v2api.NormalJsonResponse(g, "inspection started", "")
}
// Helper functions for Prometheus queries, SSH execution, etc.
Page display screenshots illustrate the UI of the inspection platform.
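The placeholder replacement step listed in the pseudo-code comments can be sketched in Bash, matching the inspection scripts. The function name render_query and the example cluster and zone values are hypothetical; the sketch assumes the stored query is already URL-encoded and that cluster and zone names contain only characters that need no encoding.

```bash
#!/bin/bash
# Hypothetical helper: substitute the encoded placeholders %22core%22 and
# %22shanghai%22 in a stored, URL-encoded PromQL string with the target
# cluster name and zone.
render_query() {
  local template=$1 cluster=$2 zone=$3
  # ${var//pattern/replacement} replaces every occurrence of the pattern.
  local q=${template//%22core%22/%22${cluster}%22}
  q=${q//%22shanghai%22/%22${zone}%22}
  printf '%s\n' "$q"
}

# Encoded form of: up{cluster="core",zone="shanghai"}
stored='up%7Bcluster%3D%22core%22%2Czone%3D%22shanghai%22%7D'
render_query "$stored" edge beijing
```

Because the substitution works on the encoded string, the same stored rule can be reused for every cluster without decoding and re-encoding the query.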
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
