
Building a Scalable Kubernetes Monitoring Architecture and Alert Management

This guide presents a layered Kubernetes monitoring architecture covering the control plane, worker nodes, resource objects, and extension plugins. It details a high‑availability Prometheus deployment, alert grouping strategies, custom CRD metrics, visualization dashboards, and best practices for reliable observability in cloud‑native environments.

dbaplus Community

1. System Architecture Overview

We divide the Kubernetes stack into four logical layers: control‑plane, worker‑node, resource‑object, and extension‑plugin.

Control‑plane layer: etcd, API Server, Scheduler, Controller Manager

Worker‑node layer: Kubelet, Kube‑proxy, CRI, CNI, CSI

Resource‑object layer: Pods, Deployments, StatefulSets, Horizontal Pod Autoscaler

Extension‑plugin layer: CoreDNS, Ingress Controller, KEDA, Argo Rollouts

2. Monitoring System Architecture

The monitoring stack is built on a high‑availability Prometheus deployment: two replicas scrape the same targets and both write to a VictoriaMetrics cluster via Remote Write. Alertmanager runs as an external cluster that deduplicates and groups alerts, then forwards them through a webhook. The alertsnitch component persists alert events, and Grafana visualizes data from VictoriaMetrics.
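As a rough illustration of the layout described above, the following values file sketches how the kube‑prometheus‑stack chart might be configured for two replicas, Remote Write, and an external Alertmanager. All hostnames and the cluster label value are placeholders, and the field names should be checked against the values.yaml of the chart version in use.

```yaml
# values-ha.yaml: a sketch, not a tested production configuration
alertmanager:
  enabled: false               # the bundled Alertmanager gives way to the external cluster

prometheus:
  prometheusSpec:
    replicas: 2                # both replicas scrape the same targets
    replicaExternalLabelName: prometheus_replica
    externalLabels:
      cluster: prod-cluster-01 # hypothetical cluster name
    remoteWrite:
      # vminsert endpoint of a VictoriaMetrics cluster (tenant 0); host is a placeholder
      - url: http://vminsert.example.internal:8480/insert/0/prometheus/api/v1/write
    additionalAlertManagerConfigs:
      - static_configs:
          - targets:
              - alertmanager-0.example.internal:9093
              - alertmanager-1.example.internal:9093
```

Because both replicas ship the same samples, the VictoriaMetrics side usually needs deduplication enabled (for example via its -dedup.minScrapeInterval flag) so that duplicate series do not inflate storage.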

3. Alert Management

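The root route groups alerts by appid and alertname. It waits 30 seconds before sending the first notification for a new group (group_wait), 5 minutes before notifying about changes to an existing group (group_interval), and 3 hours before re‑sending a notification for alerts that are still firing (repeat_interval). All alerts go to the default receiver.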

route:
  group_by: [appid, alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: 'default-receiver'

Key practices include:

High‑availability: Deploy Prometheus with two replicas and Remote Write to VictoriaMetrics.

Alert hub: Use an external Alertmanager cluster for convergence and webhook integration.

Data persistence: Store alert events with alertsnitch and use VictoriaMetrics as the time‑series database.

Visualization: Grafana connects to VictoriaMetrics for dashboards.

Alert routing is organized along two dimensions: business applications, identified by their AppID labels, and infrastructure components, which use an SRE‑specific AppID. For this to work, every resource must carry the AppID label. PromQL queries then link alerts to business metrics through that label.
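The routing idea above can be sketched as an extension of the root route with child routes. This is an assumption‑laden example: the appid value sre-infra, the receiver names other than default-receiver, the severity label, and the webhook URLs are all hypothetical and not taken from the original setup.

```yaml
route:
  receiver: 'default-receiver'
  group_by: [appid, alertname]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  routes:
    # Infrastructure alerts carry the SRE-owned AppID (value is hypothetical)
    - matchers:
        - 'appid="sre-infra"'
      receiver: 'sre-webhook'
    # Critical business alerts: notify sooner and repeat more often
    - matchers:
        - 'severity="critical"'
      receiver: 'business-webhook'
      group_wait: 10s
      repeat_interval: 1h

receivers:
  - name: 'default-receiver'
    webhook_configs:
      - url: http://alert-gateway.example.internal/hooks/default
  - name: 'sre-webhook'
    webhook_configs:
      - url: http://alert-gateway.example.internal/hooks/sre
  - name: 'business-webhook'
    webhook_configs:
      - url: http://alert-gateway.example.internal/hooks/business
```

Child routes are evaluated in order and the first match wins unless continue: true is set, so the SRE route is placed first to keep infrastructure alerts out of the business channel.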

4. Monitoring System Deployment

Deploy the Prometheus Operator with the kube‑prometheus‑stack Helm chart:

# Add Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts

# Pull the chart (ensure version matches your Kubernetes version)
helm pull prometheus-community/kube-prometheus-stack --version 69.8.2
tar -xvf kube-prometheus-stack-69.8.2.tgz
cd kube-prometheus-stack/

# Optional image registry acceleration
chmod +x update_registry.sh
./update_registry.sh

# Install into the monitoring namespace
helm -n monitoring install kube-prometheus-stack ./ --create-namespace

Image‑registry acceleration script (bash) replaces public registries with a faster mirror:

#!/bin/bash
# Pick the in-place sed arguments per OS. An array keeps the empty
# backup suffix that BSD/macOS sed requires as a separate argument.
if [[ "$(uname)" == "Darwin" ]]; then
  SED_INPLACE=(sed -i '')
else
  SED_INPLACE=(sed -i)
fi

# Find all YAML files and replace registry URLs.
# The parentheses make -type f apply to both name patterns.
find . -type f \( -name "*.yaml" -o -name "*.yml" \) | while IFS= read -r yaml_file; do
  echo "Processing $yaml_file"
  # ... (awk logic omitted for brevity)
  "${SED_INPLACE[@]}" \
    -e 's|registry: docker\.io|registry: m.daocloud.io|g' \
    -e 's|registry: registry\.k8s\.io|registry: m.daocloud.io|g' \
    -e 's|registry: quay\.io|registry: m.daocloud.io|g' \
    -e 's|registry: ghcr\.io|registry: m.daocloud.io|g' \
    "$yaml_file"
  echo "Finished $yaml_file"
done
echo "All YAML files processed!"

Custom metric collection for Argo Rollouts is added via a ConfigMap and RBAC extensions:

# customresourcestate-argo.yaml
kind: CustomResourceStateMetrics
spec:
  resources:
    - groupVersionKind:
        group: argoproj.io
        version: v1alpha1
        kind: Rollout
      metrics:
        - name: argo_rollout_appid
          help: "Argo Rollout application identifier"
          each:
            type: Info
            info:
              # Label values are read from paths inside the Rollout object
              labelsFromPath:
                exported_namespace: [metadata, namespace]
                appid: [metadata, labels, appid]

Deploy the ConfigMap and update RBAC:

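kubectl -n monitoring create configmap customresourcestate-config --from-file=customresourcestate-argo.yaml

The following ClusterRole lets kube‑state‑metrics list and watch the CRD definitions and the Rollout objects:

apiVersion: rbac.authorization.k8s.io/v1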
kind: ClusterRole
metadata:
  name: kube-state-metrics-argo
rules:
- apiGroups: ["apiextensions.k8s.io"]
  resources: ["customresourcedefinitions"]
  verbs: ["list","watch"]
- apiGroups: ["argoproj.io"]
  resources: ["rollouts"]
  verbs: ["list","watch"]

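Mount the ConfigMap into the kube‑state‑metrics pod and pass the path of the mounted file to the --custom-resource-state-config-file flag. The ClusterRole must also be bound to the kube‑state‑metrics ServiceAccount with a ClusterRoleBinding before the new rules take effect.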
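One possible way to do the mounting through the chart is a values override for the bundled kube‑state‑metrics subchart. This sketch assumes the subchart exposes volumes, volumeMounts, and extraArgs keys, which should be verified against the chart version in use. The mount path is an arbitrary choice, while the file name matches the ConfigMap key produced by --from-file.

```yaml
# values-ksm.yaml: a sketch under the assumptions stated above
kube-state-metrics:
  volumes:
    - name: customresourcestate
      configMap:
        name: customresourcestate-config
  volumeMounts:
    - name: customresourcestate
      mountPath: /etc/customresourcestate   # arbitrary mount path
      readOnly: true
  extraArgs:
    # The key inside the ConfigMap is the original file name
    - --custom-resource-state-config-file=/etc/customresourcestate/customresourcestate-argo.yaml
```

The override would then be applied with a helm upgrade that passes this file via -f alongside any other values files.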

5. Monitoring Visualization

The global overview dashboard aggregates clusters, regions, and environments, showing resource watermarks (node count, CPU/Memory totals, pod quota usage) and health indicators such as etcd election status and API server availability. Additional panels provide anomaly monitoring (node load, pod crash loops) and business metrics (QPS, error rate, health‑check success).

Key PromQL building blocks include the aggregation operators count, sum, max, and avg; the unless set operator; the group_left matching modifier; and the functions label_replace, rate, and min_over_time.
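To make these building blocks concrete, here is a hedged PrometheusRule sketch that uses several of them together. The rule names, thresholds, and the release label are illustrative assumptions. The label_appid label on kube_pod_labels is only available when kube‑state‑metrics is configured to export the appid pod label (for example through its metric labels allowlist).

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: appid-overview-rules
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumed to match the chart's rule selector
spec:
  groups:
    - name: appid-overview
      rules:
        # sum + group_left + label_replace: pod restarts enriched with the owning appid
        - alert: PodRestartingFrequently
          expr: |
            label_replace(
              sum by (namespace, pod) (increase(kube_pod_container_status_restarts_total[15m]))
                * on (namespace, pod) group_left (label_appid)
              kube_pod_labels{label_appid!=""},
              "appid", "$1", "label_appid", "(.+)"
            ) > 3
          for: 5m
          labels:
            severity: warning
        # min_over_time: an API server target was down at some point in the last 5 minutes
        - alert: ApiServerTargetDown
          expr: min_over_time(up{job="apiserver"}[5m]) == 0
          labels:
            severity: critical
        # count + unless: nodes that are registered but not Ready
        - record: cluster:node_not_ready:count
          expr: |
            count(
              kube_node_info
                unless on (node)
              (kube_node_status_condition{condition="Ready", status="true"} == 1)
            )
```

Note that count over an empty result returns no series rather than 0, so a dashboard panel built on the recording rule may need a fallback value when all nodes are Ready.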

6. Best‑Practice Summary

Tag governance: Enforce strict AppID labeling across all resources to unify monitoring, logging, and tracing.

Collection optimization: Use 15‑second scrape intervals for critical metrics and 1‑minute for business metrics.

Capacity planning: Estimate storage needs as active series × samples per series per day (86,400 s ÷ scrape interval) × retention days × bytes per sample.

Alert convergence: Tier alerts with immediate notification for critical issues and delayed handling for warnings.

Version management: Keep Helm chart versions aligned with the Kubernetes version and verify compatibility regularly.
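The capacity‑planning estimate in the list above can be written out with explicit units. The numbers below are purely illustrative, and the bytes‑per‑sample figure b is an assumption: the Prometheus documentation cites roughly 1 to 2 bytes per sample for its local storage, and VictoriaMetrics compression will differ, so measure b on real data before sizing disks.

```latex
\begin{aligned}
S &\approx N_{\text{series}} \times \frac{86400\ \text{s}}{\Delta t_{\text{scrape}}} \times D_{\text{retention}} \times b \\[4pt]
  &\text{with } N_{\text{series}} = 10^{6},\quad \Delta t_{\text{scrape}} = 15\ \text{s},\quad D_{\text{retention}} = 30\ \text{days},\quad b = 1\ \text{byte (assumed)} \\[4pt]
S &\approx 10^{6} \times 5760 \times 30 \times 1\ \text{B} = 1.728 \times 10^{11}\ \text{B} \approx 173\ \text{GB}
\end{aligned}
```

Moving the same series to a 1‑minute interval cuts the samples per day from 5760 to 1440, so the estimate drops by a factor of four. This is the main lever behind using different scrape intervals for critical and business metrics.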

Following this layered design and the outlined implementation steps yields a comprehensive, cloud‑native observability solution that covers infrastructure, Kubernetes core components, and application‑level metrics.
