
Designing a Scalable, High‑Availability Kubernetes Monitoring Solution at Xiaomi

This article details Xiaomi's implementation of a highly available, persistent, and dynamically scalable Kubernetes monitoring system, covering challenges, architecture choices, Prometheus federation, performance testing, and future enhancements for cloud‑native observability.


Xiaomi's elastic scheduling platform (Ocean) and container platform rely on Kubernetes (k8s) to provide services, requiring a comprehensive monitoring system to ensure container service quality. Each container behaves like a host, leading to a massive number of metrics (over 10,000 per node).

Key challenges include:

More monitoring dimensions (core services, containers, pods, namespaces, etc.).

Dynamic and frequently changing monitoring objects.

Explosive growth of metrics with container scale.

Need for dynamic scaling of the monitoring system.

Additional considerations involve multiple heterogeneous clusters (fusion cloud, Ocean, CloudML), Open‑Falcon's lack of native k8s pull‑based collection, and the need for long‑term persistent storage.

Mature, off-the-shelf solutions for k8s monitoring include:

Heapster/Metrics‑Server + InfluxDB + Grafana (Heapster is now deprecated).

Exporter + Prometheus + Adapter.

Monitoring Solution and Evolution

Initial Solution

To quickly deploy k8s, the initial monitoring system leveraged Falcon and custom exporters to collect core metrics such as pod CPU, memory, and network usage.

The architecture used cadvisor‑exporter for container metrics, kube‑state‑exporter for key pod metrics, and Falcon‑agent for physical node data. This initial setup lacked comprehensive data (e.g., apiserver, etcd) and persistent storage.

Prometheus‑Based Monitoring System

Due to the shortcomings of the initial system, Prometheus was selected for its native k8s support, high performance (ingesting up to 100k samples per second on a single instance), and powerful PromQL query language.

Data Sources: node‑exporter (physical nodes), kube‑state‑metrics (k8s objects), cadvisor (containers), core components (apiserver, etcd, scheduler, GPU), and custom metrics exposed via pod annotations.
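For the annotation-driven custom metrics, pods typically opt in to scraping via annotations that the Prometheus relabel rules match on. The article does not name the exact annotation keys Xiaomi used; the `prometheus.io/*` keys below are the common community convention, shown as a sketch:

```yaml
# Hypothetical pod exposing custom metrics for annotation-based discovery.
# The prometheus.io/* keys are the widely used convention, not necessarily
# the exact keys matched by Xiaomi's relabel rules.
apiVersion: v1
kind: Pod
metadata:
  name: demo-app
  annotations:
    prometheus.io/scrape: "true"    # opt this pod into scraping
    prometheus.io/port: "8080"      # port serving the metrics endpoint
    prometheus.io/path: "/metrics"  # metrics endpoint path
spec:
  containers:
  - name: demo-app
    image: demo-app:latest
    ports:
    - containerPort: 8080
```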

Prometheus Processing Module: Deployed as a pod containing Prometheus and Prom‑Reloader. Configuration rules are stored in a ConfigMap and hot‑reloaded without restarting Prometheus.
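The pod layout might look like the following sketch: Prometheus plus a reloader sidecar sharing a ConfigMap-backed volume. Prom‑Reloader is Xiaomi's in‑house component, so its image name and wiring here are placeholders; `--web.enable-lifecycle` is the standard Prometheus flag that enables hot reload via `POST /-/reload`.

```yaml
# Sketch only: the prom-reloader image and mount layout are illustrative.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-master
spec:
  replicas: 1
  selector:
    matchLabels: {app: prometheus-master}
  template:
    metadata:
      labels: {app: prometheus-master}
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus
        args:
        - --config.file=/etc/prometheus/prometheus.yml
        - --web.enable-lifecycle        # allow POST /-/reload for hot reload
        volumeMounts:
        - {name: config, mountPath: /etc/prometheus}
      - name: prom-reloader             # watches the ConfigMap, triggers reload
        image: prom-reloader:latest     # placeholder image name
        volumeMounts:
        - {name: config, mountPath: /etc/prometheus}
      volumes:
      - name: config
        configMap: {name: prometheus-config}
```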

Storage Backend: Falcon for alerting and visualization, OpenTSDB for long‑term storage.

Falcon‑Adapter forwards monitoring data to Falcon for alerts, while OpenTSDB‑Adapter writes data to OpenTSDB (HBase‑backed) for persistence.

Deployment: All core monitoring components are deployed via Deployment/DaemonSet, with ConfigMaps driving automatic configuration updates.

Storage Method: Prometheus uses local block storage for short‑term data (2‑hour blocks) and remote storage (OpenTSDB) for long‑term retention. Multiple local storage types (PVC, LVM, local disk) are supported.
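In terms of flags, the split between short-lived local blocks and remote long-term storage might look like the container args below. The retention value is illustrative (the article does not state the local retention), and the flag is spelled `--storage.tsdb.retention.time` in Prometheus 2.7+ (plain `--storage.tsdb.retention` in earlier 2.x releases):

```yaml
# Container args fragment for the Prometheus pod; values are illustrative.
args:
- --config.file=/etc/prometheus/prometheus.yml
- --storage.tsdb.path=/prometheus      # local block storage (PVC, LVM, or local disk)
- --storage.tsdb.retention.time=24h    # short local retention; long-term data lives in OpenTSDB
```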

Remote write/read interfaces enable Prometheus to push data to OpenTSDB and read it back when needed.
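Assuming both adapters expose the standard Prometheus remote endpoints, the corresponding `prometheus.yml` stanza is small; the service names, ports, and paths below are placeholders, not the real deployment's values:

```yaml
# prometheus.yml fragment: fan samples out to both adapters and read
# historical data back through the OpenTSDB adapter. URLs are illustrative.
remote_write:
- url: http://falcon-adapter.kube-system:8080/write    # alerting/visualization path
- url: http://opentsdb-adapter.kube-system:8080/write  # long-term persistence
remote_read:
- url: http://opentsdb-adapter.kube-system:8080/read
```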

A single Prometheus instance ran into problems as the cluster scaled: the push load overwhelmed Falcon‑agent and caused data loss, while Prometheus's own CPU and memory consumption grew to the point of missed scrapes.

To address scalability, Prometheus federation was adopted, allowing hierarchical aggregation and load distribution.

Two partitioning approaches are used:

Functional partitioning: Deploy multiple Prometheus servers per data center, each handling a subset of jobs, with a central Prometheus aggregating results.

Horizontal scaling: Split a large job across multiple Prometheus instances using relabeling and hashmod.

Configuration for federating slave Prometheus instances:

- job_name: federate-slave
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{__name__=~"pod:.*|node:.*"}'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - kube-system
  relabel_configs:
  - source_labels:
    - __meta_kubernetes_pod_label_app
    action: keep
    regex: prometheus-slave.*

Hashmod‑based partitioning for slave Prometheus:

- job_name: kubelet
  scheme: https
  kubernetes_sd_configs:
  - role: node
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  - regex: __meta_kubernetes_node_label_(.+)
    replacement: "$1"
    action: labelmap
  - source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
    modulus:       ${modulus}
    target_label:  __tmp_hash
    action:        hashmod
  - source_labels: [__tmp_hash]
    regex:         ${slaveId}
    action:        keep

Deployment Details

Master Prometheus and kube‑state Prometheus are deployed via Deployment, while multiple slave Prometheus instances are deployed using StatefulSet to give each pod a unique identifier.
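A trimmed StatefulSet sketch for the slaves follows. The stable ordinal in each pod name (prometheus-slave-0, -1, …) is what lets Prom‑Reloader hand each replica its own numbered configuration; the replica count, image, and config naming scheme here are illustrative, not the real deployment's values:

```yaml
# Sketch: each replica gets a stable identity via its StatefulSet ordinal.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus-slave
spec:
  serviceName: prometheus-slave
  replicas: 3                             # one shard per replica; illustrative count
  selector:
    matchLabels: {app: prometheus-slave}
  template:
    metadata:
      labels: {app: prometheus-slave}     # matched by the federate-slave job above
    spec:
      containers:
      - name: prometheus
        image: prom/prometheus
        env:
        - name: POD_NAME                  # e.g. prometheus-slave-0
          valueFrom: {fieldRef: {fieldPath: metadata.name}}
        args:
        - --config.file=/etc/prometheus/prometheus-$(POD_NAME).yml  # per-shard config
```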

Prom‑Reloader watches ConfigMap changes and generates numbered configurations for each slave.

Testing and Verification

Functional tests compared one‑hour averages before and after partitioning; over 95% of the time series differed by less than 1%.

Performance tests on a cluster with 1000 virtual nodes demonstrated that master Prometheus and kube‑state Prometheus can handle up to 80,000 pods, while each slave can scrape over 400 nodes (≈60 pods per node).

These results confirm that the federated Prometheus architecture supports up to 80k pods, meeting expected cluster growth.

Outlook

The partitioned monitoring solution is now deployed in several clusters, offering high availability, persistent storage, and dynamic scaling. Future work includes automatic scaling of the monitoring system, performance optimization of kube‑state‑metrics, simplifying deployment with prometheus‑operator and Helm, and applying advanced analytics to monitoring data for capacity prediction and intelligent alerting.

Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
