Building a Scalable, High‑Availability Kubernetes Monitoring System with Prometheus
This article details the design and evolution of a highly available, persistent, and dynamically adjustable Kubernetes monitoring solution at Xiaomi, covering initial Falcon‑based approaches, the transition to Prometheus with remote storage via OpenTSDB, federation‑based partitioning, deployment strategies, performance testing, and future enhancements.
Background and Challenges
Xiaomi’s elastic scheduling platform (Ocean) and container platform rely on Kubernetes (k8s) to provide services. Monitoring k8s clusters at this scale is complex: each container behaves like a host, a single node can expose more than 10,000 metrics, and pods are created and destroyed dynamically.
Key challenges include:
More monitoring dimensions (core services, containers, pods, namespaces, etc.).
Dynamic and volatile monitoring targets.
Explosive growth of metrics with container scale.
Need for dynamic scaling of the monitoring system.
Additional constraints stem from Xiaomi’s internal environment:
Multiple heterogeneous k8s clusters (fusion‑cloud, Ocean, CloudML) with differing deployment, network, and storage models.
Open‑Falcon, the company‑wide monitoring/alerting system, does not support the pull‑based collection model required by k8s and lacks flexible aggregation for hierarchical resources.
Requirement to persist monitoring data using existing databases for long‑term analysis.
Existing Open‑Source Monitoring Options
Two common solutions were evaluated:
Heapster/Metrics‑Server + InfluxDB + Grafana – Simple deployment but limited to node‑level metrics and unsuitable for full‑cluster k8s monitoring.
Exporter + Prometheus + Adapter – Offers multi‑dimensional metrics, powerful PromQL queries, and dynamic service discovery, though samples can occasionally be lost, so it is a poor fit where 100% data accuracy is required.
Initial Falcon‑Based Monitoring
The first implementation leveraged Open‑Falcon and custom exporters to collect core metrics (CPU, memory, network) from pods, nodes, and the cluster. Architecture is shown below:
Data sources included cadvisor‑exporter for container metrics, kube‑state‑exporter for pod‑level metrics, and falcon‑agent for physical node metrics. This solution lacked comprehensive coverage (e.g., apiserver, etcd) and did not provide persistent storage.
Migration to Prometheus
After evaluating shortcomings, Prometheus was selected for its native k8s support, high performance (up to 100 k metrics per second), and powerful PromQL query language.
Key components:
Data Sources : node‑exporter, kube‑state‑metrics, cadvisor, plus custom metrics exposed via prometheus.io/scrape: "true" annotations.
Prometheus Processing : Deployed as a Pod with Prom‑Reloader for hot‑reloading configuration stored in a ConfigMap.
Storage Backend : Local TSDB for short‑term data and remote write to OpenTSDB (via a Falcon‑Adapter) for long‑term persistence.
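As a rough illustration of the annotation-based opt-in listed above, the sketch below filters pods the way a `keep` relabel rule on the scrape annotation would. The pod dictionaries and both helper functions are hypothetical; only the prometheus.io/scrape annotation comes from the article. The meta-label name follows the Prometheus Kubernetes service discovery convention of replacing unsupported characters with underscores.

```python
import re

SCRAPE_ANNOTATION = "prometheus.io/scrape"


def annotation_meta_label(annotation: str) -> str:
    """Return the service-discovery meta label a pod annotation maps to.

    Characters outside [a-zA-Z0-9_] are replaced with underscores.
    """
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", annotation)
    return "__meta_kubernetes_pod_annotation_" + sanitized


def select_scrape_targets(pods: list[dict]) -> list[str]:
    """Keep only pods that opted in with prometheus.io/scrape: "true".

    Annotation values are strings, which is why the value is quoted
    in the pod manifest and compared against the string "true" here.
    """
    return [
        pod["name"]
        for pod in pods
        if pod.get("annotations", {}).get(SCRAPE_ANNOTATION) == "true"
    ]


if __name__ == "__main__":
    # Hypothetical pods, for illustration only.
    pods = [
        {"name": "app-a", "annotations": {SCRAPE_ANNOTATION: "true"}},
        {"name": "app-b", "annotations": {SCRAPE_ANNOTATION: "false"}},
        {"name": "app-c"},
    ]
    print(annotation_meta_label(SCRAPE_ANNOTATION))
    print(select_scrape_targets(pods))
```

Pods without the annotation are never scraped, so teams expose custom metrics by adding one line to their manifests rather than by editing the Prometheus configuration.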
Architecture diagram:
Deployment Model
All monitoring components are deployed as Deployments or DaemonSets to ensure reliability. Configuration files reside in ConfigMaps with automatic reload via Prom‑Reloader. Storage uses both local blocks (2‑hour windows) and remote write to OpenTSDB. Remote read/write interfaces allow Prometheus to interact with any third‑party storage.
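The article does not describe how the Falcon‑Adapter translates samples, so the following is only a minimal sketch of the kind of conversion a remote-write adapter for OpenTSDB has to perform. It assumes the OpenTSDB line format `put <metric> <timestamp> <value> <tagk=tagv ...>`; the function names and the fallback tag are hypothetical. One detail worth showing is that Prometheus names may contain colons, which OpenTSDB does not accept, so names and tags are sanitized.

```python
import re

# Conservative character set for OpenTSDB metric names and tags.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9\-_./]")


def sanitize(text: str) -> str:
    """Replace characters outside the conservative set with '_'."""
    return _INVALID_CHARS.sub("_", text)


def to_opentsdb_put(name: str, labels: dict, timestamp_ms: int, value: float) -> str:
    """Render one Prometheus sample as an OpenTSDB 'put' line.

    Prometheus timestamps are in milliseconds; this sketch writes seconds.
    Labels with empty values are dropped.
    """
    tags = {sanitize(k): sanitize(v) for k, v in labels.items() if v}
    if not tags:
        # OpenTSDB requires at least one tag; this fallback is hypothetical.
        tags = {"source": "prometheus"}
    tag_text = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {sanitize(name)} {timestamp_ms // 1000} {value} {tag_text}"


if __name__ == "__main__":
    # Hypothetical sample, for illustration only.
    print(to_opentsdb_put("node:cpu_usage:rate", {"node": "host-1"}, 1700000000123, 0.5))
```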
Limitations of a Single Prometheus Instance
As cluster size grew, two main issues emerged:
Falcon‑agent and its transfer component became bottlenecks, causing data loss when ingesting >150 000 samples per minute.
Prometheus CPU and memory usage increased sharply; a single instance struggled to scrape all targets, leading to occasional missed metrics.
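To put the 150,000 samples per minute figure in perspective, here is a back-of-envelope sketch. It assumes every active series produces exactly one sample per scrape; the scrape intervals used are assumptions, since the article does not state the interval.

```python
def samples_per_second(samples_per_minute: float) -> float:
    """Convert a per-minute ingestion rate to a per-second rate."""
    return samples_per_minute / 60.0


def max_series_for_budget(samples_per_minute: float, scrape_interval_s: float) -> int:
    """Number of series that fit in a per-minute sample budget.

    Assumes one sample per series per scrape.
    """
    scrapes_per_minute = 60.0 / scrape_interval_s
    return int(samples_per_minute // scrapes_per_minute)


if __name__ == "__main__":
    budget = 150_000  # samples per minute, from the article
    print(samples_per_second(budget))
    for interval in (60, 30, 15):  # assumed scrape intervals in seconds
        print(interval, max_series_for_budget(budget, interval))
```

Under these assumptions the budget is 2,500 samples per second, and halving the scrape interval halves the number of series that fit within it.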
Performance chart of an online cluster:
Partitioned Monitoring Solution (Federation)
To address scalability, a federated Prometheus architecture was introduced, consisting of a master Prometheus, multiple slave Prometheus instances, and a dedicated kube‑state Prometheus.
Federation allows the master to scrape aggregated data from slaves, reducing load on any single node.
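The master's federation job in this article selects series with the matcher `{__name__=~"pod:.*|node:.*"}`, so only pre-aggregated series whose names start with `pod:` or `node:` are pulled from the slaves. The sketch below shows which names such a matcher would accept. Prometheus regex matchers are fully anchored, which `fullmatch` reproduces; the metric names in the example are hypothetical.

```python
import re

# Same pattern as the match[] selector in the federation job.
FEDERATE_NAME_PATTERN = re.compile(r"pod:.*|node:.*")


def is_federated(metric_name: str) -> bool:
    """Return True if the master would pull this series from a slave."""
    return FEDERATE_NAME_PATTERN.fullmatch(metric_name) is not None


if __name__ == "__main__":
    # Hypothetical series names, for illustration only.
    for name in (
        "pod:cpu_usage:sum",
        "node:memory_usage:ratio",
        "container_cpu_usage_seconds_total",
        "my_pod:thing",
    ):
        print(name, is_federated(name))
```

Raw per-container series stay on the slaves, which keeps the volume the master has to scrape small.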
Two partitioning strategies are used:
Functional Partitioning : Different Prometheus instances handle distinct jobs (e.g., per data center).
Horizontal Scaling : The same job is split across multiple slaves using hashmod relabeling.
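The horizontal-scaling strategy can be simulated outside Prometheus. The sketch below assigns node hostnames to slaves the way a hashmod relabel step followed by a keep rule does. To my understanding Prometheus computes hashmod as the MD5 of the source label value, taking the last 8 bytes as a big-endian integer modulo the modulus; treat that detail as an assumption. The hostnames and the `assign_targets` helper are hypothetical.

```python
import hashlib


def hashmod(value: str, modulus: int) -> int:
    """Map a label value to a shard index in [0, modulus)."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Use the last 8 bytes of the digest as an unsigned big-endian integer.
    return int.from_bytes(digest[8:], "big") % modulus


def assign_targets(hostnames: list[str], modulus: int) -> dict[int, list[str]]:
    """Group hostnames by the slave that would keep them."""
    shards: dict[int, list[str]] = {i: [] for i in range(modulus)}
    for host in hostnames:
        shards[hashmod(host, modulus)].append(host)
    return shards


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    hosts = [f"node-{i}" for i in range(12)]
    for slave_id, members in assign_targets(hosts, 3).items():
        print(slave_id, members)
```

Each node lands on exactly one slave and the assignment is deterministic, so no coordination between slaves is required. Changing the modulus reshuffles most targets, which is worth keeping in mind when adding slaves.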
Configuration example for the master to federate slaves:
- job_name: federate-slave
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{__name__=~"pod:.*|node:.*"}'
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names:
          - kube-system
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      action: keep
      regex: prometheus-slave.*
Slave configuration using hashmod to slice node targets:
- job_name: kubelet
  scheme: https
  kubernetes_sd_configs:
    - role: node
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - source_labels: []
      regex: __meta_kubernetes_node_label_(.+)
      replacement: "$1"
      action: labelmap
    - source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
      modulus: ${modulus}
      target_label: __tmp_hash
      action: hashmod
    - source_labels: [__tmp_hash]
      regex: ${slaveId}
      action: keep
Testing and Performance Validation
Functional tests compared aggregated metrics before and after partitioning: for more than 95% of time series, the values differed by less than 1%.
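The article does not say how the comparison was computed. One plausible way to score it, shown here purely as a sketch, is the fraction of series whose relative difference stays within a tolerance. The series names, values, and the `fraction_within_tolerance` helper are hypothetical.

```python
def fraction_within_tolerance(before: dict, after: dict, tolerance: float = 0.01) -> float:
    """Fraction of shared series whose relative difference is <= tolerance.

    Series present in only one snapshot are ignored. A series that was
    exactly zero before counts as matching only if it is still zero.
    """
    common = before.keys() & after.keys()
    if not common:
        return 0.0
    matching = 0
    for key in common:
        old, new = before[key], after[key]
        if old == 0:
            within = new == 0
        else:
            within = abs(new - old) / abs(old) <= tolerance
        matching += int(within)
    return matching / len(common)


if __name__ == "__main__":
    # Hypothetical aggregated values, for illustration only.
    before = {"a": 100.0, "b": 200.0, "c": 50.0, "d": 0.0}
    after = {"a": 100.5, "b": 210.0, "c": 50.0, "d": 0.0}
    print(fraction_within_tolerance(before, after))
```

In this made-up example three of the four series agree within 1%, giving 0.75; the article reports a figure above 0.95 for the real clusters.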
Performance tests on a cluster with 1,000 virtual nodes evaluated several load scenarios.
Results indicated that master and kube‑state Prometheus can handle up to 80 k pods per minute, while each slave comfortably scrapes >400 nodes (≈60 pods per node). Remote write to OpenTSDB adds modest overhead.
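These figures allow a quick sizing estimate. The sketch below uses the numbers reported above (400 nodes per slave as a conservative floor, about 60 pods per node) and treats the 80 k figure as a ceiling on the total pods the master and kube‑state Prometheus can cover, which is an interpretation rather than something the article states. The `plan_partitions` helper is hypothetical.

```python
import math


def plan_partitions(
    nodes: int,
    pods_per_node: int = 60,
    nodes_per_slave: int = 400,
    max_total_pods: int = 80_000,
) -> dict:
    """Estimate slave count and check the pod total against the ceiling."""
    slaves = math.ceil(nodes / nodes_per_slave)
    total_pods = nodes * pods_per_node
    return {
        "slaves": slaves,
        "total_pods": total_pods,
        "within_master_limit": total_pods <= max_total_pods,
    }


if __name__ == "__main__":
    print(plan_partitions(1000))
```

For the 1,000-node test cluster this gives three slaves and 60,000 pods, which sits below the assumed ceiling. The slave count here is also the value that would be substituted for the modulus in the hashmod configuration.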
Future Outlook
The partitioned solution is now deployed in several clusters, offering high availability, persistent storage, and dynamic scaling. Planned improvements include automatic scaling of monitoring components, performance optimization of kube‑state‑metrics, deployment simplification via prometheus‑operator and Helm, and advanced analytics such as capacity‑prediction algorithms.
ITPUB