Building a Scalable, High‑Availability Kubernetes Monitoring System with Prometheus and OpenTSDB
This article details Xiaomi's end‑to‑end, highly available Kubernetes monitoring solution that combines Prometheus, OpenTSDB, and Falcon to handle massive dynamic metrics, ensure persistent storage, and support seamless scaling across multiple clusters.
When Monitoring Meets K8s
Xiaomi's elastic scheduling platform (Ocean) and its container platform both run on Kubernetes, so service quality depends on a robust monitoring system. Unlike monitoring on traditional physical hosts, every container has to be treated as a host in its own right, which pushes a single node past 10,000 metrics and calls for dynamic, scalable monitoring that integrates with the existing Open‑Falcon alarm system.
The monitoring challenges include:
More dimensions: physical host metrics plus core services (apiserver, etcd), container, pod, and namespace metrics.
Dynamic objects: containers are created and destroyed frequently, preventing static pre‑configuration.
Metric explosion: massive metric volume requires efficient processing and visualization.
Dynamic scaling: the monitoring system itself has to scale as the cluster grows.
Additional internal constraints are the diversity of clusters (fusion‑cloud, Ocean, CloudML), Open‑Falcon’s lack of native pull‑based collection, and the requirement for long‑term persistent storage.
Monitoring Solution and Evolution
Initial Solution
The first implementation used Falcon together with custom exporters to collect core metrics such as pod CPU, memory, and network usage. While it provided basic visibility, it missed many components (e.g., apiserver, etcd), lacked persistent storage, and required manual exporter development.
Prometheus‑Based System
After evaluation, Prometheus was chosen for its native Kubernetes support, high performance (up to 100k metrics per second), and powerful PromQL query language. The architecture includes:
Data sources: node‑exporter, kube‑state‑metrics, cAdvisor, and custom metrics exposed via annotations.
Prometheus deployment as a Pod with Prom‑Reloader for hot‑reloading configuration stored in ConfigMaps.
Storage backend: local TSDB for short‑term data and remote write to OpenTSDB (via an adapter) for long‑term persistence.
Alerting and visualization: metrics are forwarded to Open‑Falcon for alarms and dashboards, while OpenTSDB stores historical data.
Prometheus runs in Deployment/DaemonSet form, ensuring reliability, and all configuration files are managed via ConfigMaps.
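The hot-reload step can be pictured roughly as follows. This is only a sketch, not Prom‑Reloader's actual code, which the article does not show: `Reloader`, `config_digest`, and `post_reload` are hypothetical names, and the sketch assumes Prometheus was started with `--web.enable-lifecycle` so that an HTTP POST to `/-/reload` makes it re-read its configuration.

```python
import hashlib
import urllib.request


def config_digest(config_text: str) -> str:
    """Fingerprint the mounted configuration so changes are cheap to detect."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


def post_reload(base_url: str) -> None:
    """Ask Prometheus to re-read its config (needs --web.enable-lifecycle)."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/-/reload", data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()


class Reloader:
    """Hypothetical watcher: fires `trigger` only when the config changes."""

    def __init__(self, trigger):
        self._trigger = trigger
        self._last_digest = None

    def observe(self, config_text: str) -> bool:
        digest = config_digest(config_text)
        if digest == self._last_digest:
            return False  # nothing changed, no reload
        is_first_load = self._last_digest is None
        self._last_digest = digest
        if is_first_load:
            return False  # Prometheus already started with this config
        self._trigger()
        return True


# Demo with a stand-in trigger instead of a real HTTP call.
reload_calls = []
reloader = Reloader(trigger=lambda: reload_calls.append("reload"))
reloader.observe("scrape_interval: 30s")  # initial load
reloader.observe("scrape_interval: 30s")  # unchanged
reloader.observe("scrape_interval: 15s")  # changed, triggers a reload
print(len(reload_calls))
```

In a real Pod the trigger would be something like `lambda: post_reload("http://localhost:9090")`, with the configuration text read from the file that the ConfigMap is mounted as.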
Remote storage is implemented through Prometheus’s remote_write and remote_read interfaces, sending samples to OpenTSDB and reading them back when needed.
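To make the write path concrete, here is a minimal sketch of the kind of translation an adapter has to perform between a Prometheus sample and an OpenTSDB data point. The article does not show Xiaomi's adapter, so the function names and the character-escaping rule are assumptions. The output shape follows the JSON fields that OpenTSDB's `/api/put` endpoint accepts (`metric`, `timestamp`, `value`, `tags`).

```python
import json
import re

# OpenTSDB only allows letters, digits and "-", "_", ".", "/" in names.
# Replacing everything else with "_" is an assumption made for this sketch;
# a production adapter may escape characters differently.
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9\-_./]")


def sanitize(text: str) -> str:
    return _INVALID_CHARS.sub("_", text)


def to_opentsdb_point(labels: dict, value: float, timestamp_ms: int) -> dict:
    """Map one Prometheus sample (label set, value, ms timestamp) to a data point."""
    metric = sanitize(labels["__name__"])
    tags = {
        sanitize(name): sanitize(label_value)
        for name, label_value in labels.items()
        if name != "__name__" and label_value  # skip empty label values
    }
    return {
        "metric": metric,
        "timestamp": timestamp_ms // 1000,  # Prometheus uses ms, sent here as seconds
        "value": value,
        "tags": tags,  # OpenTSDB expects at least one tag per data point
    }


sample_labels = {
    "__name__": "container_cpu_usage_seconds_total",
    "namespace": "default",
    "pod": "web-0",
}
point = to_opentsdb_point(sample_labels, 12.5, 1_700_000_000_123)
print(json.dumps(point, sort_keys=True))
```

Reading data back through `remote_read` needs the reverse mapping, which is one reason the escaping rule has to be reversible in a real adapter. This sketch does not attempt that.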
Issues with a Single‑Instance Setup
As the cluster grew, the single Prometheus instance caused:
High pressure on Falcon‑agent, leading to data loss when traffic exceeded 150,000 samples per minute.
Significant CPU and memory consumption, with occasional scrape failures and performance bottlenecks.
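A rough back-of-the-envelope calculation shows why Falcon‑agent became the bottleneck first. It uses only figures quoted in this article, treats "metrics" and "samples" as interchangeable, and assumes a 60-second scrape interval with every metric forwarded to Falcon‑agent. Neither of those two assumptions is stated in the source.

```python
FALCON_AGENT_LIMIT_PER_MIN = 150_000  # article: data loss above this rate
PROMETHEUS_RATE_PER_SEC = 100_000     # article: Prometheus throughput claim
METRICS_PER_NODE = 10_000             # article: "over 10,000 metrics per node"
SCRAPE_INTERVAL_SEC = 60              # assumption, not stated in the article

# Convert the Falcon-agent limit to the same unit as the Prometheus figure.
falcon_limit_per_sec = FALCON_AGENT_LIMIT_PER_MIN / 60

# How many times more Prometheus can ingest than Falcon-agent can forward.
headroom_ratio = PROMETHEUS_RATE_PER_SEC / falcon_limit_per_sec

# Samples one node produces per minute at the assumed scrape interval.
samples_per_node_per_min = METRICS_PER_NODE * (60 // SCRAPE_INTERVAL_SEC)

# Number of nodes whose full metric set would already hit the limit.
nodes_to_saturate = FALCON_AGENT_LIMIT_PER_MIN / samples_per_node_per_min

print(falcon_limit_per_sec, headroom_ratio, nodes_to_saturate)
```

Under these assumptions the limit corresponds to 2,500 samples per second, about 40 times below the quoted Prometheus throughput, and roughly 15 fully scraped nodes would be enough to reach it.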
Partitioned Monitoring Solution
To overcome these limits, a federated Prometheus architecture was adopted, splitting monitoring into master, slave, and kube‑state Prometheus instances.
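In a federated setup the master does not pull every raw series from the slaves, only those matching a selector. The federate job later in this article uses the selector `{__name__=~"pod:.*|node:.*"}`. The sketch below illustrates which metric names such a filter would let through, assuming the usual Prometheus behavior that label regexes are fully anchored. The metric names in the list are hypothetical examples, not names taken from Xiaomi's rules.

```python
import re

# Regex taken from the federate job's match[] selector in this article.
FEDERATE_NAME_REGEX = "pod:.*|node:.*"


def is_federated(metric_name: str) -> bool:
    """True if the master would pull this series from a slave."""
    # fullmatch mimics anchored matching: the whole name must match.
    return re.fullmatch(FEDERATE_NAME_REGEX, metric_name) is not None


candidate_names = [
    "pod:cpu_usage:rate5m",               # hypothetical aggregated series
    "node:memory_usage:ratio",            # hypothetical aggregated series
    "container_cpu_usage_seconds_total",  # raw cAdvisor series, stays on the slave
    "kube_pod_info",                      # contains "pod" but lacks the "pod:" prefix
]
federated = [name for name in candidate_names if is_federated(name)]
print(federated)
```

Only the two prefixed names pass the filter, which keeps the volume the master has to ingest far smaller than the raw series held by the slaves.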
Two partitioning strategies are used:
Functional partition: Different jobs are assigned to separate Prometheus servers across data centers, with a central master aggregating results.
Horizontal scaling: Large target sets are divided among multiple slave instances using hashmod relabeling.
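The horizontal-scaling idea can be simulated in a few lines. This is a sketch under stated assumptions rather than Prometheus's own code: it assumes an MD5-based hash reduced modulo the number of slaves, so the shard numbers it prints may not match what a real Prometheus server assigns, and the hostnames are invented for the demo.

```python
import hashlib


def hashmod(label_value: str, modulus: int) -> int:
    """Hash a label value and reduce it to a shard number in [0, modulus)."""
    digest = hashlib.md5(label_value.encode("utf-8")).digest()
    # Use the low 64 bits of the digest as an unsigned integer.
    return int.from_bytes(digest[8:], "big") % modulus


def shard_targets(hostnames, modulus: int) -> dict:
    """Group hostnames by the slave that would keep them."""
    shards = {slave_id: [] for slave_id in range(modulus)}
    for hostname in hostnames:
        shards[hashmod(hostname, modulus)].append(hostname)
    return shards


# Hypothetical node names; three slaves share 1,000 nodes.
hostnames = [f"node-{index:04d}" for index in range(1000)]
shards = shard_targets(hostnames, modulus=3)
for slave_id, members in shards.items():
    print(slave_id, len(members))
```

Because the shard depends only on the hostname and the modulus, every slave can compute its own share independently, with no coordination between instances. The trade-off is that changing the modulus reassigns most targets.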
The master Prometheus fetches data from slaves via the /federate endpoint, as shown in the configuration snippet below:
- job_name: federate-slave
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{__name__=~"pod:.*|node:.*"}'
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names:
          - kube-system
  relabel_configs:
    - source_labels:
        - __meta_kubernetes_pod_label_app
      action: keep
      regex: prometheus-slave.*
Slave instances use hashmod to partition node‑level metrics:
- job_name: kubelet
  scheme: https
  kubernetes_sd_configs:
    - role: node
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - source_labels: []
      regex: __meta_kubernetes_node_label_(.+)
      replacement: "$1"
      action: labelmap
    - source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
      modulus: ${modulus}
      target_label: __tmp_hash
      action: hashmod
    - source_labels: [__tmp_hash]
      regex: ${slaveId}
      action: keep
Deployment uses StatefulSets for slaves (allowing per‑pod configuration via indexed ConfigMaps) and Deployments for master and kube‑state instances. Prom‑Reloader continuously watches for configuration changes.
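The `${modulus}` and `${slaveId}` placeholders in the kubelet job have to be filled in per slave. The article does not say how that is done, so the following is a hypothetical sketch: it assumes the slave ID is the ordinal that a StatefulSet appends to each Pod name, and that the modulus equals the number of slave replicas. The pod name `prometheus-slave-2` is invented for the demo.

```python
import re
from string import Template

# Fragment of the kubelet job that differs between slaves.
RELABEL_FRAGMENT = Template("modulus: ${modulus}\nregex: ${slaveId}\n")


def ordinal_from_pod_name(pod_name: str) -> int:
    """StatefulSet Pods are named <statefulset-name>-<ordinal>."""
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"no StatefulSet ordinal in pod name: {pod_name!r}")
    return int(match.group(1))


def render_fragment(pod_name: str, replicas: int) -> str:
    """Fill in the placeholders for one slave Pod."""
    slave_id = ordinal_from_pod_name(pod_name)
    if slave_id >= replicas:
        raise ValueError("ordinal must be smaller than the replica count")
    return RELABEL_FRAGMENT.substitute(modulus=replicas, slaveId=slave_id)


print(render_fragment("prometheus-slave-2", replicas=3))
```

With three replicas, ordinals 0 to 2 line up with the possible hashmod results, so each target is kept by exactly one slave.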
Testing and Validation
Functional tests showed that over 95% of time‑series comparisons stayed within 1% error after partitioning. Performance tests on a cluster with 1,000 virtual nodes demonstrated that the master can handle up to 80k pods per minute, while each slave comfortably scrapes over 400 nodes (≈60 pods per node). Remote write latency grows with load, indicating future work on the Remote‑Storage‑Adapter.
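The test figures above translate into a simple sizing rule. The sketch below only rearranges the numbers reported in this article. Treating "over 400 nodes" as a planning limit of exactly 400, and reading the master's 80k figure as a pod count, are simplifying assumptions.

```python
import math

NODES_PER_SLAVE = 400         # article: each slave scrapes "over 400 nodes"
PODS_PER_NODE = 60            # article: about 60 pods per node
TEST_CLUSTER_NODES = 1_000    # article: 1,000 virtual nodes
MASTER_POD_CAPACITY = 80_000  # article: master handles up to 80k pods


def slaves_needed(total_nodes: int, nodes_per_slave: int = NODES_PER_SLAVE) -> int:
    """Smallest number of slaves that keeps every slave within its limit."""
    return math.ceil(total_nodes / nodes_per_slave)


pods_per_slave = NODES_PER_SLAVE * PODS_PER_NODE
slaves_for_test_cluster = slaves_needed(TEST_CLUSTER_NODES)
full_slaves_per_master = MASTER_POD_CAPACITY // pods_per_slave

print(pods_per_slave, slaves_for_test_cluster, full_slaves_per_master)
```

On these numbers a fully loaded slave covers about 24,000 pods, the 1,000-node test cluster needs three slaves, and one master can aggregate roughly three fully loaded slaves before it becomes the next bottleneck.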
Outlook
The partitioned solution is now deployed in several clusters, offering high availability, persistent storage, and dynamic scaling. Future improvements include automatic scaling of monitoring components, performance optimization of kube‑state‑metrics, deployment simplification via prometheus‑operator and Helm, and advanced analytics (e.g., capacity prediction) on the collected metrics.
MaGe Linux Operations
