How Xiaomi Scaled Kubernetes Monitoring with Prometheus and Open‑Falcon
This article describes the challenges Xiaomi's Ocean elastic scheduling platform faced in monitoring massive Kubernetes clusters, its transition from Open‑Falcon to a Prometheus‑based solution with remote storage, the partitioned deployment strategy, performance testing, and future plans for automated scaling and data analytics.
1. When Monitoring Meets K8s
Traditional physical‑host monitoring does not scale to containers: if every container is treated as a host, a single node yields more than 10,000 metrics and the total metric count becomes massive. Xiaomi needed to integrate Kubernetes monitoring into its existing Open‑Falcon alerting system without reinventing the wheel. Compared with host monitoring, Kubernetes adds four challenges:
More dimensions: core services (API server, Etcd), container, pod, namespace, etc.
Dynamic targets: containers are created and destroyed frequently.
Metric explosion: the number of metrics grows with cluster size.
Dynamic scaling: the monitoring system must expand and shrink with the cluster.
The elastic scheduling platform (Ocean) runs multiple Kubernetes clusters, including fused‑cloud clusters, Ocean clusters, and CloudML clusters, totaling more than a thousand machines.
2. Monitoring Solutions and Evolution
Initial Solution
The first implementation used Open‑Falcon together with custom exporters to collect core metrics (CPU, memory, network) from pods, nodes, and API server components. The architecture relied on cAdvisor‑exporter, kube‑state‑exporter, and Falcon‑agent.
Switch to Prometheus
Due to Open‑Falcon’s lack of native Kubernetes support and limited aggregation capabilities, Xiaomi evaluated industry solutions and chose Prometheus for its native K8s service discovery, high performance (up to 100 k metrics per second), and powerful PromQL query language.
Key components of the Prometheus‑based monitoring stack:
Data sources: node‑exporter, kube‑state‑metrics, cAdvisor, API Server, Etcd, Scheduler, GPU, and custom metrics via prometheus.io/scrape: "true" annotations.
Prometheus deployment as a pod containing Prometheus and prom‑reloader for hot‑reloading config maps.
Configuration stored in ConfigMaps; prom‑reloader watches for changes and reloads without restart.
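The annotation-based discovery of custom metrics can be illustrated with a small Python sketch. It assumes the common convention in which Prometheus exposes each pod annotation as a `__meta_kubernetes_pod_annotation_<name>` label (with unsupported characters replaced by underscores) and a `keep` relabel rule matches the value against `true`. The function names are hypothetical and this is not Xiaomi's actual configuration.

```python
import re


def meta_label_for_annotation(annotation_key: str) -> str:
    """Map a pod annotation key to the meta label name used during relabeling.

    Characters that are not valid in a label name are replaced by underscores.
    """
    sanitized = re.sub(r"[^a-zA-Z0-9_]", "_", annotation_key)
    return "__meta_kubernetes_pod_annotation_" + sanitized


def should_scrape(pod_annotations: dict) -> bool:
    """Mimic a `keep` rule on the prometheus.io/scrape annotation."""
    labels = {meta_label_for_annotation(k): v for k, v in pod_annotations.items()}
    value = labels.get("__meta_kubernetes_pod_annotation_prometheus_io_scrape", "")
    # Relabel regexes are fully anchored, so the whole value must equal "true".
    return re.fullmatch(r"true", value) is not None
```

A pod without the annotation, or with any value other than `true`, is dropped from the target list before scraping.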
Remote Storage Integration
Prometheus keeps data in its local TSDB, organized into 2‑hour blocks, which suits only short‑term retention. For long‑term storage, Xiaomi built an OpenTSDB adapter that writes samples to OpenTSDB (HBase‑backed) through Prometheus’s remote_write interface, plus a corresponding remote_read adapter for queries.
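What such an adapter has to do with each sample can be sketched in Python. This is a minimal illustration, not Xiaomi's adapter: the function names are hypothetical, the output follows the JSON data-point shape accepted by OpenTSDB's HTTP put endpoint, and replacing unsupported characters with underscores is an assumption made here (recording-rule names such as `pod:...` contain colons, which OpenTSDB metric names do not allow).

```python
import math
import re

# OpenTSDB metric names and tags accept letters, digits, "-", "_", "." and "/".
_INVALID_CHARS = re.compile(r"[^a-zA-Z0-9\-_./]")


def sanitize(text: str) -> str:
    """Replace characters OpenTSDB does not accept (assumed strategy: underscore)."""
    return _INVALID_CHARS.sub("_", text)


def to_opentsdb_point(labels: dict, value: float, timestamp_ms: int):
    """Convert one Prometheus sample into an OpenTSDB data point dict.

    Returns None for NaN or infinite values, which this sketch simply skips.
    """
    if math.isnan(value) or math.isinf(value):
        return None
    tags = {sanitize(k): sanitize(v) for k, v in labels.items() if k != "__name__"}
    return {
        "metric": sanitize(labels["__name__"]),
        "timestamp": timestamp_ms // 1000,  # Prometheus uses milliseconds; seconds here
        "value": value,
        "tags": tags,
    }
```

The sketch does not handle samples that carry no labels besides the metric name, nor empty label values; a real adapter would need a policy for both.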
Partitioned Monitoring Architecture
To handle large‑scale clusters, Xiaomi introduced a federation‑based partitioning scheme:
Functional partitioning: Deploy multiple Prometheus servers (master and slaves) per data center, each responsible for a subset of jobs.
Horizontal scaling: Split high‑cardinality targets across slaves using hashmod relabeling.
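The idea behind hashmod partitioning can be sketched in Python. To the best of my understanding, Prometheus hashes the source label value with MD5 and reduces the last 8 bytes of the digest modulo `modulus`; the sketch mirrors that, but the exact hash should be treated as an assumption. The function names and host names are hypothetical.

```python
import hashlib


def hashmod(value: str, modulus: int) -> int:
    """Hash a label value into one of `modulus` buckets."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Interpret the last 8 bytes of the digest as a big-endian unsigned integer.
    return int.from_bytes(digest[8:], "big") % modulus


def targets_for_slave(hostnames, modulus: int, slave_id: int):
    """Return the hosts whose bucket equals this slave's id (the `keep` step)."""
    return [h for h in hostnames if hashmod(h, modulus) == slave_id]
```

Because every host maps to exactly one bucket, the slaves' target sets are disjoint and together cover all nodes. Changing `modulus` reshuffles most hosts between slaves.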
Federation configuration (master scraping slaves):
- job_name: federate-slave
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{__name__=~"pod:.*|node:.*"}'
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names:
          - kube-system
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      action: keep
      regex: prometheus-slave.*
Slave configuration using hashmod to partition node‑exporter targets:
- job_name: kubelet
  scheme: https
  kubernetes_sd_configs:
    - role: node
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    - source_labels: []
      regex: '__meta_kubernetes_node_label_(.+)'
      replacement: "$1"
      action: labelmap
    - source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
      modulus: ${modulus}
      target_label: __tmp_hash
      action: hashmod
    - source_labels: [__tmp_hash]
      regex: ${slaveId}
      action: keep
Deployment Strategy
All Prometheus components (master, kube‑state, slaves) are deployed via Deployment or StatefulSet. Slave instances use a StatefulSet so each pod gets a deterministic name (slave‑0, slave‑1). prom‑reloader watches ConfigMaps and injects the pod name to generate per‑slave configuration.
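How a per-slave configuration might be derived from the pod name can be sketched as follows. The article does not show prom‑reloader's implementation, so this is only an illustration under assumptions: the template fragment reuses the `${modulus}` and `${slaveId}` placeholders from the slave configuration, the StatefulSet ordinal is taken from the suffix of the pod name, and the function names and the pod name `prometheus-slave-1` are hypothetical.

```python
from string import Template

# Fragment of the slave relabel rules with the two placeholders to fill in.
SLAVE_RELABEL_TEMPLATE = Template("""\
- source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
  modulus: ${modulus}
  target_label: __tmp_hash
  action: hashmod
- source_labels: [__tmp_hash]
  regex: ${slaveId}
  action: keep
""")


def slave_ordinal(pod_name: str) -> int:
    """Extract the StatefulSet ordinal, e.g. 'prometheus-slave-1' -> 1."""
    return int(pod_name.rsplit("-", 1)[1])


def render_slave_config(pod_name: str, replicas: int) -> str:
    """Fill in the modulus (number of slaves) and this slave's id."""
    return SLAVE_RELABEL_TEMPLATE.substitute(
        modulus=replicas, slaveId=slave_ordinal(pod_name)
    )
```

For example, `render_slave_config("prometheus-slave-1", 3)` produces rules that keep only the nodes hashed into bucket 1 of 3. Deterministic pod names are what make this mapping stable across restarts.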
Testing and Validation
Functional tests compared aggregated metric values before and after partitioning: for more than 95 % of time series, the two values differed by less than 1 %.
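A check of this kind can be sketched in Python. The article does not describe the actual test harness, so the data layout (a mapping from series name to aggregated value) and the function names are assumptions for illustration.

```python
def relative_diff(before: float, after: float) -> float:
    """Relative difference of `after` with respect to `before`."""
    if before == 0:
        return 0.0 if after == 0 else float("inf")
    return abs(after - before) / abs(before)


def fraction_within(before: dict, after: dict, tolerance: float = 0.01) -> float:
    """Fraction of series present in both runs whose values agree within tolerance."""
    common = before.keys() & after.keys()
    if not common:
        return 0.0
    matching = sum(
        1 for name in common if relative_diff(before[name], after[name]) <= tolerance
    )
    return matching / len(common)
```

Under this sketch, the reported result corresponds to `fraction_within(before, after)` exceeding 0.95. Series that exist in only one of the two runs are ignored here and would need separate checking.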
Performance tests on a cluster with 1 000 virtual nodes demonstrated:
Master Prometheus can scrape up to 80 k pods within a 1‑minute interval; bottleneck is kube‑state‑metrics scaling.
Slave Prometheus instances handle >400 nodes (≈60 pods/node) each; remote write adds modest overhead.
Overall, the partitioned architecture supports up to 80 k pods, meeting projected growth.
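The figures above allow a rough capacity estimate. The following is a back-of-the-envelope sketch only: it treats the reported ">400 nodes" per slave as exactly 400 and assumes 60 pods per node throughout, so the result is a conservative count rather than a measured one. The function name is hypothetical.

```python
import math


def slaves_needed(total_pods: int, pods_per_node: int = 60, nodes_per_slave: int = 400) -> int:
    """Estimate how many slave Prometheus instances a given pod count requires."""
    nodes = math.ceil(total_pods / pods_per_node)
    return math.ceil(nodes / nodes_per_slave)


# Pods one slave covers under these assumptions: 400 nodes * 60 pods/node.
PODS_PER_SLAVE = 400 * 60
```

With these assumptions one slave covers 24,000 pods, the 1 000‑node test cluster (60,000 pods) needs 3 slaves, and the 80 k‑pod target needs 4.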
3. Outlook
Future work includes automatic scaling of the monitoring stack, performance optimisation of kube‑state‑metrics, tighter integration with prometheus‑operator and Helm for simplified deployment, and applying advanced analytics to monitoring data for capacity‑prediction and proactive scaling.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.