How Xiaomi Scaled Kubernetes Monitoring with Prometheus and Open‑Falcon

This article describes the challenges Xiaomi's Ocean elastic scheduling platform faced in monitoring large Kubernetes clusters, its transition from Open‑Falcon to a Prometheus‑based solution with remote storage, its partitioned deployment strategy, performance testing, and future plans for automated scaling and data analytics.

dbaplus Community

Jul 23, 2019

1. When Monitoring Meets K8s

Traditional physical‑host monitoring does not scale to containers: when each container is treated as a host, a single node produces more than 10,000 metrics and the cluster‑wide total becomes enormous. Xiaomi also needed to integrate Kubernetes monitoring into its existing Open‑Falcon alerting system rather than reinvent the wheel. Compared with host monitoring, Kubernetes monitoring poses several additional challenges:

More dimensions: core services (API server, Etcd), container, pod, namespace, etc.

Dynamic targets: containers are created and destroyed frequently.

Metric explosion: the number of metrics grows with cluster size.

Dynamic scaling: the monitoring system must expand and shrink with the cluster.

The elastic scheduling platform (Ocean) runs multiple Kubernetes clusters, including fused‑cloud clusters, Ocean clusters, and CloudML clusters, totaling more than a thousand machines.

2. Monitoring Solutions and Evolution

Initial Solution

The first implementation used Open‑Falcon together with custom exporters to collect core metrics (CPU, memory, network) from pods, nodes, and API server components. The architecture relied on cAdvisor‑exporter, kube‑state‑exporter, and Falcon‑agent.

Switch to Prometheus

Because Open‑Falcon lacks native Kubernetes support and offers only limited aggregation, Xiaomi evaluated industry solutions and chose Prometheus for its native Kubernetes service discovery, its high performance (up to 100 k metrics per second), and its powerful PromQL query language.

Key components of the Prometheus‑based monitoring stack:

Data sources: node‑exporter, kube‑state‑metrics, cAdvisor, API Server, Etcd, Scheduler, GPU, and custom metrics via prometheus.io/scrape: "true" annotations.

Deployment: Prometheus runs in a pod alongside prom‑reloader, which hot‑reloads the configuration.

Configuration: stored in ConfigMaps; prom‑reloader watches them for changes and reloads Prometheus without a restart.
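The hot‑reload behaviour described above can be sketched as follows. This is only an illustration of the idea, not prom‑reloader's actual code: the `Reloader` class and the `trigger_reload` callback are hypothetical names, and the sketch assumes the reloader polls the rendered configuration text and fingerprints it.

```python
import hashlib
from typing import Callable, Optional


def config_digest(config_text: str) -> str:
    """Return a stable fingerprint of the rendered configuration."""
    return hashlib.sha256(config_text.encode("utf-8")).hexdigest()


class Reloader:
    """Trigger a reload only when the configuration content changes.

    Hypothetical helper for illustration; not prom-reloader's real API.
    """

    def __init__(self, trigger_reload: Callable[[], None]) -> None:
        self._trigger_reload = trigger_reload
        self._last_digest: Optional[str] = None

    def observe(self, config_text: str) -> bool:
        """Call on every poll; returns True if a reload was triggered."""
        digest = config_digest(config_text)
        if digest == self._last_digest:
            return False  # unchanged, nothing to do
        first_seen = self._last_digest is None
        self._last_digest = digest
        if first_seen:
            # Assumed: Prometheus has just started with this config.
            return False
        self._trigger_reload()
        return True
```

In a real deployment, `trigger_reload` would typically send an HTTP POST to Prometheus's `/-/reload` endpoint, which is available when Prometheus is started with `--web.enable-lifecycle`.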

Remote Storage Integration

Prometheus's local storage groups samples into 2‑hour blocks and is intended for short‑term retention. To achieve long‑term storage, Xiaomi built an OpenTSDB adapter that receives samples through Prometheus's remote_write interface and writes them to OpenTSDB (HBase‑backed), together with a corresponding remote_read adapter that serves queries.
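To make the write path more concrete, here is a minimal sketch of the conversion step such an adapter has to perform: turning one Prometheus sample into an OpenTSDB data point in the JSON shape accepted by OpenTSDB's `/api/put` endpoint. It is not Xiaomi's adapter. The function names, the example metric name, and the simple replace‑with‑underscore sanitising rule are assumptions, and decoding the snappy‑compressed protobuf body of a real remote_write request is left out.

```python
import re
from typing import Dict

# OpenTSDB names and tag values allow letters, digits, '-', '_', '.' and '/'.
# Prometheus names may contain ':' (common in recording rules), so anything
# outside the allowed set is replaced. This replacement rule is an assumption.
_INVALID = re.compile(r"[^a-zA-Z0-9\-_./]")


def sanitize(text: str) -> str:
    """Replace characters that OpenTSDB would reject."""
    return _INVALID.sub("_", text)


def to_opentsdb_point(labels: Dict[str, str], value: float, timestamp_ms: int) -> dict:
    """Convert one Prometheus sample into an OpenTSDB /api/put data point."""
    tags = {
        sanitize(name): sanitize(val)
        for name, val in labels.items()
        if name != "__name__" and val  # the metric name is not a tag
    }
    return {
        "metric": sanitize(labels["__name__"]),
        # OpenTSDB accepts epoch timestamps in seconds or milliseconds.
        "timestamp": timestamp_ms,
        "value": value,
        "tags": tags,
    }
```

Note that OpenTSDB expects at least one tag per data point, so a production adapter would also need a rule for samples that carry no labels besides the metric name.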

In the Prometheus configuration, the two adapters are registered under the remote_write and remote_read sections, each pointing at the URL of the corresponding adapter endpoint.

Partitioned Monitoring Architecture

To handle large‑scale clusters, Xiaomi introduced a federation‑based partitioning scheme:

Functional partitioning: Deploy multiple Prometheus servers (master and slaves) per data center, each responsible for a subset of jobs.

Horizontal scaling: Split high‑cardinality targets across slaves using hashmod relabeling.

Federation configuration (master scraping slaves):

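- job_name: federate-slave
  # Keep the labels reported by the slaves instead of overwriting them.
  honor_labels: true
  metrics_path: '/federate'
  params:
    # Pull only series whose names start with "pod:" or "node:".
    'match[]':
      - '{__name__=~"pod:.*|node:.*"}'
  kubernetes_sd_configs:
    - role: pod
      namespaces:
        names:
          - kube-system
  relabel_configs:
    # Scrape only pods whose "app" label marks them as Prometheus slaves.
    - source_labels: [__meta_kubernetes_pod_label_app]
      action: keep
      regex: prometheus-slave.*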
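The effect of the `match[]` selector can be illustrated with a few lines of Python. The sketch below applies the same pattern to a list of metric names to show which series the master would pull from a slave. The `pod:` and `node:` example names are made up for illustration; the only behaviour assumed about Prometheus is that its regex matchers are fully anchored, which `re.fullmatch` reproduces.

```python
import re
from typing import Iterable, List

# Pattern taken from the federation job's match[] selector.
FEDERATE_PATTERN = r"pod:.*|node:.*"


def federated(names: Iterable[str], pattern: str = FEDERATE_PATTERN) -> List[str]:
    """Return the metric names the selector would pull from a slave."""
    compiled = re.compile(pattern)
    # fullmatch mirrors Prometheus's fully anchored regex matching.
    return [name for name in names if compiled.fullmatch(name)]


if __name__ == "__main__":
    sample = [
        "pod:cpu_usage:sum",                  # hypothetical aggregated series
        "node:memory_usage:ratio",            # hypothetical aggregated series
        "container_cpu_usage_seconds_total",  # raw cAdvisor series
        "kube_pod_info",                      # raw kube-state-metrics series
    ]
    print(federated(sample))
```

Only the names prefixed with `pod:` or `node:` pass the selector, so the master receives the pre‑aggregated series while the raw per‑container series stay on the slaves.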

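Slave configuration using hashmod to partition per‑node targets (shown here for the kubelet job):

# ${modulus} and ${slaveId} are placeholders filled in for each slave.
- job_name: kubelet
  scheme: https
  kubernetes_sd_configs:
    - role: node
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
    # Copy every Kubernetes node label onto the target.
    - regex: '__meta_kubernetes_node_label_(.+)'
      replacement: "$1"
      action: labelmap
    # Hash the node hostname into one of ${modulus} buckets.
    - source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
      modulus: ${modulus}
      target_label: __tmp_hash
      action: hashmod
    # Keep only the targets whose bucket matches this slave's ID.
    - source_labels: [__tmp_hash]
      regex: ${slaveId}
      action: keep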
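The hashmod and keep rules together assign every node to exactly one slave. The sketch below reproduces that assignment in Python. It assumes, based on my reading of Prometheus's relabeling code, that hashmod takes the MD5 digest of the source label value, interprets the last 8 bytes as a big‑endian integer, and reduces it modulo `modulus`. The hostnames and the `partition` helper are illustrative.

```python
import hashlib
from typing import Dict, Iterable, List


def hashmod(value: str, modulus: int) -> int:
    """Bucket a label value the way the hashmod relabel action is assumed to."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Only the low 8 bytes of the 16-byte digest are used.
    return int.from_bytes(digest[8:], "big") % modulus


def partition(hostnames: Iterable[str], modulus: int) -> Dict[int, List[str]]:
    """Group hostnames by the slave ID that would keep them."""
    shards: Dict[int, List[str]] = {i: [] for i in range(modulus)}
    for host in hostnames:
        shards[hashmod(host, modulus)].append(host)
    return shards


if __name__ == "__main__":
    hosts = [f"node-{i}" for i in range(1000)]  # made-up hostnames
    for slave_id, members in partition(hosts, 3).items():
        print(slave_id, len(members))
```

Because the bucket depends only on the hostname and the modulus, each slave can evaluate the rule independently without coordination. The trade‑off is that changing the number of slaves reshuffles most nodes across buckets.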

Deployment Strategy

All Prometheus components (master, kube‑state‑metrics, slaves) are deployed as Deployments or StatefulSets. Slave instances use a StatefulSet so that each pod gets a deterministic name (slave‑0, slave‑1, and so on). prom‑reloader watches the ConfigMaps and uses the pod name to generate each slave's configuration.
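One way to picture the per‑slave configuration step is a small template renderer: take the ordinal from the StatefulSet pod name and substitute it, along with the replica count, into the shared template. The sketch below is an assumption about how such a step could work, not prom‑reloader's implementation. The function names and the pod name `prometheus-slave-2` are hypothetical, while `${modulus}` and `${slaveId}` are the placeholders used in the slave configuration.

```python
import re
from string import Template


def slave_ordinal(pod_name: str) -> int:
    """Extract the ordinal that a StatefulSet appends to its pod names."""
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"not a StatefulSet pod name: {pod_name!r}")
    return int(match.group(1))


def render_config(template_text: str, pod_name: str, replicas: int) -> str:
    """Fill ${modulus} and ${slaveId} for the slave running in pod_name."""
    # safe_substitute leaves other '$' sequences, such as the "$1"
    # replacement used by relabel rules, untouched.
    return Template(template_text).safe_substitute(
        modulus=replicas, slaveId=slave_ordinal(pod_name)
    )


if __name__ == "__main__":
    template = 'replacement: "$1"\nmodulus: ${modulus}\nregex: ${slaveId}\n'
    print(render_config(template, "prometheus-slave-2", replicas=3))
```

Using `safe_substitute` rather than `substitute` matters here, because a Prometheus configuration legitimately contains `$1`‑style references that are not template placeholders.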

Testing and Validation

Functional tests compared metric aggregates before and after partitioning: for more than 95 % of time series, the two values differed by less than 1 %.
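A comparison of this kind can be expressed in a few lines. The sketch below computes the share of series whose value after partitioning is within a relative tolerance of the value before. The data layout (a mapping from series identifier to aggregate value) and the handling of zero baselines are assumptions; the source does not describe how the test was implemented.

```python
from typing import Dict


def fraction_within(
    before: Dict[str, float],
    after: Dict[str, float],
    tolerance: float = 0.01,
) -> float:
    """Share of common series whose relative difference is <= tolerance."""
    common = before.keys() & after.keys()
    if not common:
        raise ValueError("no common series to compare")
    matching = 0
    for key in common:
        old, new = before[key], after[key]
        if old == 0:
            # Assumed rule: a zero baseline only matches an exact zero.
            close = new == 0
        else:
            close = abs(new - old) / abs(old) <= tolerance
        if close:
            matching += 1
    return matching / len(common)


if __name__ == "__main__":
    before = {"a": 100.0, "b": 200.0, "c": 0.0, "d": 50.0}
    after = {"a": 100.5, "b": 210.0, "c": 0.0, "d": 50.0}
    print(fraction_within(before, after))
```

Series that exist on only one side are ignored in this sketch. A stricter check would report them separately, since a missing series after partitioning is itself a sign of a sharding problem.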

Performance tests on a cluster with 1 000 virtual nodes demonstrated:

The master Prometheus can scrape metrics for up to 80 k pods within a 1‑minute interval; the bottleneck is the scalability of kube‑state‑metrics.

Each slave Prometheus instance handles more than 400 nodes (≈60 pods per node); remote write adds modest overhead.

Overall, the partitioned architecture supports up to 80 k pods, meeting projected growth.
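These figures allow a rough sizing calculation. The sketch below treats 400 nodes per slave and 60 pods per node as planning constants. Both are taken from the test results above, but using them as fixed capacities is an assumption, and since the tests report more than 400 nodes per slave the result is on the conservative side.

```python
import math


def nodes_for_pods(pods: int, pods_per_node: int = 60) -> int:
    """Number of nodes needed to host the given number of pods."""
    return math.ceil(pods / pods_per_node)


def slaves_needed(nodes: int, nodes_per_slave: int = 400) -> int:
    """Number of slave Prometheus instances needed to cover the nodes."""
    return math.ceil(nodes / nodes_per_slave)


if __name__ == "__main__":
    print(slaves_needed(1000))        # the 1 000-node test cluster
    nodes = nodes_for_pods(80_000)    # the 80 k pod target
    print(nodes, slaves_needed(nodes))
```

Under these assumptions the 1 000‑node test cluster needs three slaves, and the 80 k pod target corresponds to roughly 1 334 nodes and four slaves.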

3. Outlook

Future work includes automatic scaling of the monitoring stack, performance optimisation of kube‑state‑metrics, tighter integration with prometheus‑operator and Helm for simplified deployment, and applying advanced analytics to monitoring data for capacity‑prediction and proactive scaling.
