Building a Scalable, High‑Availability Kubernetes Monitoring System with Prometheus

This article details the design and evolution of a highly available, persistent, and dynamically adjustable Kubernetes monitoring solution at Xiaomi, covering initial Falcon‑based approaches, the transition to Prometheus with remote storage via OpenTSDB, federation‑based partitioning, deployment strategies, performance testing, and future enhancements.

ITPUB

Jun 21, 2019

Background and Challenges

Xiaomi's elastic scheduling platform (Ocean) and container platform rely on Kubernetes (k8s) to provide services. Monitoring k8s clusters at this scale is complex: each container behaves like a host, so a single node can expose more than 10,000 metrics, and pods are created and destroyed continuously.

Key challenges include:

More monitoring dimensions (core services, containers, pods, namespaces, etc.).

Dynamic and volatile monitoring targets.

Explosive growth of metrics with container scale.

Need for dynamic scaling of the monitoring system.

Additional constraints stem from Xiaomi’s internal environment:

Multiple heterogeneous k8s clusters (fusion‑cloud, Ocean, CloudML) with differing deployment, network, and storage models.

Open‑Falcon, the company‑wide monitoring/alerting system, does not support the pull‑based collection model required by k8s and lacks flexible aggregation for hierarchical resources.

Requirement to persist monitoring data using existing databases for long‑term analysis.

Existing Open‑Source Monitoring Options

Two common solutions were evaluated:

Heapster/Metrics‑Server + InfluxDB + Grafana – Simple deployment but limited to node‑level metrics and unsuitable for full‑cluster k8s monitoring.

Exporter + Prometheus + Adapter – Offers multi-dimensional metrics, powerful PromQL queries, and dynamic service discovery, but it can occasionally drop samples and is therefore unsuitable where 100% data accuracy is required.

Initial Falcon‑Based Monitoring

The first implementation leveraged Open‑Falcon and custom exporters to collect core metrics (CPU, memory, network) from pods, nodes, and the cluster. Architecture is shown below:

Data sources included cadvisor‑exporter for container metrics, kube‑state‑exporter for pod‑level metrics, and falcon‑agent for physical node metrics. This solution lacked comprehensive coverage (e.g., apiserver, etcd) and did not provide persistent storage.

Migration to Prometheus

After evaluating these shortcomings, the team selected Prometheus for its native k8s support, high performance (up to 100k metrics per second), and powerful PromQL query language.

Key components:

Data Sources: node-exporter, kube-state-metrics, cadvisor, plus custom metrics exposed via prometheus.io/scrape: "true" annotations.

Prometheus Processing: Deployed as a Pod with Prom-Reloader for hot-reloading configuration stored in a ConfigMap.

Storage Backend: Local TSDB for short-term data and remote write to OpenTSDB (via a Falcon-Adapter) for long-term persistence.
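
The annotation-based discovery mentioned under Data Sources can be sketched as a scrape job like the one below. This is a minimal illustration rather than the configuration used at Xiaomi: the job name and the two convenience labels are assumptions, while the meta label is the one Prometheus derives from the prometheus.io/scrape annotation.

```yaml
# Sketch only: scrape every pod that opts in with the annotation
#   prometheus.io/scrape: "true"
scrape_configs:
  - job_name: annotated-pods          # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Drop every pod that does not carry the opt-in annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Copy namespace and pod name into ordinary labels (assumed convenience).
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

Because the filter runs during service discovery, newly created pods are picked up and deleted pods disappear without any change to the configuration.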

Architecture diagram:

Deployment Model

All monitoring components are deployed as Deployments or DaemonSets to ensure reliability. Configuration files reside in ConfigMaps with automatic reload via Prom‑Reloader. Storage uses both local blocks (2‑hour windows) and remote write to OpenTSDB. Remote read/write interfaces allow Prometheus to interact with any third‑party storage.
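
As a rough sketch of the remote-storage side, the fragment below shows how a remote write endpoint is declared in the Prometheus configuration. The adapter URL, port, and queue values are hypothetical placeholders, not values from the article, and the read section only applies if the adapter in use actually implements a read endpoint.

```yaml
# Sketch only: forward samples to an adapter that writes into OpenTSDB.
remote_write:
  - url: "http://falcon-adapter.kube-system.svc:9201/write"   # hypothetical address
    queue_config:
      capacity: 10000              # illustrative: samples buffered per shard
      max_samples_per_send: 1000   # illustrative: batch size per request

# Optional, and only if the adapter exposes a read API (an assumption here):
# remote_read:
#   - url: "http://falcon-adapter.kube-system.svc:9201/read"
```

Local TSDB blocks keep serving recent queries regardless, so the remote endpoint mainly affects long-term retention.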

Limitations of a Single Prometheus Instance

As cluster size grew, two main issues emerged:

Falcon‑agent and its transfer component became bottlenecks, causing data loss when ingesting >150 000 samples per minute.

Prometheus CPU and memory usage increased sharply; a single instance struggled to scrape all targets, leading to occasional missed metrics.

Performance chart of an online cluster:

Partitioned Monitoring Solution (Federation)

To address scalability, a federated Prometheus architecture was introduced, consisting of a master Prometheus, multiple slave Prometheus instances, and a dedicated kube‑state Prometheus.

Federation allows the master to scrape aggregated data from slaves, reducing load on any single node.

Two partitioning strategies are used:

Functional Partitioning: Different Prometheus instances handle distinct jobs (e.g., per data center).

Horizontal Scaling: The same job is split across multiple slaves using hashmod relabeling.

Configuration example for the master to federate slaves:

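# Master job: pull pre-aggregated series from every slave through /federate.
- job_name: federate-slave
  honor_labels: true          # keep the labels exposed by the slaves
  metrics_path: '/federate'
  params:
    'match[]':
    - '{__name__=~"pod:.*|node:.*"}'   # only series named pod:* or node:*
  kubernetes_sd_configs:
  - role: pod
    namespaces:               # belongs to the SD entry, not to the job
      names:
      - kube-system
  relabel_configs:
  # Keep only pods whose app label identifies a slave Prometheus.
  - source_labels: [__meta_kubernetes_pod_label_app]
    action: keep
    regex: prometheus-slave.*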
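
The match[] selector above only pulls series whose names start with pod: or node:, which suggests that the slaves pre-aggregate raw metrics with recording rules. The article does not list those rules, so the following is a hypothetical sketch: the group name, rule names, and expressions are assumptions, and the pod label may be called pod_name on older cAdvisor versions.

```yaml
# Sketch only: recording rules on a slave whose names match pod:.* and node:.*
groups:
  - name: federation-aggregates       # hypothetical group name
    rules:
      # Per-pod CPU usage rate, summed over the pod's containers.
      - record: pod:cpu_usage_seconds:rate5m
        expr: sum by (namespace, pod) (rate(container_cpu_usage_seconds_total[5m]))
      # Available memory per node, as reported by node-exporter.
      - record: node:memory_available_bytes:sum
        expr: sum by (instance) (node_memory_MemAvailable_bytes)
```

With rules of this shape, the master federates a small number of aggregated series per pod and node instead of every raw container series.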

Slave configuration using hashmod to slice node targets:

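# Slave job: each slave scrapes only its own share of the kubelets.
- job_name: kubelet
  scheme: https
  kubernetes_sd_configs:
  - role: node
  tls_config:
    insecure_skip_verify: true
  relabel_configs:
  # Copy every Kubernetes node label onto the target.
  - source_labels: []
    regex: __meta_kubernetes_node_label_(.+)
    replacement: "$1"
    action: labelmap
  # Hash the node hostname into a bucket in the range 0..modulus-1.
  - source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
    modulus: ${modulus}       # total number of slaves, filled in at deploy time
    target_label: __tmp_hash
    action: hashmod
  # Keep only the targets whose bucket equals this slave's ID.
  - source_labels: [__tmp_hash]
    regex: ${slaveId}         # this slave's index, filled in at deploy time
    action: keep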
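
To make the placeholders concrete, here is how the last two relabel steps might look once rendered for one slave. The values are illustrative assumptions (three slaves, this one with ID 1); the article does not state the actual modulus used in production.

```yaml
# Sketch only: rendered hashmod steps for slave 1 of 3.
relabel_configs:
  - source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
    modulus: 3                 # assumed number of slaves
    target_label: __tmp_hash   # receives hash(hostname) mod 3, i.e. 0, 1, or 2
    action: hashmod
  - source_labels: [__tmp_hash]
    regex: "1"                 # assumed ID of this slave
    action: keep
```

Since hashmod yields values from 0 to modulus-1, slave IDs need to start at 0. Every node falls into exactly one bucket, so each kubelet is scraped by exactly one slave, and changing the modulus reshuffles nodes across all slaves.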

Testing and Performance Validation

Functional tests compared aggregated metrics before and after partitioning: for more than 95% of time series, the two values differed by less than 1%.

Performance tests on a cluster with 1,000 virtual nodes evaluated different load scenarios.

Results indicated that master and kube‑state Prometheus can handle up to 80 k pods per minute, while each slave comfortably scrapes >400 nodes (≈60 pods per node). Remote write to OpenTSDB adds modest overhead.

Future Outlook

The partitioned solution is now deployed in several clusters, offering high availability, persistent storage, and dynamic scaling. Planned improvements include automatic scaling of monitoring components, performance optimization of kube‑state‑metrics, deployment simplification via prometheus‑operator and Helm, and advanced analytics such as capacity‑prediction algorithms.
