Building a Scalable, High‑Availability Kubernetes Monitoring System with Prometheus and OpenTSDB

This article details Xiaomi's end‑to‑end, highly available Kubernetes monitoring solution that combines Prometheus, OpenTSDB, and Falcon to handle massive dynamic metrics, ensure persistent storage, and support seamless scaling across multiple clusters.

MaGe Linux Operations

When Monitoring Meets K8s

Xiaomi's elastic scheduling platform (Ocean) and its container platform both run on Kubernetes, so a robust monitoring system is needed to maintain service quality. Unlike monitoring on traditional physical hosts, every container has to be treated as a monitored host in its own right. This produces more than 10,000 metrics per node and calls for dynamic, scalable monitoring that integrates with the existing Open‑Falcon alarm system.

The monitoring challenges include:

More dimensions: physical host metrics plus core services (apiserver, etcd), container, pod, and namespace metrics.

Dynamic objects: containers are created and destroyed frequently, preventing static pre‑configuration.

Metric explosion: massive metric volume requires efficient processing and visualization.

Dynamic scaling: the monitoring system itself must scale out as the cluster grows.

Additional internal constraints are the diversity of clusters (fusion‑cloud, Ocean, CloudML), Open‑Falcon’s lack of native pull‑based collection, and the requirement for long‑term persistent storage.

Monitoring Solution and Evolution

Initial Solution

The first implementation used Falcon together with custom exporters to collect core metrics such as pod CPU, memory, and network usage. While it provided basic visibility, it missed many components (e.g., apiserver, etcd), lacked persistent storage, and required manual exporter development.

Prometheus‑Based System

After evaluation, Prometheus was chosen for its native Kubernetes support, high performance (up to 100k metrics per second), and powerful PromQL query language. The architecture includes:

Data sources: node‑exporter, kube‑state‑metrics, cAdvisor, and custom metrics exposed via annotations.

Prometheus deployment as a Pod with Prom‑Reloader for hot‑reloading configuration stored in ConfigMaps.

Storage backend: local TSDB for short‑term data and remote write to OpenTSDB (via an adapter) for long‑term persistence.

Alerting and visualization: metrics are forwarded to Open‑Falcon for alarms and dashboards, while OpenTSDB stores historical data.
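
The article does not show how samples are handed to Open‑Falcon. As a minimal sketch, assuming the forwarding goes through the Falcon agent's JSON push interface (conventionally `POST /v1/push` on the local agent), a Prometheus-style sample could be mapped to a push item as follows. The function name, the choice of `instance` as the endpoint label, and the sample values are illustrative assumptions, and nothing is sent over the network here.

```python
import json


def to_falcon_item(name, labels, value, timestamp, step=60, endpoint_label="instance"):
    """Map one Prometheus-style sample to an Open-Falcon push item (illustrative)."""
    # Open-Falcon expects tags as a single "k1=v1,k2=v2" string.
    tags = ",".join(
        f"{key}={val}" for key, val in sorted(labels.items()) if key != endpoint_label
    )
    return {
        "endpoint": labels.get(endpoint_label, "unknown"),
        "metric": name,
        "timestamp": int(timestamp),
        "step": step,                # reporting interval in seconds (assumed 60)
        "value": float(value),
        "counterType": "GAUGE",      # assumption: treat every sample as a gauge
        "tags": tags,
    }


# Hypothetical sample used only for illustration.
sample_labels = {"instance": "node-01", "namespace": "default", "pod": "web-0"}
item = to_falcon_item("pod_cpu_usage", sample_labels, 0.42, 1700000000)
payload = json.dumps([item])  # the agent accepts a JSON array of items
```

Batching many items into one payload, rather than pushing them one by one, is the usual way to reduce load on the agent.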

Prometheus is deployed through Kubernetes workloads (Deployment/DaemonSet), so failed Pods are recreated automatically, and all configuration files are managed via ConfigMaps.

Remote storage is implemented through Prometheus’s remote_write and remote_read interfaces, sending samples to OpenTSDB and reading them back when needed.
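
The adapter's job on the write path is essentially a format conversion. The sketch below shows one plausible mapping from a Prometheus sample to the JSON body that OpenTSDB's `/api/put` endpoint accepts. It is not the adapter used in the article: the sanitizing rule, the fallback tag, and the function names are simplifying assumptions.

```python
import re

# OpenTSDB names and tag values allow letters, digits, '-', '_', '.' and '/'.
_INVALID = re.compile(r"[^a-zA-Z0-9\-_./]")


def sanitize(text):
    """Replace characters OpenTSDB rejects with '_' (simplified, lossy rule)."""
    return _INVALID.sub("_", text)


def to_opentsdb_point(labels, value, timestamp_ms):
    """Convert one Prometheus sample (labels incl. __name__) to an OpenTSDB point."""
    labels = dict(labels)              # do not mutate the caller's dict
    metric = labels.pop("__name__")
    tags = {sanitize(k): sanitize(v) for k, v in labels.items()}
    if not tags:
        # OpenTSDB requires at least one tag; this placeholder is hypothetical.
        tags = {"source": "prometheus"}
    return {
        "metric": sanitize(metric),
        "timestamp": timestamp_ms // 1000,  # Prometheus uses ms, seconds are sent here
        "value": value,
        "tags": tags,
    }


# Hypothetical sample used only for illustration.
point = to_opentsdb_point(
    {"__name__": "pod:cpu_usage", "pod": "web-0", "image": "nginx:1.25"},
    0.5,
    1700000000123,
)
```

Because the replacement is lossy (`pod:cpu_usage` and `pod_cpu_usage` collide), a production adapter needs a reversible escaping scheme if series are to be read back through `remote_read`.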

Issues with a Single‑Instance Setup

As the cluster grew, the single Prometheus instance caused:

High pressure on Falcon‑agent, leading to data loss when traffic exceeded 150,000 samples per minute.

Significant CPU and memory consumption, with occasional scrape failures and performance bottlenecks.

Partitioned Monitoring Solution

To overcome these limits, a federated Prometheus architecture was adopted, splitting monitoring into master, slave, and kube‑state Prometheus instances.

Two partitioning strategies are used:

Functional partition: Different jobs are assigned to separate Prometheus servers across data centers, with a central master aggregating results.

Horizontal scaling: Large target sets are divided among multiple slave instances using hashmod relabeling.

The master Prometheus fetches data from slaves via the /federate endpoint, as shown in the configuration snippet below:

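# Master-side scrape job: pull selected series from every slave via /federate.
- job_name: federate-slave
  honor_labels: true          # keep the labels the slaves already attached
  metrics_path: '/federate'
  params:
    'match[]':
      # only federate series whose names start with pod: or node:
      - '{__name__=~"pod:.*|node:.*"}'
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names:
      - kube-system
  relabel_configs:
  # keep only Pods whose "app" label matches the slave instances
  - source_labels:
    - __meta_kubernetes_pod_label_app
    action: keep
    regex: prometheus-slave.*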
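
To make the federation request concrete, the following sketch builds the URL the master would effectively request from one slave for the job above. The host name and port are hypothetical; only the `/federate` path and the `match[]` selector come from the configuration.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def federate_url(base_url, selectors):
    """Build a /federate URL with one match[] parameter per series selector."""
    query = urlencode([("match[]", selector) for selector in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


# Hypothetical slave address; the selector is the one used in the scrape job.
url = federate_url(
    "http://prometheus-slave-0:9090",
    ['{__name__=~"pod:.*|node:.*"}'],
)

# Round-trip check: decoding the query string yields the original selector.
decoded = parse_qs(urlsplit(url).query)["match[]"]
```

Fetching such a URL by hand is a quick way to check which series a selector actually exposes before wiring it into the master.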

Slave instances use hashmod to partition node‑level metrics:

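# Slave-side scrape job: each slave keeps only its own share of the nodes.
- job_name: kubelet
  scheme: https
  kubernetes_sd_configs:
  - role: node
  tls_config:
    insecure_skip_verify: true   # skips TLS certificate verification for the kubelet
  relabel_configs:
  # copy every Kubernetes node label onto the target
  - source_labels: []
    regex: __meta_kubernetes_node_label_(.+)
    replacement: "$1"
    action: labelmap
  # hash the node hostname into a bucket in [0, modulus)
  - source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
    modulus: ${modulus}          # placeholder, presumably the number of slave instances
    target_label: __tmp_hash
    action: hashmod
  # keep only the targets whose bucket equals this slave's ID
  - source_labels: [__tmp_hash]
    regex: ${slaveId}            # placeholder for this slave's index
    action: keep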
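
The effect of the hashmod and keep rules can be simulated offline. The sketch below assumes Prometheus derives the bucket from the MD5 digest of the label value, using its low 64 bits modulo `modulus`. That detail reflects the relabeling implementation as commonly described and should be verified against the Prometheus version in use. The node names are made up.

```python
import hashlib


def hashmod(value, modulus):
    """Bucket a label value the way the hashmod relabel action is assumed to."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    # Assumption: only the last 8 bytes (low 64 bits, big-endian) are used.
    return int.from_bytes(digest[8:], "big") % modulus


def shard_targets(hostnames, modulus):
    """Group hostnames by the slave ID that would keep them."""
    shards = {slave_id: [] for slave_id in range(modulus)}
    for hostname in hostnames:
        shards[hashmod(hostname, modulus)].append(hostname)
    return shards


# Hypothetical cluster of 1,000 nodes split across 3 slaves.
nodes = [f"node-{i:03d}" for i in range(1000)]
shards = shard_targets(nodes, 3)
for slave_id, members in shards.items():
    print(f"slave {slave_id}: {len(members)} nodes")
```

Every node lands in exactly one bucket, so no target is scraped twice or dropped as long as slave IDs cover 0 through `modulus - 1`. Changing `modulus` reshuffles most nodes, which is worth keeping in mind when adding slaves.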

Deployment uses StatefulSets for slaves (allowing per‑pod configuration via indexed ConfigMaps) and Deployments for master and kube‑state instances. Prom‑Reloader continuously watches for configuration changes.
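
The article does not say how the `${modulus}` and `${slaveId}` placeholders are filled in. One plausible approach, sketched here under that assumption, relies on StatefulSet Pods being named `<name>-<ordinal>`: the ordinal becomes the slave ID and the replica count becomes the modulus. The function names and the Pod name are hypothetical.

```python
import re
from string import Template

# The two relabel rules from the slave job; ${...} matches string.Template syntax.
RELABEL_TEMPLATE = Template("""\
- source_labels: [__meta_kubernetes_node_label_kubernetes_io_hostname]
  modulus: ${modulus}
  target_label: __tmp_hash
  action: hashmod
- source_labels: [__tmp_hash]
  regex: ${slaveId}
  action: keep
""")


def ordinal_from_pod_name(pod_name):
    """Extract the StatefulSet ordinal from a Pod name such as 'prometheus-slave-2'."""
    match = re.search(r"-(\d+)$", pod_name)
    if match is None:
        raise ValueError(f"no ordinal suffix in pod name: {pod_name!r}")
    return int(match.group(1))


def render_relabel_rules(pod_name, replicas):
    """Render the rules for one slave; substitute() fails loudly on a missing key."""
    return RELABEL_TEMPLATE.substitute(
        modulus=replicas,
        slaveId=ordinal_from_pod_name(pod_name),
    )


rendered = render_relabel_rules("prometheus-slave-2", 3)
```

If the whole scrape job were templated this way, the literal `"$1"` in the labelmap rule would have to be written as `"$$1"`, since `string.Template` treats a bare `$1` as an invalid placeholder.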

Testing and Validation

Functional tests showed that over 95% of time‑series comparisons stayed within 1% error after partitioning. Performance tests on a cluster with 1,000 virtual nodes demonstrated that the master can handle up to 80k pods per minute, while each slave comfortably scrapes over 400 nodes (≈60 pods per node). Remote write latency grows with load, indicating future work on the Remote‑Storage‑Adapter.
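
The functional check described above can be expressed as a small comparison routine. This is a sketch of the idea rather than the team's test harness: it takes the latest value of each series from a single-instance baseline and from the partitioned setup, then reports the fraction that agree within a relative tolerance. The series names and values are invented.

```python
def relative_error(baseline, candidate):
    """Relative error of candidate against baseline, handling a zero baseline."""
    if baseline == 0:
        return 0.0 if candidate == 0 else float("inf")
    return abs(candidate - baseline) / abs(baseline)


def fraction_within_tolerance(baseline_series, candidate_series, tolerance=0.01):
    """Fraction of series present in both inputs whose values agree within tolerance."""
    shared = baseline_series.keys() & candidate_series.keys()
    if not shared:
        return 0.0
    within = sum(
        1
        for key in shared
        if relative_error(baseline_series[key], candidate_series[key]) <= tolerance
    )
    return within / len(shared)


# Invented values: three of the four series agree within 1%.
single = {"pod-a": 100.0, "pod-b": 200.0, "pod-c": 50.0, "pod-d": 0.0}
partitioned = {"pod-a": 100.5, "pod-b": 199.0, "pod-c": 52.0, "pod-d": 0.0}
ratio = fraction_within_tolerance(single, partitioned)
```

Small differences are expected even in a correct setup, because the two systems scrape at slightly different instants.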

Outlook

The partitioned solution is now deployed in several clusters, offering high availability, persistent storage, and dynamic scaling. Future improvements include automatic scaling of monitoring components, performance optimization of kube‑state‑metrics, deployment simplification via prometheus‑operator and Helm, and advanced analytics (e.g., capacity prediction) on the collected metrics.
