Designing a Scalable, High‑Availability Monitoring System with Prometheus & Thanos
This article explores the challenges of building a reliable monitoring platform, compares open‑source solutions such as Elasticsearch, Nagios, Zabbix and Prometheus, and details how to achieve high availability and horizontal scaling using Prometheus, Thanos, sharding, remote‑write, and Kubernetes orchestration.
A Casual Look at Monitoring Systems
As time passes, the risk of failures increases and incidents are often unexpected. Manual operations make fault location and handling difficult, and longer outages cause greater loss, so mature teams need a reliable monitoring system.
A complete monitoring system must never fail itself; even if the platform crashes it should still emit alerts. High availability of the monitoring system is a constant goal. Below are the functions a complete monitoring system should consider.
Problems faced by monitoring system design
Monitoring covers many kinds of targets, which we divide into categories: servers, containers, services/applications, network, storage, and middleware. Each category uses its own collectors, following established industry solutions.
Functional questions to consider:
Support tagging of metric sources to clarify business origin.
Support aggregation operations to transform, combine, and analyze metrics.
Alarms, reports, and graphical dashboards.
Persist historical data for traceability.
Usability considerations:
Allow adding/removing and customizing monitoring items.
Support expression‑based calculations.
Automatic discovery of new servers or pods.
Configurable alarm policies with custom thresholds.
Solution selection
Considering the above, which open‑source solutions are suitable? Common choices are Elasticsearch, Nagios, Zabbix, Prometheus. Other solutions are omitted.
Elasticsearch: a real-time distributed search and analytics engine, usually paired with Logstash and Kibana (the ELK stack); it excels at searching document-style logs.
Nagios: advantages include automatic restart of failing servers/applications, log rotation, flexible configuration, distributed monitoring, and diverse alarm settings. Drawbacks are a weak event console, poor plugin usability, limited handling of performance/traffic metrics, no historical data view, and complex configuration.
Zabbix: easy to get started with and powerful, but advanced requirements need extensive customization and secondary development.
Prometheus: satisfies almost all of the requirements above, integrates with Grafana for visualization, uses PromQL for aggregation, supports labels for tagging, and has a large community that provides collectors (exporters) and high-availability solutions.
Overall, Prometheus is the most suitable.
Prometheus and its drawbacks
In the Prometheus architecture, exporters run alongside the monitored targets and expose metrics, and the Prometheus server pulls (scrapes) data from them.
Clients that cannot be scraped directly can push data to the Pushgateway, which Prometheus then scrapes.
Prometheus has built-in service discovery for platforms such as Azure, Consul, and OpenStack, and can attach labels to discovered targets; custom code can extend this.
Prometheus supports alerting through Alertmanager, which can route alerts to receivers such as email or, through a webhook, SMS.
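As an illustration of the webhook path, the sketch below turns a webhook-style JSON payload into short text lines that a hypothetical SMS or email sender could deliver. The function name and the sample values are assumptions made for this example; the fields used (alerts, status, labels, annotations) follow the general shape of Alertmanager's webhook body, but check the Alertmanager documentation for the full schema.

```python
import json


def format_alerts(payload):
    """Return one short, human-readable line per alert in the payload."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} on {instance}: {summary}".format(
                status=alert.get("status", "unknown").upper(),
                name=labels.get("alertname", "unnamed"),
                instance=labels.get("instance", "n/a"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


# Sample body shaped like a webhook notification (all values are made up).
sample = json.loads("""
{"status": "firing",
 "alerts": [{"status": "firing",
             "labels": {"alertname": "HighCPU", "instance": "node-1:9100"},
             "annotations": {"summary": "CPU above 90% for 5m"}}]}
""")
messages = format_alerts(sample)
print(messages[0])
```

A real receiver would expose this behind an HTTP endpoint and hand each line to the messaging provider of choice.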
The main issues are performance under high load and high availability.
Problems of single‑node Prometheus deployment
Prometheus is designed for single‑node deployment; scaling by adding resources helps but common problems remain:
Collection rate can be limited by CPU/network, causing metric loss if the scrape interval is missed.
Query speed suffers for large time ranges, putting pressure on disk.
Single‑node failure results in total service outage.
High‑load solutions for a single node
When load is high, horizontal scaling and load balancing are needed. Prometheus provides grouping capabilities.
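One common way to split scrape targets across several Prometheus instances is to hash each target address and take the result modulo the number of shards, in the spirit of the hashmod relabel action. The sketch below shows that idea only; the function names, the target addresses, and the exact way the MD5 digest is reduced to an integer are assumptions for illustration and are not guaranteed to match Prometheus's own shard numbers.

```python
import hashlib


def shard_for(target, shard_count):
    """Map a target address to a shard index in [0, shard_count)."""
    digest = hashlib.md5(target.encode("utf-8")).digest()
    # Use the last 8 bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[8:], "big") % shard_count


def assign(targets, shard_count):
    """Group targets by the shard that should scrape them."""
    shards = {i: [] for i in range(shard_count)}
    for target in targets:
        shards[shard_for(target, shard_count)].append(target)
    return shards


# Hypothetical node-exporter targets split across three Prometheus shards.
targets = ["10.0.0.%d:9100" % i for i in range(1, 21)]
shards = assign(targets, 3)
for shard, members in shards.items():
    print(shard, len(members))
```

Each instance keeps only the targets whose index matches its own shard number, so every target is scraped by exactly one instance.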
Sharding data across multiple Prometheus instances introduces problems of its own:
Data becomes scattered, making operations difficult.
Switching data sources loses a global view.
One solution is to add a remote write storage layer to aggregate data.
The storage should be a TSDB that scales horizontally and is highly available. Querying it then requires an additional component and may mean giving up the native query language (PromQL); the alternative is to federate the Prometheus nodes.
Automatic scaling can be achieved by monitoring node load, maintaining service start/stop, and updating Prometheus scrape ranges. Kubernetes (k8s) can handle container orchestration, but modifying Prometheus node configuration still requires custom solutions.
Configure pod anti‑affinity for Prometheus.
Write a scheduler that uses the k8s API to detect Prometheus node status.
Use k8s to detect node failures and load, hash‑distribute pressure, and extend Prometheus' service discovery with hostnames.
This approach eliminates the need to edit ConfigMaps because Prometheus updates its scrape range via its API.
Scale Prometheus without ConfigMaps.
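The hash-based distribution a custom scheduler performs could look like the sketch below. It uses rendezvous (highest-random-weight) hashing, which is a technique chosen for this example rather than something the article prescribes: when a Prometheus node disappears, only the targets that node owned are reassigned. The node and target names are hypothetical, and SHA-256 from the standard library stands in for a faster non-cryptographic hash.

```python
import hashlib


def score(node, target):
    """Deterministic pseudo-random weight for a (node, target) pair."""
    digest = hashlib.sha256(("%s|%s" % (node, target)).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def owner(target, nodes):
    """The node with the highest weight for this target scrapes it."""
    return max(nodes, key=lambda node: score(node, target))


def plan(targets, nodes):
    """Compute the full target-to-node assignment."""
    return {target: owner(target, nodes) for target in targets}


targets = ["pod-%d" % i for i in range(100)]
before = plan(targets, ["prometheus-0", "prometheus-1", "prometheus-2"])
# Simulate prometheus-1 failing: recompute with the remaining nodes.
after = plan(targets, ["prometheus-0", "prometheus-2"])
moved = [t for t in targets if before[t] != after[t]]
print("targets reassigned:", len(moved))
```

The scheduler would publish the resulting assignment through whatever service-discovery extension each Prometheus node reads.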
For monitoring additional services like Redis, a dedicated Prometheus instance can be created or the same instance can monitor all nodes with deduplication at a higher layer.
Sharding solves pressure but does not solve data aggregation queries or single‑point data loss.
Federated deployment aggregates queries but concentrates load again; redundancy solves single‑point failure but doubles client pull traffic.
How to guarantee no data loss on single‑point failure
To avoid data loss, integrate the high‑availability solution Thanos, make Prometheus stateless, and enable remote write to Thanos.
Because Prometheus no longer keeps long-term data locally, nodes can be added or replaced freely: when load grows, new nodes are started automatically, and load is rebalanced across them during the transition.
Prometheus buffers metrics in memory before sending to remote storage; increasing write throughput requires adjusting queue_config.
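A rough way to reason about queue_config is to estimate an upper bound on remote-write throughput from the number of shards, the batch size, and the time one send takes. The sketch below is a back-of-the-envelope model, assuming each shard sends one batch at a time; max_shards and max_samples_per_send are real queue_config fields, while the function names and the numbers are made up for illustration.

```python
import math


def max_samples_per_second(max_shards, max_samples_per_send, send_latency_s):
    """Upper bound on throughput if every shard sends full batches back to back."""
    return max_shards * max_samples_per_send / send_latency_s


def shards_needed(ingest_rate, max_samples_per_send, send_latency_s):
    """Smallest shard count whose upper bound covers the ingest rate."""
    per_shard = max_samples_per_send / send_latency_s
    return math.ceil(ingest_rate / per_shard)


# Example: batches of 2000 samples, 250 ms per send (both illustrative).
ceiling = max_samples_per_second(50, 2000, 0.25)
needed = shards_needed(100_000, 2000, 0.25)
print(ceiling, needed)
```

If the estimated ceiling is below the actual sample rate, the in-memory queue will keep growing, which is the signal to raise the shard limit or the batch size.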
Thanos is a non‑intrusive HA solution that aggregates, deduplicates, compresses, stores, queries, and alerts on Prometheus data, exposing the same query API as Prometheus.
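Deduplication across redundant Prometheus replicas can be pictured as follows: drop the label that distinguishes the replicas, then merge samples that belong to the same remaining label set. The sketch below is a deliberately simplified model (Thanos's real algorithm is more careful about replica switching and unaligned timestamps); the function name, the replica label name, and the sample data are assumptions.

```python
def dedup(series_list, replica_label="replica"):
    """Merge series that differ only by the replica label."""
    merged = {}
    for labels, samples in series_list:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        bucket = merged.setdefault(key, {})
        for ts, value in samples:
            # The first replica seen for a timestamp wins.
            bucket.setdefault(ts, value)
    return {key: sorted(bucket.items()) for key, bucket in merged.items()}


# Replica "a" missed the scrapes at t=30 and t=45; replica "b" missed t=0.
replica_a = ({"__name__": "up", "instance": "n1", "replica": "a"},
             [(0, 1.0), (15, 1.0)])
replica_b = ({"__name__": "up", "instance": "n1", "replica": "b"},
             [(15, 1.0), (30, 1.0), (45, 1.0)])
result = dedup([replica_a, replica_b])
for key, samples in result.items():
    print(dict(key), samples)
```

The merged series has no gap even though each replica missed some scrapes, which is the point of running redundant collectors.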
Implementing a distributed HA monitoring system
How would you build such a component?
Write the sharded data to shared storage and have the other components communicate with that storage; this is the mainstream Thanos approach.
All components communicate with object storage for data persistence and retrieval.
Use object storage as the storage engine.
Deploy a sidecar with each Prometheus node to periodically push data to object storage.
The Ruler component evaluates alerts and performs aggregation.
The Compact component compacts blocks in object storage and downsamples older data to 5-minute and 1-hour resolutions, kept alongside the raw data.
The Query component talks to the other components over gRPC and reaches object storage through the Store Gateway rather than directly.
Prometheus keeps recent data locally (blocks are cut every 2 h by default), and the sidecar uploads completed blocks to object storage; queries combine local and remote data.
For small clusters without network pressure, the sidecar approach can be used.
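Downsampling of the kind the Compact component performs can be pictured as rolling raw samples into fixed windows and keeping a few aggregates per window. The sketch below is a simplified illustration, not Thanos's actual implementation; the function name, the 300-second window, and the choice of aggregates are assumptions for the example.

```python
from collections import defaultdict


def downsample(samples, window_s=300):
    """Aggregate (timestamp, value) samples into fixed windows."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each sample to the start of its window.
        buckets[ts - ts % window_s].append(value)
    return [
        {"start": start, "count": len(values), "sum": sum(values),
         "min": min(values), "max": max(values)}
        for start, values in sorted(buckets.items())
    ]


# One sample per minute for ten minutes, with values 0.0 through 9.0.
raw = [(t, float(t // 60)) for t in range(0, 600, 60)]
windows = downsample(raw)
for window in windows:
    print(window)
```

Keeping count, sum, min, and max per window lets long-range queries compute averages and extremes without reading every raw sample.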
Do not store data on the collection side
Coupling a sidecar with Prometheus violates container simplicity and increases storage pressure; separating them is advisable.
My idea: collect data, push it, store it, and let other components communicate with storage.
The Receive component implements the remote‑write interface; Prometheus can push data to Receive, which acts like a Prometheus without collection capability, making Prometheus stateless.
Data in object storage is immutable.
Like Prometheus, Receive writes incoming samples to a WAL; after the block duration elapses, blocks are generated and uploaded to object storage.
The Query component reads recent data from Receive and older data from object storage.
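The split between recent and older data can be sketched as a simple planning step over the query's time range. This is a conceptual model only: the function name, the store names, and the 2-hour boundary are assumptions, and the real Query component fans out to its stores and lets each one answer for the range it holds.

```python
def plan_query(start, end, now, local_retention_s=2 * 3600):
    """Split [start, end] into the part served from object storage and from Receive."""
    boundary = now - local_retention_s
    plan = []
    if start < boundary:
        plan.append(("object-storage", start, min(end, boundary)))
    if end > boundary:
        plan.append(("receive", max(start, boundary), end))
    return plan


now = 100_000  # illustrative Unix-style timestamp in seconds
last_six_hours = plan_query(now - 6 * 3600, now, now)
last_hour = plan_query(now - 3600, now, now)
print(last_six_hours)
print(last_hour)
```

A query over the last hour touches only Receive, while a six-hour query is stitched together from both sources.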
Receive uses k8s DNS SRV for service discovery instead of the built‑in load balancer.
Receive hashes incoming traffic and distributes it across nodes; k8s service can automatically round‑robin.
To avoid data loss when a Receive node fails, configure a replication factor; remote‑write retries on 503, and additional replicas handle the load.
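The interplay between hashing and the replication factor can be sketched as below: a series is hashed to a first node, copies go to the following nodes on the ring, and the write counts as successful once a majority of the copies are acknowledged. The majority quorum, the function names, and the node names are assumptions for this illustration, and SHA-256 stands in for whatever hash the real component uses.

```python
import hashlib


def replicas_for(series_key, nodes, replication_factor):
    """Pick the nodes that should hold copies of this series."""
    digest = hashlib.sha256(series_key.encode("utf-8")).digest()
    first = int.from_bytes(digest[:8], "big") % len(nodes)
    return [nodes[(first + i) % len(nodes)] for i in range(replication_factor)]


def write_ok(acks, replication_factor):
    """A write succeeds once a majority of replicas acknowledged it."""
    quorum = replication_factor // 2 + 1
    return acks >= quorum


nodes = ["receive-0", "receive-1", "receive-2"]
chosen = replicas_for('http_requests_total{job="api"}', nodes, 3)
# Simulate receive-1 being down.
healthy = set(nodes) - {"receive-1"}
acks = sum(1 for node in chosen if node in healthy)
print(chosen, acks, write_ok(acks, 3))
```

With a replication factor of 3, losing one node still leaves two acknowledgements, so the write is accepted; if the quorum is not reached, the sender sees an error and retries.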
Business metric calculation issues
Complex business metrics should be collected by custom exporters and processed by the Ruler. If an exporter cannot be written, use a k8s job to push data to PushGateway, then let Prometheus scrape it.
Follow exporter development standards to avoid duplicate metrics. Use PushGateway's delete API to remove stale data.
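Pushing from a job and cleaning up afterwards can be done over the Pushgateway's HTTP API, where a group of metrics lives under a path of the form /metrics/job/<job>/<label>/<value>. The sketch below only builds the requests using the standard library; the host name, job name, and metric are hypothetical, and label values containing a slash would need the Pushgateway's base64 path form, which this example does not handle.

```python
import urllib.request
from urllib.parse import quote


def group_url(base, job, grouping=None):
    """Build the URL that identifies one metric group on the Pushgateway."""
    url = "%s/metrics/job/%s" % (base.rstrip("/"), quote(job, safe=""))
    for name, value in (grouping or {}).items():
        url += "/%s/%s" % (quote(name, safe=""), quote(value, safe=""))
    return url


def push_request(base, job, metrics_text, grouping=None):
    """PUT replaces every metric in the group with the given text."""
    return urllib.request.Request(
        group_url(base, job, grouping),
        data=metrics_text.encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "text/plain; version=0.0.4"},
    )


def delete_request(base, job, grouping=None):
    """DELETE removes the whole group so stale values stop being scraped."""
    return urllib.request.Request(group_url(base, job, grouping), method="DELETE")


base = "http://pushgateway.example:9091"  # hypothetical address
push = push_request(base, "orders_batch", "orders_processed_total 42\n",
                    {"instance": "job-1"})
delete = delete_request(base, "orders_batch", {"instance": "job-1"})
print(push.get_method(), push.full_url)
print(delete.get_method(), delete.full_url)
# To actually send: urllib.request.urlopen(push)
```

A batch job would send the push request when it finishes and a cleanup task would send the delete request once the value is no longer meaningful.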
Dynamic alarm policy updates / alarm record storage
Generate alarm policies via a service that creates a ConfigMap and notifies the Ruler for hot‑reload. Mount the ConfigMap into the Ruler; watch for changes (default Watch strategy).
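The policy-generating service could be sketched as a function that turns alarm policies into a rule file and wraps it in a ConfigMap manifest. Everything named here (the function names, the ConfigMap name, the namespace, the metric in the expression) is hypothetical; the rule layout follows the usual Prometheus-style groups/rules structure, and the file body is emitted as JSON on the assumption that the Ruler's YAML parser accepts it.

```python
import json


def alert_rule(name, expr, duration, severity, summary):
    """Describe one alerting rule in the Prometheus rule-file layout."""
    return {
        "alert": name,
        "expr": expr,
        "for": duration,
        "labels": {"severity": severity},
        "annotations": {"summary": summary},
    }


def rules_configmap(cm_name, namespace, group, rules):
    """Wrap a rule group in a Kubernetes ConfigMap manifest."""
    rule_file = {"groups": [{"name": group, "rules": rules}]}
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": cm_name, "namespace": namespace},
        # JSON is also valid YAML, so the rule file is stored as JSON text.
        "data": {"%s.rules.yaml" % group: json.dumps(rule_file, indent=2)},
    }


rule = alert_rule(
    "OrderFailureRateHigh",
    "rate(orders_failed_total[5m]) > 0.05",  # made-up business metric
    "10m",
    "critical",
    "Order failure rate above threshold",
)
manifest = rules_configmap("ruler-rules", "monitoring", "business", [rule])
print(json.dumps(manifest, indent=2))
```

The service would apply this manifest through the k8s API and then trigger the Ruler's reload so the new thresholds take effect.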
Panorama view
Final notes
A mature monitoring system should also provide operational reports, low‑load reports, and eventually AI‑based fault prediction.
Operational fault reports and resource daily/weekly/monthly reports for trend analysis.
Low‑load reports to analyze server utilization.
Combine AI with fault trends for predictive alerts.
Last remarks
For full‑k8s cluster monitoring, the Prometheus Operator simplifies creation of resources such as Prometheus, ServiceMonitor, AlertManager, etc., turning monitoring into manipulation of k8s objects.
Monitoring can drive horizontal scaling via the Prometheus Adapter, and can also support automated remediation, potentially reducing the need for human operators.
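Metric-driven horizontal scaling usually comes down to a proportional rule: scale the replica count by the ratio of the observed metric to its target. The sketch below shows that rule with clamping to a minimum and maximum; the function name and the numbers are illustrative, and refinements such as tolerance bands and stabilization windows are left out.

```python
import math


def desired_replicas(current_replicas, current_value, target_value,
                     min_replicas=1, max_replicas=10):
    """Proportional scaling rule, clamped to the allowed replica range."""
    raw = math.ceil(current_replicas * current_value / target_value)
    return max(min_replicas, min(max_replicas, raw))


# Four replicas, each target is 100 requests/s per pod (illustrative numbers).
scale_up = desired_replicas(4, 150, 100)
scale_down = desired_replicas(4, 50, 100)
capped = desired_replicas(4, 1000, 100)
print(scale_up, scale_down, capped)
```

The Prometheus Adapter's role in this picture is to expose the observed metric so the autoscaler can apply such a rule.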
MaGe Linux Operations
