Alibaba Cloud Prometheus vs Open‑Source Prometheus: Deep Performance Benchmark
This article benchmarks Alibaba Cloud Prometheus against the open‑source Prometheus across multiple cluster sizes, churn rates, and query patterns, revealing that while the open‑source version remains stable under light load, its CPU and memory usage grow non‑linearly with high cardinality, whereas Alibaba's managed service delivers higher compatibility, better query performance, and more predictable scaling.
Gartner’s 2023 Top Strategic Technology Trends highlighted Application Observability, leading many organizations to adopt Prometheus for metric collection and alerting. While Prometheus appears simple to deploy, its internal architecture faces challenges such as high‑cardinality labels, high churn of time series, and long‑range queries, which can cause storage pressure, increased I/O, and query latency.
Problem Statement
High‑cardinality (many distinct label values) and high churn (frequent creation and deletion of time series) amplify storage and CPU demands, especially in cloud‑native environments where thousands of pods generate millions of series.
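To make the cardinality problem concrete, the sketch below computes an upper bound on the number of series a single metric can produce: the product of the distinct values of each label. The label names and counts are hypothetical and chosen only for illustration; real series counts are usually lower because not every label combination occurs.

```python
from math import prod


def series_upper_bound(label_values: dict[str, int]) -> int:
    """Upper bound on distinct series for one metric.

    Each key is a label name; each value is the number of distinct
    values that label can take. The bound is the product of the counts.
    """
    return prod(label_values.values())


# Hypothetical label counts for one metric (illustration only).
labels = {"namespace": 20, "pod": 500, "container": 3, "status_code": 5}
print(series_upper_bound(labels))
```

Adding one more label with ten distinct values multiplies the bound by ten, which is why a single unbounded label (for example a user ID) can dominate the series count.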
Test Methodology
We compared Alibaba Cloud Prometheus Service with open‑source Prometheus 2.40.1 (latest as of 2022‑12‑20) using the prometheus‑benchmark tool (forked and patched to fix query_range bugs). The benchmark source is available at https://github.com/liushv0/prometheus-benchmark. All tests ran on Alibaba Cloud ECS instances in the Zhangjiakou region.
Test Scenarios
Small cluster: 100 targets (~6.8k samples/sec)
Medium cluster: 500 targets
Large cluster: 2000 targets
High churn rates: 10 %–99 % target turnover every 10 min
Range queries: 1 h, 3 h, 6 h, 24 h
Long‑duration queries: 5‑day and 7‑day spans
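As a rough illustration of what the churn scenarios imply, the sketch below estimates how many distinct series an instance has seen after a given time when a fixed fraction of targets is replaced every 10 minutes. The model and the series-per-target figure are assumptions made for this example; the benchmark tool may generate churn differently.

```python
def cumulative_series(
    targets: int,
    series_per_target: int,
    churn_fraction: float,
    hours: float,
    interval_minutes: int = 10,
) -> int:
    """Estimate distinct series seen after `hours` of target turnover.

    Assumes every replaced target brings a full set of brand-new series
    and that old series are not reused (a simplifying assumption).
    """
    intervals = int(hours * 60 // interval_minutes)
    initial = targets * series_per_target
    new_per_interval = round(targets * churn_fraction) * series_per_target
    return initial + intervals * new_per_interval


# Hypothetical: 100 targets, 1,000 series each, observed for one hour.
for churn in (0.10, 0.99):
    print(churn, cumulative_series(100, 1000, churn, hours=1))
```

Under these assumptions the 99 % scenario creates nearly a full cluster's worth of new series every interval, so the distinct-series total is driven by churn rather than by cluster size.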
Each scenario measured:
Write throughput (KB/s)
Query QPS
95th‑percentile query latency (P95)
Memory and CPU usage
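For readers unfamiliar with the P95 metric, here is a minimal sketch of a nearest-rank percentile over a list of latency samples. The article does not say which percentile method the benchmark tool uses, so treat this as one common definition rather than the tool's implementation.

```python
import math


def percentile_nearest_rank(values: list[float], pct: float) -> float:
    """Return the nearest-rank percentile of `values` (0 < pct <= 100)."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Nearest-rank: the smallest value with at least pct% of samples at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical query latencies in milliseconds.
latencies_ms = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100,
                110, 120, 130, 140, 150, 160, 170, 180, 190, 200]
print(percentile_nearest_rank(latencies_ms, 95))
```

P95 is preferred over the mean here because a few slow range queries can hide behind a healthy average.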
Key Results
Compatibility
Using the Prometheus compliance test suite (https://github.com/prometheus/compliance), Alibaba Cloud Prometheus achieved 97.06 % compatibility, outperforming comparable managed services on AWS and GCP.
Performance – Small Cluster (Alert Queries)
Open‑source Prometheus remained stable for six hours with CPU < 60 % and modest memory usage.
Performance – Small Cluster (Range Queries)
Adding range queries caused CPU to rise to ~60 % within three hours, increased latency, and the open‑source instance hit OOM after eight hours.
High‑Churn Scenarios
Both versions showed linear memory growth; the open‑source binary failed earlier (OOM after ~8 h). Alibaba Cloud Prometheus handled up to ~5 million time series without degradation.
Scaling Tests
Increasing hardware four‑fold (16 C/64 G) improved open‑source capacity only ~2.5×, confirming non‑linear scaling of time‑series workloads.
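The figures above can be turned into a simple scaling-efficiency ratio: capacity gained divided by resources added. The function name is hypothetical; the inputs are the approximate 4× and 2.5× factors reported in the text.

```python
def scaling_efficiency(resource_factor: float, capacity_factor: float) -> float:
    """Fraction of added resources that turned into added capacity.

    1.0 would mean perfectly linear scaling; lower values mean
    diminishing returns from vertical scaling.
    """
    return capacity_factor / resource_factor


# 4x hardware yielded roughly 2.5x capacity in the test above.
print(f"{scaling_efficiency(4.0, 2.5):.1%}")  # prints 62.5%
```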
Long‑Duration Queries
For 5‑day and 7‑day queries, Alibaba Cloud Prometheus responded in ~13 s total (≈5 s per request) versus ~53 s total (≈16‑37 s per request) for open‑source, due to operator push‑down, down‑sampling, and optimized TSDB file handling.
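A long-duration query of this kind could be issued through the standard Prometheus HTTP API endpoint /api/v1/query_range, which takes query, start, end, and step parameters. The sketch below only builds the request URL and counts the points each series would return; the base URL, the time range, and the 5-minute step are assumptions for illustration, not the settings used in the benchmark.

```python
from urllib.parse import urlencode


def query_range_url(base_url: str, query: str, start: int, end: int, step_seconds: int) -> str:
    """Build a URL for the Prometheus /api/v1/query_range endpoint."""
    params = urlencode({"query": query, "start": start, "end": end, "step": step_seconds})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


def points_per_series(start: int, end: int, step_seconds: int) -> int:
    """Number of evaluation points per series for a range query."""
    return (end - start) // step_seconds + 1


# Hypothetical 7-day window (Unix seconds) with a 5-minute step.
start, end, step = 0, 7 * 86400, 300
url = query_range_url(
    "http://localhost:9090",  # assumed endpoint, replace with your own
    "count(kube_pod_container_info{}) by (namespace, node)",
    start, end, step,
)
print(url)
print(points_per_series(start, end, step))
```

The point count grows with the window length and shrinks with the step, which is one reason multi-day queries are much heavier than the 1 h to 24 h range queries.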
Conclusions
Write throughput is rarely the bottleneck; both systems sustain high ingest rates.
CPU consumption is dominated by PromQL evaluation, especially with high cardinality or long‑range queries.
Memory usage grows faster than linearly with the number of time series, limiting scalability of the open‑source version.
Alibaba Cloud Prometheus provides higher compatibility, better query performance, and more predictable scaling in high‑cardinality and high‑churn environments.
Vertical scaling of open‑source Prometheus yields diminishing returns; horizontal scaling or managed services are more cost‑effective for large clusters.
Sample PromQL Queries Used
sum(kube_pod_container_resource_requests_memory_bytes) by (node, namespace)
(sum(kube_pod_container_resource_requests{resource=~"cpu"}) by (node) / sum(kube_node_status_allocatable{resource=~"cpu"}) by (node)) * 100 > 30
count(kube_pod_container_info{}) by (namespace, node)
sum by (node, namespace) (kube_pod_container_resource_requests_memory_bytes)
sum by (node, namespace) (kube_pod_container_resource_requests_cpu_cores)
sum by (node, namespace) (kube_pod_container_resource_limits_cpu_cores)
sum by (node, namespace) (kube_pod_container_resource_limits_memory_bytes)
Alibaba Cloud Native