
Alibaba Cloud Prometheus vs Open‑Source Prometheus: Deep Performance Benchmark

This article benchmarks Alibaba Cloud Prometheus against open‑source Prometheus across multiple cluster sizes, churn rates, and query patterns. Open‑source Prometheus remains stable under light load, but its CPU and memory usage grow non‑linearly as cardinality rises. Alibaba's managed service shows higher compatibility than comparable managed services, better query performance, and more predictable scaling.

Alibaba Cloud Native

Gartner’s 2023 Top Strategic Technology Trends highlighted Applied Observability, leading many organizations to adopt Prometheus for metric collection and alerting. Although Prometheus appears simple to deploy, its internal architecture struggles with high‑cardinality labels, high time‑series churn, and long‑range queries, all of which can cause storage pressure, increased I/O, and higher query latency.

Problem Statement

High‑cardinality (many distinct label values) and high churn (frequent creation and deletion of time series) amplify storage and CPU demands, especially in cloud‑native environments where thousands of pods generate millions of series.
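To make the cardinality problem concrete, the sketch below computes a worst-case series count for a single metric. It relies on the fact that every unique combination of label values is stored as a separate time series. The label names and counts are hypothetical and are not taken from the benchmark.

```python
from math import prod


def series_upper_bound(distinct_values_per_label: dict[str, int]) -> int:
    """Upper bound on the series count of one metric.

    Every unique combination of label values is a separate time series, so the
    worst case is the product of the distinct-value counts of each label.
    """
    return prod(distinct_values_per_label.values())


# Hypothetical label counts for a single per-container metric.
labels = {"namespace": 20, "pod": 500, "container": 3}
print(series_upper_bound(labels))  # 30000

# Adding one more label with 10 distinct values multiplies the bound by 10.
labels["image_tag"] = 10
print(series_upper_bound(labels))  # 300000
```

Real workloads rarely reach the full product, because not every combination occurs. The multiplicative shape still explains why one extra high-cardinality label can dominate storage and CPU cost.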

Test Methodology

We compared Alibaba Cloud Prometheus Service with open‑source Prometheus 2.40.1 (latest as of 2022‑12‑20) using the prometheus‑benchmark tool (forked and patched to fix query_range bugs). The benchmark source is available at https://github.com/liushv0/prometheus-benchmark. All tests ran on Alibaba Cloud ECS instances in the Zhangjiakou region.

Test Scenarios

Small cluster: 100 targets (~6.8k samples/sec)

Medium cluster: 500 targets

Large cluster: 2000 targets

High churn rates: 10 %–99 % target turnover every 10 min

Range queries: 1 h, 3 h, 6 h, 24 h

Long‑duration queries: 5‑day and 7‑day spans
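The scenario figures above can be turned into rough workload estimates. The sketch below extrapolates the ingest rate from the small-cluster figure (100 targets, ~6.8k samples/sec) and converts a churn percentage into targets replaced per hour. Both functions assume that every target behaves like a small-cluster target and that churn is spread evenly. The article does not state either assumption, so treat the outputs as back-of-the-envelope numbers rather than measured results.

```python
SMALL_TARGETS = 100
SMALL_SAMPLES_PER_SEC = 6800  # "~6.8k samples/sec" from the small-cluster scenario


def estimated_ingest_rate(targets: int) -> float:
    """Extrapolated samples/sec, ASSUMING each target emits at the small-cluster rate."""
    per_target = SMALL_SAMPLES_PER_SEC / SMALL_TARGETS
    return targets * per_target


def targets_replaced_per_hour(targets: int, churn_fraction: float,
                              interval_min: int = 10) -> float:
    """Targets replaced per hour when churn_fraction of them turn over every interval."""
    return targets * churn_fraction * (60 / interval_min)


for name, targets in [("medium", 500), ("large", 2000)]:
    print(name, round(estimated_ingest_rate(targets)), "samples/sec (estimate)")

# Large cluster at the two ends of the tested churn range.
print(round(targets_replaced_per_hour(2000, 0.10)))  # about 1200 targets/hour
print(round(targets_replaced_per_hour(2000, 0.99)))  # about 11880 targets/hour
```

Each replaced target brings a fresh set of series, so at 99 % churn the large cluster replaces nearly its entire series population six times per hour.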

Each scenario measured:

Write throughput (KB/s)

Query QPS

95th‑percentile query latency (P95)

Memory and CPU usage
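For readers unfamiliar with the P95 metric, here is a minimal sketch of one common definition, the nearest-rank method. The article does not say which percentile method the benchmark tool uses, so this shows what the number means rather than how the tool computes it.

```python
def p95(latencies_ms: list[float]) -> float:
    """95th-percentile latency using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("at least one latency sample is required")
    ordered = sorted(latencies_ms)
    # 1-based rank = ceil(0.95 * n), computed with integers to avoid float rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Hypothetical latencies: 19 fast queries and one slow outlier.
samples = [12.0] * 19 + [900.0]
print(p95(samples))  # 12.0, the single outlier sits above the 95th percentile
```

P95 is a useful headline figure because it ignores the rarest outliers while still exposing latency that a noticeable share of queries experience.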

Key Results

Compatibility

Using the Prometheus compliance test suite (https://github.com/prometheus/compliance), Alibaba Cloud Prometheus achieved 97.06 % compatibility, outperforming comparable managed services on AWS and GCP.

Performance – Small Cluster (Alert Queries)

Open‑source Prometheus remained stable for six hours with CPU < 60 % and modest memory usage.

Performance – Small Cluster (Range Queries)

After range queries were added, CPU rose to ~60 % within three hours, query latency increased, and the open‑source instance ran out of memory (OOM) after eight hours.

High‑Churn Scenarios

Both versions showed linear memory growth; the open‑source binary failed earlier (OOM after ~8 h). Alibaba Cloud Prometheus handled up to ~5 million time series without degradation.

Scaling Tests

Quadrupling the hardware (to 16 cores / 64 GB of memory) raised open‑source capacity by only ~2.5×, which indicates that capacity scales sub‑linearly with resources for time‑series workloads.
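The arithmetic behind this result can be written out explicitly. The sketch below computes the scaling efficiency (capacity gain divided by resource gain) and, assuming capacity follows a power law in resources, the implied exponent. The power-law model is an assumption made here for illustration; the article reports only the 4× and ~2.5× figures.

```python
import math


def scaling_efficiency(resource_factor: float, capacity_factor: float) -> float:
    """Fraction of the added resources that turned into added capacity."""
    return capacity_factor / resource_factor


def scaling_exponent(resource_factor: float, capacity_factor: float) -> float:
    """Exponent a in capacity ~ resources**a (ASSUMED power-law model)."""
    return math.log(capacity_factor) / math.log(resource_factor)


print(scaling_efficiency(4, 2.5))          # 0.625
print(round(scaling_exponent(4, 2.5), 3))  # 0.661
```

An exponent of 1.0 would mean linear scaling. A value near 0.66 means each doubling of hardware buys roughly 1.58× more capacity under this model, which is consistent with the diminishing returns the test observed.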

Long‑Duration Queries

For 5‑day and 7‑day queries, Alibaba Cloud Prometheus responded in ~13 s total (≈5 s per request) versus ~53 s total (≈16‑37 s per request) for open‑source, due to operator push‑down, down‑sampling, and optimized TSDB file handling.
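Of the three optimizations named above, down-sampling is the easiest to illustrate. The sketch below is a generic fixed-window averaging routine, not Alibaba Cloud's implementation, whose internals the article does not describe. The 15-second scrape interval and 5-minute window are hypothetical values.

```python
def downsample(samples: list[tuple[int, float]],
               window_s: int) -> list[tuple[int, float]]:
    """Average (timestamp_seconds, value) samples into aligned fixed-size windows."""
    buckets: dict[int, list[float]] = {}
    for ts, value in samples:
        window_start = ts - ts % window_s
        buckets.setdefault(window_start, []).append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


SEVEN_DAYS_S = 7 * 24 * 3600
# Hypothetical raw series: one sample every 15 s for seven days.
raw = [(t, float(t)) for t in range(0, SEVEN_DAYS_S, 15)]
coarse = downsample(raw, window_s=300)

print(len(raw))     # 40320 raw points
print(len(coarse))  # 2016 points after 5-minute down-sampling
```

A 7‑day query over the pre-aggregated series touches 20 times fewer points in this example, which is the general reason down-sampled storage speeds up long-range queries.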

Conclusions

Write throughput is rarely the bottleneck; both systems sustain high ingest rates.

CPU consumption is dominated by PromQL evaluation, especially with high cardinality or long‑range queries.

Memory usage grows faster than linearly with the number of time series, limiting scalability of the open‑source version.

Alibaba Cloud Prometheus provides higher compatibility, better query performance, and more predictable scaling in high‑cardinality and high‑churn environments.

Vertical scaling of open‑source Prometheus yields diminishing returns; horizontal scaling or managed services are more cost‑effective for large clusters.

Sample PromQL Queries Used

sum(kube_pod_container_resource_requests_memory_bytes) by (node, namespace)
(sum(kube_pod_container_resource_requests{resource=~"cpu"}) by (node) / sum(kube_node_status_allocatable{resource=~"cpu"}) by (node)) * 100 > 30
count(kube_pod_container_info{}) by (namespace, node)
sum by (node, namespace) (kube_pod_container_resource_requests_memory_bytes)
sum by (node, namespace) (kube_pod_container_resource_requests_cpu_cores)
sum by (node, namespace) (kube_pod_container_resource_limits_cpu_cores)
sum by (node, namespace) (kube_pod_container_resource_limits_memory_bytes)
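As an illustration of how one of the queries above might be submitted as a range query, the sketch below builds a request URL for the Prometheus HTTP API endpoint `/api/v1/query_range` and picks a step that keeps the result under Prometheus's limit of 11,000 points per series. The base URL and timestamps are placeholders, and the benchmark tool's actual request logic may differ.

```python
import math
from urllib.parse import urlencode

MAX_POINTS = 11_000  # Prometheus rejects range queries exceeding this many points per series


def min_step_seconds(span_s: int, max_points: int = MAX_POINTS) -> int:
    """Smallest whole-second step that keeps span_s / step within max_points."""
    return max(1, math.ceil(span_s / max_points))


def query_range_url(base_url: str, query: str, start_s: int, end_s: int, step_s: int) -> str:
    """Build a GET URL for /api/v1/query_range (start and end are Unix seconds)."""
    params = urlencode({"query": query, "start": start_s, "end": end_s, "step": step_s})
    return f"{base_url}/api/v1/query_range?{params}"


query = "sum by (node, namespace) (kube_pod_container_resource_requests_memory_bytes)"
end = 1_671_494_400            # placeholder end timestamp
start = end - 7 * 24 * 3600    # 7-day span, as in the long-duration test
step = min_step_seconds(end - start)

print(step)  # 55
print(query_range_url("http://localhost:9090", query, start, end, step))
```

The function only constructs the URL, so it can be checked without a running server. Sending the request is left to whichever HTTP client the reader prefers.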
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
