
Prometheus vs Cloud Provider Monitoring: Which Is the Most Cost‑Effective Choice for 2025?

This article compares open‑source Prometheus + Grafana with managed cloud monitoring services, evaluating deployment complexity, functionality, scalability, security, and total cost of ownership across small, medium, and large workloads, and provides practical decision‑making guidance for teams of different sizes and requirements.

MaGe Linux Operations

Cloud‑Native Monitoring: Prometheus+Grafana or Cloud Provider Solutions? The Ultimate Cost‑Effective Choice

"The monitoring system was set up in three days, and the cloud bill increased by 50,000 yuan in the first month – is this normal?" This is a common complaint from a technical director last year. As an SRE with ten years of monitoring experience, I have seen many teams stumble when choosing a monitoring solution: some blindly build a Prometheus cluster and lose control of operational costs, some are scared off by the high price of cloud‑provider monitoring, and some skip monitoring altogether and suffer frequent failures. In 2025, cloud‑native monitoring has become a standard part of infrastructure, yet the decision between the open‑source Prometheus + Grafana stack and managed cloud services still confuses many teams. This article, based on real‑world project experience, compares the two approaches across cost, features, and operational complexity to help you find the most suitable monitoring solution and potentially save hundreds of thousands of yuan annually.

Technical Background: Evolution of Cloud‑Native Monitoring and Core Requirements

Why Is Monitoring Critical?

In the cloud‑native era, monitoring is no longer a "nice‑to‑have"; it is the lifeline for system stability and business continuity.

Core Problems Solved by Monitoring :

Real‑time visibility – can you see the health of 100 micro‑services at a glance?

Rapid fault isolation – when an alarm fires at 3 am, can you locate the root cause within five minutes?

Performance optimization – which interfaces are slow, which resources are heavily used?

Capacity planning – what is the current resource utilization and when do we need to scale?

SLA assurance – can we prove 99.9 % availability?

Cost optimization – which services have low utilization and can be trimmed?

2025 industry data shows:

Teams with a complete monitoring system reduce MTTR by 80 %.

Monitoring‑driven performance optimization cuts cloud costs by 30‑50 %.

Poor monitoring can cause cost overruns of 100‑300 %.

Three Generations of Monitoring

First Generation (2005‑2013): Traditional Monitoring

Typical tools: Nagios, Zabbix, Cacti

Characteristics: script‑based active checks, complex configuration, not cloud‑friendly

Problems: cannot handle micro‑services and containers

Second Generation (2013‑2018): Rise of Cloud‑Native Monitoring

Typical tools: Prometheus, InfluxDB, Grafana

Features: pull‑based monitoring, dynamic service discovery, powerful PromQL

Advantages: free, feature‑rich, active community

Third Generation (2018‑present): Managed Monitoring Services

Typical services: AWS CloudWatch, Azure Monitor, Google Cloud Monitoring, Datadog, New Relic, Dynatrace, Alibaba Cloud ARMS, Tencent Cloud Prometheus, Huawei Cloud AOM

Features: fully managed, out‑of‑the‑box, pay‑as‑you‑go

Advantages: no operational overhead, but cost can be high

Two Main Approaches

Prometheus + Grafana (Open‑Source Self‑Built)

Prometheus :

Started at SoundCloud in 2012; the second project to graduate from the CNCF, after Kubernetes

Key features: multi‑dimensional data model, powerful PromQL, pull model, service discovery, alerting rules

Market position: de‑facto standard for cloud‑native monitoring, first choice for Kubernetes

Grafana :

Born in 2014, open‑source visualization platform

Key features: supports many data sources (Prometheus, InfluxDB, Elasticsearch, etc.), powerful dashboards, alert notifications

Market position: No. 1 in visualization, perfect partner for Prometheus

Typical architecture:

Application (Exporter exposes metrics) ← Prometheus (scrape, store, query, evaluate alert rules)
                                          ├── queried by → Grafana (visualization)
                                          └── sends alerts → Alertmanager (alert notification)

Market share: about 60 % in cloud‑native scenarios.

Managed Cloud Provider Monitoring

AWS CloudWatch :

AWS native monitoring, deep integration with AWS services

Features: no installation, automatic metric collection, pay‑per‑use

Alibaba Cloud ARMS / Prometheus :

ARMS: real‑time application monitoring, APM + monitoring

Prometheus: managed Prometheus service, CNCF compatible

Features: deep integration with Alibaba Cloud services, fully managed

Datadog :

Third‑party SaaS monitoring, cloud‑agnostic

Features: full‑stack observability (infrastructure, APM, logs, tracing, RUM, synthetic monitoring, security monitoring)

AI capabilities: Watchdog anomaly detection, automated root‑cause analysis, smart alert noise reduction

Integrations: 500+ integrations (AWS, Azure, GCP, Kubernetes, databases, middleware, etc.)

Comparison for Small‑to‑Medium Teams (10‑200 people)

Key requirements for small‑to‑medium teams:

Quick start – no dedicated SRE, developers must be able to configure quickly

Cost control – budget limited, monitoring cost should not exceed 15 % of server cost

Good enough – basic monitoring + alerts are sufficient

Simple operations – no continuous maintenance manpower

Scalability – should scale smoothly as business grows

These requirements lead to a trade‑off: balance cost, functionality, and operational complexity rather than chasing the most feature‑rich solution.

Core Content: Prometheus + Grafana vs Cloud Provider Solutions – Full Comparison

1. Deployment Complexity and Onboarding Time

Prometheus + Grafana: Requires Technical Skills

Minimal production deployment (single‑node):

# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.48.0
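    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      # rule_files ('alerts/*.yml') is resolved relative to the config file, so mount the rules too
      - ./alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus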
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - 9090:9090
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.2.0
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - 3000:3000
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.26.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    ports:
      - 9093:9093
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:

Prometheus configuration (prometheus.yml):

global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerts/*.yml'

scrape_configs:
  # Scrape Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Scrape Node Exporter (system metrics)
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
        # Labels belong to the static_configs entry, not to the job itself
        labels:
          env: 'production'

  # Scrape Spring Boot applications
  - job_name: 'spring-boot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['app1:8080', 'app2:8080', 'app3:8080']

  # Kubernetes service discovery (if using K8s)
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: ['__meta_kubernetes_pod_annotation_prometheus_io_scrape']
        action: keep
        regex: 'true'
      - source_labels: ['__meta_kubernetes_pod_annotation_prometheus_io_path']
        action: replace
        regex: '(.+)'
        target_label: '__metrics_path__'
      - source_labels: ['__address__', '__meta_kubernetes_pod_annotation_prometheus_io_port']
        action: replace
        regex: '([^:]+)(?::\d+)?;(\d+)'
        replacement: '$1:$2'
        target_label: '__address__'

Alert rules example (alerts/rules.yml):

groups:
  - name: instance
    rules:
      # Instance down alert
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.job }} instance {{ $labels.instance }} has been down for more than 5 minutes"

      # CPU usage > 80 %
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} CPU usage high"
          description: "CPU usage is {{ $value }} %"

      # Memory usage > 85 %
      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Instance {{ $labels.instance }} memory usage high"
          description: "Memory usage is {{ $value }} %"
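The `for:` clause in these rules is what keeps short spikes from paging anyone: the expression has to stay true across evaluations for the whole duration before the alert moves from pending to firing. A minimal Python sketch of that behaviour, assuming one state object per alert instance (the class and method names are hypothetical, not Prometheus internals):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    """Tracks one alert instance across rule evaluations (illustrative only)."""
    for_seconds: float
    active_since: Optional[float] = None  # time the expression first became true

    def evaluate(self, now: float, condition_true: bool) -> str:
        if not condition_true:
            self.active_since = None  # any false evaluation resets the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"


if __name__ == "__main__":
    state = AlertState(for_seconds=300)  # corresponds to 'for: 5m'
    for t, up_is_zero in [(0, True), (150, True), (300, True), (315, False)]:
        print(t, state.evaluate(t, up_is_zero))
```

With a 15 s evaluation interval and `for: 5m`, an instance has to look down for roughly twenty consecutive evaluations before `InstanceDown` fires.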

Grafana datasource configuration (grafana-datasources.yml):

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

Spring Boot application instrumentation:

<!-- pom.xml -->
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
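<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>

# application.yml (Spring Boot 2.x property names; Spring Boot 3.x moves the
# export toggle to management.prometheus.metrics.export.enabled)
management:
  endpoints:
    web:
      exposure:
        include: health,info,metrics,prometheus
  metrics:
    export:
      prometheus:
        enabled: true
    # Common tags are added to every metric the application exposes
    tags:
      application: ${spring.application.name}
      env: production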

Deployment time :

Single‑node: 4‑6 hours (including learning and configuration)

Production HA: 2‑3 days (including Prometheus cluster, Thanos/Cortex, etc.)

Learning cost :

PromQL basics: 3‑5 days

Grafana dashboard creation: 1‑2 days

Alert rule writing: 2‑3 days

Complexity rating : ⭐⭐⭐⭐ (requires solid technical skills)

Managed Cloud Provider Solutions – Ready‑to‑Use, 30‑Minute Start

AWS CloudWatch :

No installation, automatic metric collection for AWS resources

Features: pay‑per‑use, web UI, integration with SNS for notifications

Example: Custom metric from Python

import boto3
from datetime import datetime, timezone

# Uses the default AWS credential chain and region configuration
cloudwatch = boto3.client('cloudwatch')
cloudwatch.put_metric_data(
    Namespace='MyApp',
    MetricData=[
        {
            'MetricName': 'OrderCount',
            'Value': 123,
            'Unit': 'Count',
            # Timezone-aware UTC timestamp; datetime.utcnow() is deprecated since Python 3.12
            'Timestamp': datetime.now(timezone.utc),
            'Dimensions': [
                {'Name': 'Environment', 'Value': 'production'},
                {'Name': 'Service', 'Value': 'order-service'}
            ]
        }
    ]
)

Alibaba Cloud ARMS (zero‑code Java agent example)

# Attach the ARMS agent (downloaded beforehand) when starting the JVM
java -javaagent:/path/to/arms-agent.jar \
    -Darms.licenseKey=your-license-key \
    -Darms.appName=order-service \
    -jar your-app.jar

Configuration via web UI (set thresholds, notification channels, etc.) – takes about 30 minutes.

Deployment time :

Cloud services (EC2/ECS/Lambda): 0 minutes (automatic monitoring)

Self‑built applications: 30 minutes‑1 hour (install agent or SDK)

Learning cost :

Basic usage: 30 minutes‑1 hour

Advanced features: 1‑2 days

Complexity rating : ⭐ (very easy)

2. Feature Completeness Comparison

Prometheus + Grafana – Open‑source, Flexible

Core capabilities :

Metric collection – 200+ exporters (Node, MySQL, Nginx, etc.) and custom client libraries (Go, Java, Python, Node.js)

Storage & query – local TSDB, powerful PromQL, remote storage (Thanos, Cortex, VictoriaMetrics) for long‑term storage and horizontal scaling

Visualization – Grafana dashboards, 10,000+ community templates, variables, annotations, links

Alerting – Prometheus alert rules, Alertmanager with grouping, silencing, inhibition, multiple notification channels (email, Slack, DingTalk, WeChat, PagerDuty, webhook)

Scalability – federation, Thanos, Cortex for multi‑cluster and multi‑tenant setups

PromQL examples :

# 1. Current CPU usage
100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# 2. QPS per service
sum(rate(http_requests_total[5m])) by (service)

# 3. P95 response time
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))

# 4. Predict disk full in 4 hours
predict_linear(node_filesystem_free_bytes[1h], 4 * 3600) < 0

# 5. Error rate > 5 %
(sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))) * 100 > 5
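Example 3 is worth unpacking, because `histogram_quantile` returns an estimate rather than an exact percentile: it finds the bucket that contains the requested rank and interpolates linearly inside it. The Python sketch below reproduces that idea for classic cumulative buckets. It is a simplified illustration, not the Prometheus implementation, and the bucket values are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative buckets [(upper_bound, count), ...].

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the last finite bound.
                return lower_bound
            if count == lower_count:
                return upper_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return math.nan


if __name__ == "__main__":
    # 50 requests <= 0.1 s, 90 <= 0.5 s, 100 <= 1 s (hypothetical data)
    sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

The practical consequence is that P95 accuracy depends on bucket boundaries: with the sample buckets, any P95 between 0.5 s and 1 s is reported as a point on a straight line between those two bounds.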

Missing features compared to commercial solutions :

No out‑of‑the‑box APM (requires Jaeger, Zipkin, etc.)

No built‑in log integration (needs Loki or ELK)

No AI‑driven anomaly detection (requires custom rules)

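Fine‑grained RBAC only in Grafana Enterprise (the open‑source edition offers basic Viewer/Editor/Admin roles plus folder and dashboard permissions)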

No automatic root‑cause analysis

Managed Cloud Provider Solutions – Full‑Stack Observability

AWS CloudWatch :

Metric collection – automatic for AWS services, custom metrics via SDK/API

Storage & query – no storage capacity to manage (metrics are retained for up to 15 months), CloudWatch Metrics Insights (SQL‑like queries)

Visualization – CloudWatch Dashboards (less flexible than Grafana)

Alerting – CloudWatch Alarms, SNS notifications, composite alarms

Advanced features – Logs Insights, ServiceLens (APM), Synthetics, Anomaly Detection (ML)

Alibaba Cloud ARMS :

APM – automatic tracing, service topology, slow SQL analysis, anomaly analysis, code hotspot analysis

Metrics – Prometheus compatible, custom monitoring

Alerting – intelligent baseline alerts, multi‑dimensional alerts, alert deduplication

Integration – tight with ECS, Kubernetes, Log Service (SLS)

Datadog :

Infrastructure monitoring – Prometheus‑like metrics

APM – full‑stack application performance monitoring

Log management – integrated ELK‑style log ingestion and search

Distributed tracing – OpenTelemetry support

RUM – real‑user monitoring for front‑end performance

Synthetic monitoring – proactive checks

Security monitoring – built‑in security signals

AI capabilities – Watchdog anomaly detection, automated root‑cause analysis, smart alert noise reduction

Integrations – 500+ integrations (cloud, containers, databases, middleware, etc.)

3. Cost Comparison – Key Decision Factors

Prometheus + Grafana Cost: Low at start, rises with scale

Hardware / cloud resource cost :

Scenario 1 (small, 10 services, 1 M requests/day):

Prometheus server: 2 CPU 4 GB, 50 GB SSD = 100 CNY/month
Grafana server: 1 CPU 2 GB = 50 CNY/month
Total: 150 CNY/month ≈ 1 800 CNY/year

Scenario 2 (medium, 50 services, 10 M requests/day):

Prometheus server: 4 CPU 8 GB, 200 GB SSD = 300 CNY/month
Grafana server: 2 CPU 4 GB = 100 CNY/month
Alertmanager: 1 CPU 2 GB = 50 CNY/month
Total: 450 CNY/month ≈ 5 400 CNY/year

Scenario 3 (large, 200 services, 100 M requests/day, 1‑year retention):

# Requires Thanos/Cortex for long‑term storage
Prometheus (3 instances, hot data 15 days): 3 × 8 CPU 16 GB, 200 GB SSD = 1 500 CNY/month
Thanos Store (object storage, cold data): 10 TB = 1 500 CNY/month (OSS 0.15 CNY/GB/month)
Thanos Query/Compactor: 4 CPU 8 GB = 300 CNY/month
Grafana: 2 CPU 4 GB = 100 CNY/month
Total: 3 400 CNY/month ≈ 40 800 CNY/year

Human cost (operations, maintenance, upgrades):

Initial setup: 2‑3 days (1 person)

Ongoing maintenance: 10‑20 hours/month (rule tuning, dashboard updates)

Incident handling: 5‑10 hours/month

Upgrade & scaling: 20 hours/year

Assuming an SRE salary of 400 k CNY/year, human cost ≈ 40 k CNY/year.
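To see where a figure of this size comes from, the hour ranges above can be converted into money. The sketch below assumes about 2 000 working hours per year and 8‑hour days; both are assumptions of this example, not figures from the article.

```python
HOURS_PER_DAY = 8
WORKING_HOURS_PER_YEAR = 2_000   # assumption: roughly 250 days x 8 hours
SRE_SALARY_CNY = 400_000         # salary figure used in the article


def first_year_ops_hours(setup_days, maintenance_h_per_month,
                         incident_h_per_month, upgrade_h_per_year):
    """Sum one-off setup time and recurring operations time for the first year."""
    return (setup_days * HOURS_PER_DAY
            + 12 * (maintenance_h_per_month + incident_h_per_month)
            + upgrade_h_per_year)


def ops_cost_cny(hours):
    """Price the hours at the SRE's implied hourly rate."""
    return hours * SRE_SALARY_CNY / WORKING_HOURS_PER_YEAR


low_hours = first_year_ops_hours(2, 10, 5, 20)    # lower end of each range
high_hours = first_year_ops_hours(3, 20, 10, 20)  # upper end of each range

if __name__ == "__main__":
    print(low_hours, ops_cost_cny(low_hours))
    print(high_hours, ops_cost_cny(high_hours))
```

Under these assumptions the first year costs 216 to 404 hours, or roughly 43 000 to 81 000 CNY, so the 40 k estimate sits at the low end of the range.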

Learning cost :

PromQL + Grafana basics: 2 days per person

10‑person team: 20 person‑days ≈ 20 k CNY

Total first‑year cost :

Small: 1 800 + 10 000 + 10 000 = 21 800 CNY

Medium: 5 400 + 40 000 + 20 000 = 65 400 CNY

Large: 40 800 + 80 000 + 30 000 = 150 800 CNY

AWS CloudWatch Cost: Pay‑per‑use, high at scale

Pricing model (2025 US East, USD) :

Standard metrics: first 10 free, then $0.30 per metric/month

Custom metrics: $0.30 per metric/month

High‑resolution metrics: $0.90 per metric/month

API requests: $0.01 per 1 000 calls

Dashboard: $3 per dashboard/month (first 3 free)

Alarms: $0.10 per alarm/month (first 10 free), $0.30 for high‑resolution

Logs: $0.50 per GB ingested, $0.03 per GB stored/month, $0.005 per GB scanned

Scenario 1 (small) – 70 metrics, 20 custom, 20 alarms, 10 GB/day logs (7 days retention):

Metrics: (70‑10) × 0.30 = $18/month
Custom metrics: 20 × 0.30 = $6/month
Alarms: (20‑10) × 0.10 = $1/month
Dashboard: free
Logs: 300 GB ingest × $0.50 = $150/month
Storage: 70 GB × $0.03 = $2/month
Total: $177/month ≈ 1 200 CNY/month ≈ 14 400 CNY/year

Scenario 2 (medium) – 350 metrics, 100 custom, 50 alarms, 5 dashboards, 100 GB/day logs (7 days retention):

Metrics: (350‑10) × 0.30 = $102/month
Custom: 100 × 0.30 = $30/month
Alarms: (50‑10) × 0.10 = $4/month
Dashboard: (5‑3) × $3 = $6/month
Logs: 3 000 GB ingest × $0.50 = $1 500/month
Storage: 700 GB × $0.03 = $21/month
Total: $1 663/month ≈ 11 340 CNY/month ≈ 136 080 CNY/year

Scenario 3 (large) – 2 000 metrics, 200 alarms, 10 dashboards, 1 TB/day logs (30 days retention):

Metrics: 2 000 × 0.30 ≈ $600/month (free tier negligible at this scale)
Alarms: (200‑10) × 0.10 = $19/month
Dashboard: (10‑3) × $3 = $21/month
Logs: 30 000 GB ingest × $0.50 = $15 000/month
Storage: 30 000 GB × $0.03 = $900/month
Total: $16 540/month ≈ 111 300 CNY/month ≈ 1 335 600 CNY/year

Human cost is near zero (fully managed).
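The CloudWatch scenarios all apply the same formula: subtract each free tier, multiply by the unit price, and add log ingestion and storage. A small Python sketch of that calculation, using only the prices listed in the pricing model above (API request and log scan charges are left out, as in the scenarios; the class and function names are hypothetical):

```python
from dataclasses import dataclass

# Unit prices (USD) and free tiers as listed in the pricing model above.
METRIC_PRICE, FREE_METRICS = 0.30, 10
ALARM_PRICE, FREE_ALARMS = 0.10, 10
DASHBOARD_PRICE, FREE_DASHBOARDS = 3.00, 3
LOG_INGEST_PRICE = 0.50    # per GB ingested
LOG_STORAGE_PRICE = 0.03   # per GB stored per month


@dataclass(frozen=True)
class CloudWatchUsage:
    standard_metrics: int
    custom_metrics: int
    alarms: int
    dashboards: int
    log_ingest_gb: float
    log_stored_gb: float


def billable(used: int, free: int) -> int:
    """Units left after the free tier; never negative."""
    return max(used - free, 0)


def monthly_cost_usd(u: CloudWatchUsage) -> float:
    return round(
        billable(u.standard_metrics, FREE_METRICS) * METRIC_PRICE
        + u.custom_metrics * METRIC_PRICE
        + billable(u.alarms, FREE_ALARMS) * ALARM_PRICE
        + billable(u.dashboards, FREE_DASHBOARDS) * DASHBOARD_PRICE
        + u.log_ingest_gb * LOG_INGEST_PRICE
        + u.log_stored_gb * LOG_STORAGE_PRICE,
        2,
    )


if __name__ == "__main__":
    medium = CloudWatchUsage(350, 100, 50, 5, 3_000, 700)
    print(monthly_cost_usd(medium))
```

Running it for the medium scenario makes the cost structure obvious: about $1 500 of the $1 663 comes from log ingestion, so log volume, not metric count, is the lever that matters.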

Alibaba Cloud ARMS Cost: Cheaper than AWS, still higher than self‑built

Pricing model (2025) :

Basic tier – free (limited)

Expert tier:

Application monitoring: ¥5 per million calls (¥0.000005 per call)

Prometheus: ¥0.08 per million samples

Custom monitoring: ¥0.03 per metric per hour

Scenario 1 (small) – 10 services, 1 M requests/day:

App monitoring: 30 M calls/month = ¥150/month
Prometheus samples: 10 M samples/day = 300 M/month → ¥24/month
SMS alerts: 100/month = ¥10/month
Total: ¥184/month ≈ ¥2 208/year

Scenario 2 (medium) – 50 services, 10 M requests/day:

App monitoring: 300 M calls/month = ¥1 500/month
Prometheus samples: 100 M samples/day = 3 000 M/month → ¥240/month
SMS alerts: ¥50/month
Total: ¥1 790/month ≈ ¥21 480/year

Scenario 3 (large) – 200 services, 100 M requests/day:

App monitoring: 3 B calls/month = ¥15 000/month
Prometheus samples: 1 B samples/day = 30 000 M/month → ¥2 400/month
SMS alerts: ¥200/month
Total: ¥17 600/month ≈ ¥211 200/year

Human cost is near zero; learning cost ≈ ¥3 000 for a 10‑person team.

Datadog Cost: Powerful but expensive

Pricing model (2025 US) :

Infrastructure Monitoring: $15 per host/month (annual commitment)

APM: $31 per host/month + $1.27 per million spans

Log Management: $0.10 per GB ingested + $1.27 per million log events

Synthetic Monitoring: $5 per 10 000 tests

Scenario 1 (small) – 10 hosts, 10 M spans/month, 10 GB/day logs:

Infrastructure: 10 × $15 = $150/month
APM: 10 × $31 + 10 M spans × $1.27/1 M = $310 + $13 = $323/month
Logs: 300 GB ingest × $0.10 = $30 + 3 M events × $1.27/1 M = $4 → $34/month
Total: $507/month ≈ 3 400 CNY/month ≈ 40 800 CNY/year

Scenario 2 (medium) – 50 hosts, 100 M spans/month, 100 GB/day logs:

Infrastructure: 50 × $15 = $750/month
APM: 50 × $31 + 100 M spans × $1.27/1 M = $1 550 + $127 = $1 677/month
Logs: 3 000 GB ingest × $0.10 = $300 + 30 M events × $1.27/1 M = $38 → $338/month
Total: $2 765/month ≈ 18 600 CNY/month ≈ 223 200 CNY/year

Scenario 3 (large) – 200 hosts, 1 B spans/month, 1 TB/day logs:

Infrastructure: 200 × $15 = $3 000/month
APM: 200 × $31 + 1 B spans × $1.27/1 M = $6 200 + $1 270 = $7 470/month
Logs: 30 000 GB ingest × $0.10 = $3 000 + 300 M events × $1.27/1 M = $381 → $3 381/month
Total: $13 851/month ≈ 92 730 CNY/month ≈ 1 112 800 CNY/year

Human cost near zero; learning cost ≈ $5 000 for a 10‑person team.

4. High Availability and Scalability Comparison

Prometheus + Grafana HA Options

Single‑node (small scale): simple but single point of failure.

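HA (medium scale): two identical Prometheus instances scraping the same targets, plus two Grafana instances behind a load balancer. This removes the single point of failure, but each Prometheus keeps its own copy of the data, so nothing is shared between them.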

Thanos architecture (large scale) :

# Thanos components:
- Sidecar: runs beside each Prometheus, uploads data to object storage
- Store: reads historical data from object storage
- Query: unified query layer aggregating multiple Prometheus and Store data
- Compactor: compacts and downsamples blocks in object storage and applies retention
- Ruler: evaluates alert rules

Architecture:
Prometheus1 (Sidecar) ──┐
Prometheus2 (Sidecar) ──┼──→ Object storage (S3/OSS) ←── Store
Prometheus3 (Sidecar) ──┘
          ↓
        Query ──→ Grafana
          ↑
        Ruler

Configuration example (Prometheus + Thanos Sidecar):

# prometheus.yml
global:
  external_labels:
    cluster: 'prod-1'
    replica: '0'

# Start Prometheus
prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.min-block-duration=2h \
  --storage.tsdb.max-block-duration=2h \
  --web.enable-lifecycle

# Start Thanos sidecar
thanos sidecar \
  --prometheus.url=http://localhost:9090 \
  --tsdb.path=/prometheus \
  --objstore.config-file=/etc/thanos/bucket.yml \
  --grpc-address=0.0.0.0:10901 \
  --http-address=0.0.0.0:10902

# bucket.yml (example for OSS)
type: S3
config:
  bucket: "thanos"
  endpoint: "oss-cn-hangzhou.aliyuncs.com"
  access_key: "xxx"
  secret_key: "yyy"
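The `replica` external label in the configuration above is what lets the query layer treat several Prometheus instances as one: series that differ only in that label are merged at query time (Thanos Query exposes this through its `--query.replica-label` flag). The Python sketch below shows the idea in its simplest form. Thanos's actual deduplication algorithm is more involved, so treat this as an illustration with hypothetical names and data.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Labels = Dict[str, str]
Sample = Tuple[int, float]  # (unix timestamp, value)


def deduplicate(series: List[Tuple[Labels, List[Sample]]],
                replica_label: str = "replica") -> Dict[tuple, List[Sample]]:
    """Merge series that are identical once the replica label is removed."""
    merged: Dict[tuple, Dict[int, float]] = defaultdict(dict)
    for labels, samples in series:
        key = tuple(sorted((k, v) for k, v in labels.items() if k != replica_label))
        for ts, value in samples:
            # The first replica seen wins for a timestamp; later ones fill gaps.
            merged[key].setdefault(ts, value)
    return {key: sorted(points.items()) for key, points in merged.items()}


if __name__ == "__main__":
    replica_0 = ({"job": "node", "cluster": "prod-1", "replica": "0"},
                 [(0, 1.0), (15, 1.0), (45, 1.0)])           # missed one scrape
    replica_1 = ({"job": "node", "cluster": "prod-1", "replica": "1"},
                 [(0, 1.0), (15, 1.0), (30, 1.0), (45, 1.0)])
    print(deduplicate([replica_0, replica_1]))
```

This is why each replica needs a distinct `replica` value but the same `cluster` value: the former is stripped during deduplication, the latter stays part of the series identity.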

Scalability :

One Prometheus can monitor 1 000+ targets.

Thanos provides near‑infinite scaling with multiple Prometheus instances and object storage.

Cortex offers multi‑tenant, cloud‑native scaling.

Cloud Provider HA – Built‑in

AWS CloudWatch : multi‑AZ HA, automatic scaling, SLA 99.9 %.

Alibaba Cloud ARMS : multi‑AZ HA, automatic scaling, SLA 99.95 %.

Datadog : global multi‑region deployment, SLA 99.9 % (standard) or 99.95 % (enterprise).
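SLA percentages are easier to compare once converted into allowed downtime. A short Python sketch of the conversion, assuming a 30‑day month (the function name is hypothetical):

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime permitted per period at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.9, 99.95):
        print(f"{sla} % -> {allowed_downtime_minutes(sla):.1f} min per 30 days")
```

A 99.9 % SLA allows about 43 minutes of downtime per month and 99.95 % about 22 minutes, which is a useful yardstick when judging whether a self‑hosted setup can realistically match the managed services.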

5. Data Security and Compliance

Prometheus + Grafana

Advantages :

Data fully under your control (private deployment)

Can be placed in an isolated network, no internet exposure

Meets data locality requirements for finance, government, etc.

Disadvantages :

Need to configure TLS for encrypted transport

Need to implement access control (Grafana's built‑in roles cover the basics; fine‑grained RBAC requires Grafana Enterprise)

Need to set up audit logging yourself

Security hardening examples :

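# Prometheus TLS (goes in the web config file passed with --web.config.file, not in prometheus.yml)
tls_server_config:
  cert_file: server.crt
  key_file: server.key

# Grafana LDAP authentication (grafana.ini)
[auth.ldap]
enabled = true
config_file = /etc/grafana/ldap.toml

# Grafana role restrictions (grafana.ini); fine-grained RBAC itself requires Grafana Enterprise
[users]
viewers_can_edit = false
editors_can_admin = false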

Cloud Provider Solutions

Advantages :

Certified (SOC 2, ISO 27001, etc.)

Built‑in encryption at rest and in transit

Built‑in audit logs

Fine‑grained IAM permissions

Disadvantages :

Data resides with the cloud provider (data sovereignty concerns)

Compliance depends on provider certifications

Cross‑border data transfer may be restricted

Conclusion : For industries with strict data sovereignty (finance, government), private Prometheus + Grafana is preferred; otherwise cloud provider solutions are sufficient.

Best Practices: How to Make the Right Choice

Decision Tree – Five Key Questions

Team size and technical ability?

<30 people, no dedicated SRE → Managed cloud solution

30‑100 people, has ops team → Prometheus + Grafana or cloud

100 + people, dedicated SRE → Prometheus + Grafana (lower cost)

Daily request volume?

<5 M/day → Managed cloud (lower total cost)

5‑20 M/day → Depends on budget and skill

>20 M/day → Prometheus + Grafana (cloud becomes expensive)

Budget?

Annual <5 k CNY → Prometheus + Grafana or Alibaba ARMS

5‑30 k CNY → Prometheus + Grafana (medium scale)

>30 k CNY → Consider Datadog or full‑stack solution

Deployment environment?

Public cloud (AWS/Alibaba) → Prefer corresponding cloud monitoring

Hybrid/multi‑cloud → Prometheus + Grafana (cloud‑agnostic)

On‑prem/private cloud → Prometheus + Grafana mandatory

Compliance requirements?

No special requirements → Either option

Data locality required → Private Prometheus + Grafana

Full audit needed → Prometheus + Grafana Enterprise or cloud provider enterprise tier
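The five questions can be folded into a single function, which also forces a decision about which answer wins when they conflict. The Python sketch below is one possible encoding: the thresholds come from the list above, while the evaluation order (hard constraints first, then team capacity, then traffic) and all names are assumptions of this example. Budget is left out because it depends on the cost figures for each tier.

```python
SELF_HOSTED = "Prometheus + Grafana"
MANAGED = "Managed cloud monitoring"
EITHER = "Either: decide on budget and skills"


def recommend(team_size: int, daily_requests_millions: float,
              has_dedicated_sre: bool, environment: str,
              data_locality_required: bool) -> str:
    """environment is one of 'public-cloud', 'hybrid', 'on-prem'."""
    # 1. Hard constraints: compliance and deployment environment.
    if data_locality_required or environment in ("on-prem", "hybrid"):
        return SELF_HOSTED
    # 2. Team capacity: small teams without an SRE cannot absorb the ops work.
    if team_size < 30 and not has_dedicated_sre:
        return MANAGED
    # 3. Traffic volume.
    if daily_requests_millions > 20:
        return SELF_HOSTED
    if daily_requests_millions < 5:
        return MANAGED
    return EITHER


if __name__ == "__main__":
    print(recommend(12, 1, False, "public-cloud", False))
    print(recommend(150, 50, True, "public-cloud", False))
    print(recommend(60, 10, True, "public-cloud", False))
```

Putting compliance and environment first reflects that they are constraints rather than preferences: no cost advantage makes a managed service usable if the data is not allowed to leave the premises.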

Recommendation Matrix

Team Size | Daily Requests | Technical Skill | Budget | Recommended Solution
--------------------------------------------------------------------------
<30       | <5M            | Weak            | <5k    | Alibaba ARMS / CloudWatch
<30       | <5M            | Strong          | <3k    | Prometheus + Grafana
30‑100    | 5‑20M          | Medium          | 5‑15k  | Prometheus + Grafana
30‑100    | 5‑20M          | Weak            | 10‑30k | Alibaba ARMS
100‑300   | 20‑50M         | Strong          | 15‑50k | Prometheus + Grafana + Thanos
100‑300   | 20‑50M         | Medium          | 30‑60k | Hybrid (Prometheus + Cloud APM)
>300      | >50M           | Strong          | >50k   | Hybrid solution
>300      | >50M           | High budget     | >100k  | Datadog (full‑stack)

Common Pitfalls and How to Avoid Them

Myth: Cloud monitoring is always expensive – For small workloads, cloud services can be cheaper because they eliminate human cost.

Myth: Prometheus is always cheaper – Requires operational expertise; for small teams the operational overhead can outweigh raw server cost.

Myth: One solution fits all – Hybrid approaches often give the best balance of cost, flexibility, and reliability.

Myth: APM is mandatory – 80 % of incidents are solved with basic metrics + logs; add APM only when needed.

Myth: Free Prometheus means zero cost – Human cost (maintenance, on‑call) is typically 5‑10× the server cost.

Cost Optimization Tips

Prometheus + Grafana

Adjust retention period (e.g., 30 days) and use Thanos for cheap cold storage.

Lengthen the scrape interval for non‑critical metrics (e.g., 60 s instead of 15 s) so fewer samples are stored.

Filter out unnecessary metrics with metric_relabel_configs.

Consider VictoriaMetrics as a drop‑in replacement for better compression (10× storage savings).
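The effect of these tips can be estimated with the rule of thumb from the Prometheus storage documentation: disk space ≈ retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. A Python sketch, where the series count and the 2‑byte figure are assumptions chosen for illustration:

```python
SECONDS_PER_DAY = 86_400


def estimated_tsdb_bytes(active_series: int, scrape_interval_s: float,
                         retention_days: float,
                         bytes_per_sample: float = 2.0) -> float:
    """Rough local TSDB size: retention x samples per second x bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    return samples_per_second * bytes_per_sample * retention_days * SECONDS_PER_DAY


if __name__ == "__main__":
    for interval in (15, 60):
        size_gb = estimated_tsdb_bytes(100_000, interval, 30) / 1e9
        print(f"{interval:>2} s interval: about {size_gb:.1f} GB for 30 days")
```

For 100 000 active series and 30 days of retention, this gives about 35 GB at a 15 s interval and about 9 GB at 60 s. Dropping unused series with `metric_relabel_configs` reduces the `active_series` term in the same linear way.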

Cloud Provider Optimization

Consolidate custom metrics using dimensions instead of separate metric names.

Use Embedded Metric Format (EMF) to embed metrics in logs (cheaper than PutMetricData).

Sample logs (e.g., keep only 10 % of low‑value logs).

For ARMS, disable APM on non‑critical services and use sampling.
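As an illustration of the EMF tip, the sketch below builds one Embedded Metric Format record in Python. The envelope follows the published EMF structure (`_aws`, `CloudWatchMetrics`, `Namespace`, `Dimensions`, `Metrics`); the helper name is hypothetical, and how the line reaches CloudWatch Logs (for example Lambda standard output or the CloudWatch agent) is assumed rather than shown.

```python
import json
import time
from typing import Dict


def emf_record(namespace: str, metric_name: str, value: float, unit: str,
               dimensions: Dict[str, str]) -> str:
    """Build one EMF log line that CloudWatch can turn into a metric."""
    record = {
        "_aws": {
            "Timestamp": int(time.time() * 1000),  # milliseconds since epoch
            "CloudWatchMetrics": [{
                "Namespace": namespace,
                # One dimension set containing all dimension names
                "Dimensions": [list(dimensions.keys())],
                "Metrics": [{"Name": metric_name, "Unit": unit}],
            }],
        },
        **dimensions,        # dimension values live at the top level
        metric_name: value,  # so does the metric value
    }
    return json.dumps(record)


if __name__ == "__main__":
    print(emf_record("MyApp", "OrderCount", 123, "Count",
                     {"Environment": "production", "Service": "order-service"}))
```

The same dimensions trick from the first tip applies here: one metric name with `Environment` and `Service` dimensions is easier to query than a separate metric name per service.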

Conclusion and Outlook

Key Takeaways

Small teams (<30 people) with limited budget (<5 k CNY) → Managed cloud (Alibaba ARMS, CloudWatch).

Medium‑to‑large teams (>50 people) with high traffic (>10 M requests/day) and solid ops skills → Prometheus + Grafana self‑built.

Large enterprises needing full‑stack observability and ample budget → Hybrid or Datadog.

Prometheus + Grafana = open‑source, flexible, but requires ops effort.

Managed cloud = zero ops, easy start, but cost scales with usage.

Cost tipping point: <5 M requests/day → cloud cheaper; >10 M requests/day → self‑built cheaper.
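The tipping point can be checked against the article's own figures. The Python sketch below compares first‑year cost for self‑built Prometheus + Grafana (infrastructure + human + learning) with Alibaba Cloud ARMS (annual fee plus the ¥3 000 learning cost) for the three tiers. Note that the comparison is not like for like, since the ARMS figures include APM while the self‑built stack does not; the dictionary and function names are hypothetical.

```python
# (infrastructure, human, learning) in CNY, from the self-built cost section
SELF_BUILT = {
    "small": (1_800, 10_000, 10_000),
    "medium": (5_400, 40_000, 20_000),
    "large": (40_800, 80_000, 30_000),
}
# Annual ARMS fee in CNY, from the ARMS scenarios
ARMS_ANNUAL = {"small": 2_208, "medium": 21_480, "large": 211_200}
ARMS_LEARNING = 3_000


def first_year_costs(tier: str) -> tuple:
    """Return (self_built_total, arms_total) in CNY for one tier."""
    return sum(SELF_BUILT[tier]), ARMS_ANNUAL[tier] + ARMS_LEARNING


def cheaper(tier: str) -> str:
    self_built, arms = first_year_costs(tier)
    return "self-built" if self_built < arms else "ARMS"


if __name__ == "__main__":
    for tier in SELF_BUILT:
        print(tier, first_year_costs(tier), cheaper(tier))
```

With these inputs ARMS is cheaper for the small (1 M requests/day) and medium (10 M requests/day) tiers, and self‑built wins at the large tier (100 M requests/day), so against ARMS the crossover lies somewhere between 10 M and 100 M requests per day.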

2025 Trends

OpenTelemetry becomes the universal standard for metrics, logs, and traces.

eBPF‑based monitoring (Pixie, Cilium Hubble) gains traction for low‑overhead, agent‑less data collection.

AI‑driven intelligent monitoring (automatic anomaly detection, root‑cause analysis, predictive alerts).

Observability costs continue to drop thanks to more efficient TSDBs (VictoriaMetrics, M3DB) and smarter sampling.

Observability platforms converge into single‑pane‑of‑glass solutions (Grafana Cloud, Datadog, etc.).

Final Advice for Decision‑Makers

Be pragmatic – choose the solution that meets real needs, not the one with the most buzzwords.

Calculate total cost of ownership (service fees + servers + storage + human ops + training).

Adopt a gradual evolution: start with a managed service, migrate to self‑built as traffic grows, or begin with a single‑node Prometheus and scale to Thanos later.

For large organizations, a hybrid model (open‑source core monitoring + commercial APM/logging) offers the best ROI.

Remember: the best monitoring system is the one that helps you detect problems quickly, stays within budget, and is comfortable for your team to operate.

This article is based on the author’s experience in over 40 companies; all data and case studies are from real projects. Monitoring selection has no silver bullet – feel free to share your own stories in the comments.
