Why Most Monitoring Systems Fail: Lessons from a Veteran Ops Engineer

A seasoned operations professional shares personal experiences and hard‑earned insights on why traditional monitoring often becomes ineffective, how over‑automation and noisy dashboards hurt teams, and what a capability‑focused, user‑centric approach to observability should look like.

dbaplus Community

Jul 4, 2022

Evolution of Monitoring Practices

During two years of on‑call operations and subsequent development of a monitoring platform, the author observed a transition from a simple, effective monitoring setup to an over‑engineered system that generated excessive noise.

1. Early Effective Monitoring

In a monolithic environment the monitoring stack consisted of:

Nagios – basic host and service checks.

Zabbix – introduced automatic discovery and dynamic host registration.

ELK (Elasticsearch‑Logstash‑Kibana) – centralized log collection and search.

Because the number of alerts was low, each alarm could be investigated by hand. The most frequent incident was a traffic surge caused by web crawlers. The response workflow was:

Check metrics – a sudden increase in request latency or CPU usage indicated a possible crawler.

Inspect logs – use Kibana to query the access logs and identify the top‑N source IP ranges.

Block the sources – apply iptables rules to drop traffic from the offending IP blocks.

This three‑step loop kept alerts actionable and the system reliable.
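
The log-inspection and blocking steps above can be sketched in a few lines of Python. This is a minimal illustration, not the author's tooling: it assumes access log lines that begin with the client IPv4 address (as in the common and combined log formats), groups clients by /24 network, and only prints the `iptables` commands instead of running them. The function names, the /24 prefix, and the request threshold are all hypothetical choices.

```python
import ipaddress
import re
from collections import Counter

# Assumption: each access log line starts with the client IPv4 address.
_IP_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")


def top_ip_ranges(log_lines, prefix_len=24, top_n=5):
    """Count requests per IPv4 network and return the top_n busiest networks."""
    counts = Counter()
    for line in log_lines:
        match = _IP_RE.match(line)
        if not match:
            continue  # line does not start with an IPv4-looking token
        try:
            network = ipaddress.ip_network(
                f"{match.group(1)}/{prefix_len}", strict=False
            )
        except ValueError:
            continue  # e.g. "999.1.1.1" is not a valid address
        counts[str(network)] += 1
    return counts.most_common(top_n)


def iptables_drop_commands(ranges, min_requests=1000):
    """Render (but do not execute) DROP rules for ranges above a threshold."""
    return [
        f"iptables -I INPUT -s {cidr} -j DROP"
        for cidr, hits in ranges
        if hits >= min_requests
    ]


if __name__ == "__main__":
    sample = [
        '203.0.113.5 - - [04/Jul/2022:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
        '203.0.113.9 - - [04/Jul/2022:10:00:01 +0000] "GET /a HTTP/1.1" 200 512',
        '198.51.100.7 - - [04/Jul/2022:10:00:02 +0000] "GET /b HTTP/1.1" 200 512',
    ]
    for command in iptables_drop_commands(top_ip_ranges(sample), min_requests=2):
        print(command)
```

Printing the rules rather than applying them keeps a human in the loop, which matches the manual investigation described above; an operator can review the ranges before blocking anything.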

2. Over‑Automation and Dashboard Overload

After moving to platform development, automation was extended to generate monitoring templates, alert rules, and Grafana dashboards for every micro‑service automatically. The consequences were:

Thousands of alerts flooded SMS channels, leading the team to mute notifications and rely on developers to notice problems.

Grafana dashboards displayed dozens of metrics per page, causing UI lag and making it impossible for developers to identify the most relevant charts.

Kibana dashboards suffered the same fate; only a few enthusiasts used them.

Result: monitoring data existed, but its consumption was ineffective.

3. Attempted SLO/SLI Adoption

Inspired by Google SRE, the team introduced:

The four “golden signals” – latency, traffic, errors, and saturation.

SLO/SLI definitions for each service.

Alert severity tiers (P0‑P2).

Rapid micro‑service iteration and an insufficient SRE‑to‑dev ratio caused these indicators and SLO definitions to become stale within weeks. New alerts were added without validation, producing more noise rather than actionable insight.
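
To make the relationship between an SLO, its error budget, and the P0‑P2 tiers concrete, here is a minimal sketch. It assumes a request-based availability SLI; the `Slo` class, the function names, and the 25% / 50% thresholds that map remaining budget to a severity are illustrative assumptions, not values from the original article.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slo:
    name: str
    target: float  # e.g. 0.99 means 99% of requests must succeed

    def __post_init__(self):
        if not 0.0 < self.target < 1.0:
            raise ValueError("target must be strictly between 0 and 1")


def error_budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Fraction of the error budget left: 1.0 is untouched, <= 0 is exhausted."""
    if total == 0:
        return 1.0  # no traffic in the window, so no budget was spent
    allowed_bad = (1.0 - slo.target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


def severity(remaining: float) -> Optional[str]:
    """Map remaining budget to a tier. Thresholds are hypothetical."""
    if remaining <= 0.0:
        return "P0"  # budget exhausted
    if remaining < 0.25:
        return "P1"
    if remaining < 0.5:
        return "P2"
    return None  # healthy: do not alert


if __name__ == "__main__":
    checkout = Slo(name="checkout-availability", target=0.99)
    left = error_budget_remaining(checkout, good=9_920, total=10_000)
    print(f"{checkout.name}: {left:.0%} budget left, severity={severity(left)}")
```

A definition like this only stays useful if someone owns the target and revisits it as the service changes, which is exactly where the team above ran out of capacity.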

4. Identified Biases in Monitoring

Two systemic misconceptions emerged:

Attempting to predict future incidents by extrapolating from isolated past events, ignoring the complex, time‑varying nature of production workloads.

Focusing on refining probes, dashboards, and automation pipelines without questioning whether the underlying signals actually reflect business‑critical health.

5. Human‑in‑the‑Loop vs. Full AIOps

Pure AIOps promises automated root‑cause analysis, but most operations teams lack confidence in machine‑only diagnoses. A pragmatic approach is to let machines perform large‑scale data aggregation and pattern detection, then present concise hypotheses for human operators to validate and act upon.
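
One way to picture this division of labour is the sketch below: the machine scores every metric against its recent baseline and returns only a short, ranked list of hypotheses, leaving the diagnosis to the operator. The z-score method, the threshold of 3.0, and all names are assumptions chosen for illustration; the article does not prescribe a particular detection technique.

```python
from statistics import mean, pstdev


def rank_hypotheses(baseline, current, top_n=3, min_score=3.0):
    """Return up to top_n metrics whose current value deviates most from baseline.

    baseline: dict mapping metric name -> list of recent historical values
    current:  dict mapping metric name -> latest observed value
    """
    hypotheses = []
    for metric, history in baseline.items():
        if metric not in current or len(history) < 2:
            continue
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            continue  # flat baseline: a z-score is undefined, so skip it here
        score = abs(current[metric] - mu) / sigma
        if score >= min_score:
            hypotheses.append(
                {
                    "metric": metric,
                    "z_score": round(score, 1),
                    "summary": f"{metric} is {current[metric]} (baseline ~{mu:.1f})",
                }
            )
    hypotheses.sort(key=lambda h: h["z_score"], reverse=True)
    return hypotheses[:top_n]


if __name__ == "__main__":
    baseline = {"cpu_pct": [50, 52, 48, 50], "latency_ms": [100, 110, 90, 100]}
    current = {"cpu_pct": 51, "latency_ms": 400}
    for hypothesis in rank_hypotheses(baseline, current):
        print(hypothesis["summary"])  # the operator validates or rejects each one
```

The output is deliberately a handful of plain-language statements rather than a verdict, so the operator stays responsible for confirming the cause and choosing the action.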

6. Capability‑Centric Monitoring Architecture

Instead of vertical layers aligned with organizational silos (infrastructure, network, business), the author proposes a horizontal capability model:

Data collection – agents, exporters, or sidecars gather metrics, logs, and traces.

Transport – reliable pipelines (e.g., Prometheus remote‑write, Fluentd, OpenTelemetry) move data to central stores.

Analysis – real‑time aggregation, anomaly detection, and SLO evaluation.

Storage – time‑series databases (Prometheus, VictoriaMetrics) and log stores (Elasticsearch).

Visualization – focused Grafana dashboards that expose only the metrics tied to defined SLOs.

The goal is a Thinnest Viable Platform (TVP) that delivers the minimal set of capabilities required by users, encourages best‑practice sharing, and avoids building unused layers.
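
The capability model can be pictured as a chain of small functions, each owning one capability and none tied to a particular team. The sketch below is a toy, in-memory illustration under that assumption; the `Sample` type, the function names, and the choice of "max per batch" as the analysis step are hypothetical and stand in for real agents, pipelines, and databases.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List, Set, Tuple


@dataclass(frozen=True)
class Sample:
    service: str
    metric: str
    value: float


Key = Tuple[str, str]  # (service, metric)


def collect(collectors: Iterable[Callable[[], List[Sample]]]) -> List[Sample]:
    """Data collection: pull samples from every registered source."""
    return [sample for collector in collectors for sample in collector()]


def transport(samples: List[Sample], batch_size: int = 100) -> Iterator[List[Sample]]:
    """Transport: ship samples downstream in fixed-size batches."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


def analyze(batch: List[Sample]) -> Dict[Key, float]:
    """Analysis: reduce a batch to the worst (max) value per service and metric."""
    worst: Dict[Key, float] = {}
    for sample in batch:
        key = (sample.service, sample.metric)
        worst[key] = max(worst.get(key, sample.value), sample.value)
    return worst


def store(db: Dict[Key, List[float]], aggregates: Dict[Key, float]) -> None:
    """Storage: append each aggregate to an in-memory series."""
    for key, value in aggregates.items():
        db.setdefault(key, []).append(value)


def visualize(db: Dict[Key, List[float]], slo_metrics: Set[str]) -> Dict[Key, float]:
    """Visualization: expose only the latest value of SLO-tied metrics."""
    return {key: series[-1] for key, series in db.items() if key[1] in slo_metrics}


if __name__ == "__main__":
    collectors = [
        lambda: [Sample("api", "latency_ms", 120.0), Sample("api", "gc_pause_ms", 5.0)],
        lambda: [Sample("api", "latency_ms", 300.0)],
    ]
    db: Dict[Key, List[float]] = {}
    for batch in transport(collect(collectors)):
        store(db, analyze(batch))
    print(visualize(db, slo_metrics={"latency_ms"}))
```

Everything is still stored, but the view is filtered to the metrics that back an SLO, which is the point of the visualization capability described above.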

7. Effectiveness Over Feature Richness

Unified monitoring across ops, development, DBA, and cache teams introduces challenges such as real‑time data processing, storage cost, and operational overhead. Over‑engineering generic solutions yields low ROI. The author recommends:

Prioritizing user‑centric, business‑value metrics (e.g., end‑user latency percentiles tied to SLOs).

Continuously validating that each alert or dashboard directly supports a decision point.

Limiting the number of dashboards to those that answer concrete operational questions.
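
The second recommendation, checking that each alert supports a decision, can be turned into a periodic audit. The sketch below assumes the team can export, per alert rule, how often it fired, how often someone acted on it, and whether a runbook exists. The `AlertStats` fields and the 50% action-rate threshold are hypothetical; real alerting systems expose this information in different ways, if at all.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class AlertStats:
    name: str
    fired: int        # times the alert fired in the review window
    acted_on: int     # firings that led to a human action
    has_runbook: bool


def audit_alerts(
    stats: Iterable[AlertStats], min_action_rate: float = 0.5
) -> Dict[str, List[str]]:
    """Return alerts that need review, each with the reasons it was flagged."""
    flagged: Dict[str, List[str]] = {}
    for alert in stats:
        reasons = []
        if not alert.has_runbook:
            reasons.append("no runbook")
        if alert.fired > 0:
            rate = alert.acted_on / alert.fired
            if rate < min_action_rate:
                reasons.append(f"acted on {rate:.0%} of firings")
        if reasons:
            flagged[alert.name] = reasons
    return flagged


if __name__ == "__main__":
    window = [
        AlertStats("disk_full", fired=10, acted_on=9, has_runbook=True),
        AlertStats("cpu_high", fired=200, acted_on=4, has_runbook=False),
    ]
    for name, reasons in audit_alerts(window).items():
        print(f"{name}: {', '.join(reasons)}")
```

Alerts that never fired are not flagged here, since a rare but critical alert can still be worth keeping; that is a design choice the team may want to revisit.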

8. Outlook and Open Questions on Observability

Observability is often presented as a universal remedy, yet practical concerns remain:

What concrete problems does observability solve beyond proliferating dashboards?

When is deep profiling (e.g., continuous tracing) necessary, and which teams benefit?

Is observability merely a re‑branding of traditional monitoring with added tooling?

Answering these questions requires disciplined metric selection, clear ownership of data pipelines, and a willingness to retire unused instrumentation.
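
Retiring unused instrumentation starts with knowing what is unused. As a rough sketch, the function below compares the set of collected metric names with the identifiers that appear in dashboard and alert queries. It assumes the team can export both lists as plain strings; the tokenizing regex is a simplification that does not understand derived series (for example histogram `_bucket` suffixes) or recording rules, so its output is a list of candidates for review, not a deletion list.

```python
import re

# Simplified identifier pattern for metric names inside query strings.
_IDENT = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def unused_metrics(collected, queries):
    """Return collected metric names that no dashboard or alert query mentions."""
    referenced = set()
    for query in queries:
        referenced.update(_IDENT.findall(query))
    return sorted(set(collected) - referenced)


if __name__ == "__main__":
    collected = [
        "http_requests_total",
        "jvm_gc_pause_seconds",
        "node_cpu_seconds_total",
    ]
    queries = [
        'rate(http_requests_total{code=~"5.."}[5m])',
        "sum(rate(node_cpu_seconds_total[1m]))",
    ]
    print(unused_metrics(collected, queries))
```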

