
Why Traditional Monitoring Fails and Observability Is the Future for Ops Teams

Drawing from years of ops experience, the author recounts the decline of traditional monitoring, the rise of automated dashboards, the challenges of AIOps and observability, and proposes a shift toward data‑driven, business‑focused capability building to make alerts truly useful.


Based on years of working in operations, I have found that monitoring often ends up ineffective.

1. My Monitoring Story

I worked in ops for over two years before moving to ops platform development, watching monitoring systems become increasingly useless.

1.1 Useful Monitoring

When I was on‑call, I thought the monitoring system was adequate because the services were monolithic and required few alerts.

Initially the company used Nagios, which was hard to maintain. I then explored Zabbix, whose biggest advantage was auto-discovery and automatic registration of hosts to monitor. Later I set up an ELK stack to collect business logs, completing the monitoring setup.

Since we added few alerts, each alert had to be handled. The most common issue was bots crawling data, and I developed a reliable handling process:

1. Check metrics: if a business's load is high, there's a 90% chance it's caused by crawlers.

2. Check logs: use Kibana to view access records and identify the top IP ranges.

3. Block access: use iptables to block the offending IPs.
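As an illustration, that triage can be sketched in a few lines of Python. This is a hypothetical sketch, not the tooling we actually ran: it assumes nginx-style access-log lines whose first field is the client IP, and the request threshold is arbitrary.

```python
from collections import Counter

def top_ip_blocks(log_lines, threshold=1000):
    """Count requests per /24 prefix, assuming the first whitespace-separated
    field of each access-log line is the client IP (nginx 'combined' style)."""
    counts = Counter()
    for line in log_lines:
        ip = line.split()[0]
        parts = ip.split(".")
        if len(parts) == 4:
            counts[".".join(parts[:3]) + ".0/24"] += 1
    # Keep only prefixes whose request volume exceeds the threshold.
    return [(block, n) for block, n in counts.most_common() if n >= threshold]

def iptables_rules(blocks):
    """Emit iptables commands for an operator to review before running."""
    return [f"iptables -I INPUT -s {block} -j DROP" for block, _ in blocks]
```

The commands are printed rather than executed: blocking a whole /24 can hit legitimate users, so a human reviews the list first, which matches the manual process described above.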

This was my only real ops monitoring experience; because the business was simple, the original monitoring felt useful.

1.2 Useless Dashboards

1.2.1 Over‑Automation

After moving to ops development, the demand for monitoring changed. With improved automation, many open‑source monitoring tools became mature, and the platform started adding automatic binding of monitoring templates, alert templates, and Grafana dashboards to services.

The result was alert overload: the ops team ended up muting the company's alert SMS. Most problems were discovered by the product side first, with ops only intervening afterwards.

1.2.2 Pretty but Ineffective Dashboards

Because we collected massive metric data, we built Grafana dashboards for the business side. However, the dashboards were crowded, often causing the page to freeze, and developers could not tell which metrics were useful, forcing them to ask ops for help.

We also deployed ELK and handed Kibana to developers, but few used the dashboards beyond a handful of enthusiasts.

I believe ops' intentions are good, but in practice these dashboards benefit almost no one: even the ops engineers who build them rarely look at them.

1.3 No Qualitative Change

With the rise of Google SRE concepts, ops tried to adopt SLOs and SLIs, built alerts around the four golden signals (latency, traffic, errors, saturation), and introduced layered alert severities (P0, P1, P2). However, fast-moving microservice architectures and an insufficient ops-to-dev ratio caused these metrics to go stale quickly, leaving the alerts ineffective.
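As a sketch of how severity tiers like these can be derived from an SLI rather than hand-tuned per service, the following maps error-budget burn rate onto P0/P1/P2. The burn-rate thresholds here are illustrative assumptions, not values any team in this story actually used.

```python
def sli_availability(good_events, total_events):
    """SLI as the fraction of 'good' events (e.g. successful requests)."""
    return good_events / total_events if total_events else 1.0

def alert_severity(sli, slo=0.999):
    """Map how fast the error budget is burning onto severity tiers.
    The 10x/2x/1x thresholds below are illustrative, not prescriptive."""
    error_budget = 1.0 - slo                 # e.g. 0.1% of requests may fail
    burn = (1.0 - sli) / error_budget if error_budget else 0.0
    if burn >= 10:   # burning budget 10x faster than allowed
        return "P0"
    if burn >= 2:
        return "P1"
    if burn > 1:
        return "P2"
    return None      # within budget: no alert
```

The appeal of this style is that the SLO is the only number tied to the business; the severity tiers follow mechanically, so they do not go stale each time the service topology changes.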

2. Biases About Monitoring

From these failed experiences, I identified two fundamental problems:

- Attempting to generalize from isolated past incidents to predict future widespread issues, while ignoring complex temporal and spatial variations.

- Focusing on optimizing traditional probe models, graphical trends, and alert mechanisms, and continuously automating related processes.

These observations reflect my current understanding of monitoring, without claiming correctness.

2.1 AI Ops or Human‑Machine Interaction?

A former colleague who dreamed of "doing ops while sipping coffee" started researching AIOps, then left the team six months later. Most ops teams want root-cause analysis from AIOps, but the challenge feels like a mountain too steep to climb.

I often wonder: if ops themselves cannot always pinpoint the cause, why expect machines to?

Machines excel at analyzing massive amounts of data; humans excel at decision-making. I therefore think a human-machine interaction model, in which machines do the comprehensive data analysis and ops apply judgment to the results and make the decisions, may be more reliable.

However, this is difficult because SREs are increasingly focused on development, leaving little time for such decision processes.

2.2 Focus on Capability Building

Historically, monitoring systems were built vertically according to architecture layers, often reflecting Conway's law and role separation. Ops typically owned the lower layers, while higher‑level issues were passed to developers.

To break this cycle, I propose abandoning role-based system construction. Instead, build capabilities in stages aligned with business value: data collection, transmission, analysis, storage, and visualization. These capabilities can then evolve into a modern monitoring platform that centralizes rule computation and unified analysis.
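A minimal sketch of what staged capabilities feeding a centralized rule engine could look like. The stage functions and the rule shape here are entirely hypothetical; the point is that rules live in one place rather than being scattered per layer or per role.

```python
def collect(source):
    """Stage 1: gather raw metric samples from a source (stubbed here
    as a list of CPU utilization readings)."""
    return [{"metric": "cpu", "value": v} for v in source]

def analyze(samples, rules):
    """Centralized rule computation: every rule is evaluated against
    every sample in one place, regardless of which layer produced it."""
    alerts = []
    for sample in samples:
        for name, predicate in rules.items():
            if predicate(sample):
                alerts.append((name, sample))
    return alerts

# One shared rule set instead of per-team, per-layer alert configs.
rules = {"cpu_high": lambda s: s["metric"] == "cpu" and s["value"] > 0.9}
```

Transmission, storage, and visualization would slot in between these stages as the platform matures; the structure above only demonstrates the collection-to-analysis path.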

During capability building, platform teams should aim for a Thinnest Viable Platform (TVP), share best practices, and empower users, rather than preaching methodologies they never implement.

Nevertheless, many of these foundational capabilities already have commercial solutions, raising the question of whether a small team can outperform a large professional team, especially when the impact on business is limited.

The more wheels we reinvent, the less effective we become, circling around generic problems with low ROI. The solution is to focus on user and business needs, abandon experience-based generalization, and lean on data analysis and human-machine interaction, supported by AI and automation.

But this is hard, and I’m not doing monitoring anymore.

3. Outlook

Last year, observability became a hot topic. While I lack deep hands‑on experience, discussions reveal three recurring questions:

What problems does observability actually solve?

Many claim observability solves everything, yet when asked for concrete use cases, it often reverts to dashboards filled with metrics.

Why collect exhaustive profiling data in production?

The challenge is identifying which services truly need such fine‑grained observability and who the intended users are.

Is observability just new packaging of old ideas?

When explaining that observability comprises metrics, logs, and tracing, some veteran ops dismiss it as merely an enhanced version of existing monitoring.

Google’s “meaningful availability” raises the question of how to measure user‑level meaningful availability—a problem without a clear answer, but worth pondering.

Traditional monitoring is dead; observability is arriving. Share your stories in the comments.

Tags: monitoring, operations, observability, SRE, AIOps
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
