
Why Your Monitoring Fails and How to Build Effective Observability Data

Many companies deploy fragmented monitoring and observability tools yet still struggle to pinpoint incidents; this article analyzes the root causes—under‑utilized tools and scenario‑agnostic data—and offers practical steps to organize metrics, build layered insights, and improve fault‑resolution efficiency.

ITPUB

Feb 11, 2025

Many companies have built scattered monitoring and observability systems, but when an online incident occurs, locating the root cause remains difficult. This article examines why this happens and proposes concrete ideas to improve the situation.

Existing Tools Are Not Fully Utilized

Although enterprises often deploy tools such as Zabbix, Prometheus, Elasticsearch, and Jaeger, they rarely use them to their full depth. Common shortcomings include:

Metrics whose meaning is unclear, important indicators that are missing, and tags that are collected but never needed.

Alert rules that are not tailored to business needs.

Insufficient instrumentation of services (e.g., missing RED metrics, meaning Rate, Errors, and Duration, or no /varz endpoint).

Exporters that gather generic data while business-specific metrics (user registrations, order volume) are omitted.

Google SRE describes a /varz HTTP endpoint that exposes a service’s own monitoring data; many organizations have not implemented such self‑exposure.
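As a rough illustration of such self-exposure, the sketch below keeps RED counters and a few business counters in process and serves them at /varz using only the Python standard library. The JSON layout, the metric names, and the ServiceStats, VarzHandler, and serve names are assumptions made for this example; the article does not specify a format, and a production service would more likely use an established metrics client library.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ServiceStats:
    """Thread-safe in-process counters for RED and business metrics."""

    def __init__(self):
        self._lock = threading.Lock()
        self._requests = 0        # Rate: total requests seen
        self._errors = 0          # Errors: failed requests
        self._duration_sum = 0.0  # Duration: total seconds spent in requests
        self._business = {}       # hypothetical, e.g. {"user_registrations": 3}

    def observe_request(self, seconds, ok=True):
        """Record one handled request and whether it succeeded."""
        with self._lock:
            self._requests += 1
            self._duration_sum += seconds
            if not ok:
                self._errors += 1

    def count_business_event(self, name, n=1):
        """Increment a business counter such as registrations or orders."""
        with self._lock:
            self._business[name] = self._business.get(name, 0) + n

    def snapshot(self):
        """Return a plain dict that can be serialized for scraping."""
        with self._lock:
            avg = self._duration_sum / self._requests if self._requests else 0.0
            return {
                "requests_total": self._requests,
                "errors_total": self._errors,
                "request_duration_seconds_avg": avg,
                "business": dict(self._business),
            }


STATS = ServiceStats()


class VarzHandler(BaseHTTPRequestHandler):
    """Serves the current snapshot as JSON at /varz (assumed format)."""

    def do_GET(self):
        if self.path != "/varz":
            self.send_error(404)
            return
        body = json.dumps(STATS.snapshot()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Blocking call; run it in a background thread next to the real service."""
    HTTPServer(("127.0.0.1", port), VarzHandler).serve_forever()
```

Request handlers would call STATS.observe_request(...) around each request and STATS.count_business_event("order_created") at the relevant business step, so generic and business-specific data end up behind the same endpoint.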

Data Is Not Built Based on Real‑World Scenarios

Collecting raw data without a clear purpose yields only the first layer of information. To extract actionable insights, observability platforms must guide users toward higher-level information. The original article illustrates four information layers in a figure that is not reproduced here; the sections below discuss the data, feature, and viewpoint layers.

Presenting only raw data forces developers and operators to write complex queries themselves, which is impractical during an incident. Most companies stop at the first or second layer and miss the richer insights that the higher layers provide.

How to Build Higher‑Level Information

Both bottom‑up and top‑down approaches are viable. For example, a MySQL monitoring dashboard can show raw metrics (data layer) while a separate “feature” dashboard aggregates these into rankings or heatmaps, highlighting the most problematic instances at a glance.

Linking overview charts to detailed dashboards enables quick navigation from a high‑level view to specific instance metrics.
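The step from the data layer to a "feature" view can be sketched as a ranking over per-instance samples, with each ranked row carrying a link to its detail dashboard. The field names, the rank_worst and detail_link helpers, and the URL scheme are hypothetical; a real setup would query the metrics backend rather than use an in-memory list.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class InstanceSample:
    """One raw data-layer sample per MySQL instance (assumed fields)."""
    instance: str
    slow_queries_per_min: float
    connections_used_ratio: float  # 0.0 .. 1.0


def rank_worst(samples, key, top_n=3):
    """Return the top_n samples with the highest value of `key`, worst first."""
    return sorted(samples, key=lambda s: getattr(s, key), reverse=True)[:top_n]


def detail_link(instance):
    """Build a drill-down link; the path is a made-up example URL scheme."""
    return "/dashboards/mysql-detail?instance=" + quote(instance, safe="")


def overview_rows(samples, key, top_n=3):
    """Feature-layer rows: rank, instance, value, and link to the detail view."""
    ranked = rank_worst(samples, key, top_n)
    return [
        {
            "rank": i,
            "instance": s.instance,
            "value": getattr(s, key),
            "link": detail_link(s.instance),
        }
        for i, s in enumerate(ranked, start=1)
    ]
```

The same raw samples feed both views: the detail dashboard plots them per instance, while the overview shows only the few worst instances and a way to reach their details.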

Beyond metrics, “viewpoint” information combines multiple observability signals (order volume drops, service health, logs, traces) to answer questions such as:

Is the service’s external API functioning (RED indicators)?

Has the service undergone recent changes that might cause instability?

Is the dependent MySQL instance healthy?

Are there abnormal error logs?

Are downstream services operating normally (trace data)?
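One way to picture such a viewpoint is a small checklist evaluator that takes already-collected signals for a service and answers the questions above in order. This is only a sketch: the Signals fields, the evaluate_viewpoint function, and every threshold are illustrative assumptions, and fetching the signals from metrics, change logs, logs, and traces is left out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Signals:
    """Hypothetical pre-fetched signals for one service."""
    error_rate: float                 # fraction of failed requests (RED: Errors)
    p99_latency_ms: float             # RED: Duration
    minutes_since_last_change: float  # from a change or deploy log
    mysql_healthy: bool               # from the dependency dashboard
    error_logs_per_min: float         # from the log platform
    failing_downstreams: List[str] = field(default_factory=list)  # from traces


@dataclass
class Finding:
    question: str
    ok: bool
    detail: str


def evaluate_viewpoint(s: Signals) -> List[Finding]:
    """Answer the checklist in order; all thresholds are illustrative only."""
    return [
        Finding("External API functioning?",
                s.error_rate < 0.01 and s.p99_latency_ms < 500,
                f"error_rate={s.error_rate:.2%}, p99={s.p99_latency_ms:.0f}ms"),
        Finding("No recent change?",
                s.minutes_since_last_change >= 30,
                f"last change {s.minutes_since_last_change:.0f} min ago"),
        Finding("Dependent MySQL healthy?",
                s.mysql_healthy,
                "healthy" if s.mysql_healthy else "unhealthy"),
        Finding("Error logs normal?",
                s.error_logs_per_min < 10,
                f"{s.error_logs_per_min:.0f} error logs/min"),
        Finding("Downstream services normal?",
                not s.failing_downstreams,
                ", ".join(s.failing_downstreams) or "none failing"),
    ]


def suspects(findings: List[Finding]) -> List[str]:
    """Questions whose answer points at a likely cause."""
    return [f.question for f in findings if not f.ok]
```

For a service with an elevated error rate ten minutes after a release, suspects(evaluate_viewpoint(...)) would flag the API and the recent change together, which is the kind of combined answer a single metric chart cannot give.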

Organizing Data Sets and Building a Global Cockpit

By aggregating service‑level SLI data, change logs, dependency metrics, logs, and trace information into a unified data set, teams can create a global cockpit that hierarchically presents health status across business, system, subsystem, and service levels. When an incident occurs, the cockpit helps quickly pinpoint the affected service and drill down for detailed analysis.
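The hierarchical presentation can be sketched as a tree in which each level reports the worst health of its children, plus a helper that follows the worst branch down to a service. The Health levels, the Node type, and the "worst child wins" rollup rule are assumptions chosen for brevity; real cockpits often weight services by business impact instead.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Health(IntEnum):
    """Ordered so that a larger value means worse health."""
    OK = 0
    DEGRADED = 1
    DOWN = 2


@dataclass
class Node:
    """A business, system, subsystem, or service in the cockpit tree."""
    name: str
    own: Health = Health.OK  # only used for leaves (services)
    children: List["Node"] = field(default_factory=list)

    def health(self) -> Health:
        """Leaves report their own health; parents report their worst child."""
        if not self.children:
            return self.own
        return max(child.health() for child in self.children)


def worst_path(root: Node) -> List[str]:
    """Follow the unhealthiest child at each level down to a service."""
    path = [root.name]
    node = root
    while node.children:
        node = max(node.children, key=lambda c: c.health())
        path.append(node.name)
    return path
```

With a tree such as business "shop" containing system "checkout", subsystem "payments", and service "pay-api", a DOWN status on "pay-api" turns every ancestor red, and worst_path gives the drill-down route from the top-level view to that service.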

Automatic Insight Extraction Is Still Challenging

The ultimate goal of an observability product is to provide guidance automatically on how to limit the damage from an incident. Most organizations, however, lack a complete, well-governed data foundation, which makes full automation unrealistic. A pragmatic approach is to organize the data sets first, enhance the cockpit gradually, and let users derive insights manually in the meantime.

Summary

Fault location is difficult for two main reasons: (1) existing tools are not fully utilized, and (2) data is not built around concrete scenarios. The article suggests improving tool adoption, securing executive support, forming dedicated teams, integrating external expertise, and constructing layered, scenario-driven data pipelines to achieve more effective observability.
