Why Solid Monitoring Must Come Before Observability Projects (And How to Build It)
Before launching costly observability initiatives, make sure your monitoring is comprehensive and efficient. That means covering business, application, component, resource, network, and endpoint metrics, and having the data collection, storage, alerting, and event-distribution capabilities needed to turn raw signals into actionable insights.
Many companies rush into observability projects without first establishing solid monitoring, which leads to poor results and low acceptance from the business. Verify that monitoring is complete first: closing monitoring gaps offers a higher ROI than expanding straight to full observability.
Coverage Completeness
Monitoring should be divided into several categories, each requiring specific metrics and alerts.
Business Monitoring
Track business‑level indicators such as order volume; a sudden drop signals a problem that senior leadership will notice. These metrics often reside in relational or analytical databases, so the alert engine must be able to query OLTP/OLAP sources.
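As a rough illustration of an alert rule that queries a relational source directly, the sketch below compares order volume in the latest window with the window before it. The orders table, its created_at column, the 5-minute window, and the 50% threshold are all assumptions made for the example; an in-memory SQLite database stands in for the real OLTP/OLAP store.

```python
import sqlite3


def order_volume_dropped(conn, now_ts, window_s=300, drop_ratio=0.5):
    """Return True if the latest window has fewer than drop_ratio x the
    orders of the previous window. Table and column names are hypothetical."""

    def count(start, end):
        row = conn.execute(
            "SELECT COUNT(*) FROM orders WHERE created_at >= ? AND created_at < ?",
            (start, end),
        ).fetchone()
        return row[0]

    current = count(now_ts - window_s, now_ts)
    previous = count(now_ts - 2 * window_s, now_ts - window_s)
    if previous == 0:
        return False  # no baseline to compare against
    return current < previous * drop_ratio


# Demo data: 10 orders in the previous window, 3 in the current one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created_at INTEGER)")
now = 1_000_000
rows = [(now - 600 + i * 10,) for i in range(10)]
rows += [(now - 300 + i * 10,) for i in range(3)]
conn.executemany("INSERT INTO orders (created_at) VALUES (?)", rows)

dropped = order_volume_dropped(conn, now)
print("order volume dropped:", dropped)
```

Comparing against the previous window keeps the example short; a production rule would more likely compare against the same time on earlier days to account for daily traffic patterns.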
Application Monitoring
For web or RPC services, start with the RED metrics (Rate, Errors, Duration) and add Saturation (resource usage), one of the four golden signals from Google's SRE practice, to form the RED-S model. Saturation helps identify overload and capacity needs.
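The request, error, duration, and saturation signals can be derived from per-request records along these lines. This is a minimal sketch: the Request record, treating status codes of 500 and above as errors, the nearest-rank p95, and measuring saturation as in-flight requests over a concurrency limit are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def red_s_summary(requests, window_s, in_flight, max_in_flight):
    """Summarize one time window of requests into RED-S values."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_ms for r in requests)
    if durations:
        # Nearest-rank 95th percentile, using integer math for the rank.
        rank = max(1, (95 * total + 99) // 100)
        p95 = durations[rank - 1]
    else:
        p95 = 0.0
    return {
        "rate_rps": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "p95_ms": p95,
        "saturation": in_flight / max_in_flight,
    }


# Demo: 20 requests over 10 seconds, two of them server errors.
sample = [Request(duration_ms=10.0 * (i + 1), status=200) for i in range(20)]
sample[3].status = 500
sample[17].status = 503
summary = red_s_summary(sample, window_s=10, in_flight=8, max_in_flight=10)
print(summary)
```

In practice these values usually come from counters and histograms exposed by the service rather than from raw request lists, but the arithmetic is the same.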
Component Monitoring
Monitor middleware, databases, distributed storage, and Kubernetes, because their health directly affects the applications that depend on them. Understanding each component's internals is essential; for example, MySQL health can be examined with SHOW GLOBAL STATUS and related status commands.
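Most MySQL status values are cumulative counters, so a single reading says little; the useful numbers come from the difference between two samples. The sketch below assumes the status rows have already been fetched by some database driver as (name, value) string pairs, and the sample values are made up. Questions, Slow_queries, Threads_running, and Uptime are real status variable names.

```python
def status_to_dict(rows):
    """Convert (Variable_name, Value) rows from the status output into a dict."""
    return {name: value for name, value in rows}


def derive_rates(prev, curr):
    """Turn two status samples into per-second rates and current gauges."""
    elapsed = int(curr["Uptime"]) - int(prev["Uptime"])
    if elapsed <= 0:
        raise ValueError("samples out of order, or the server restarted")
    return {
        "qps": (int(curr["Questions"]) - int(prev["Questions"])) / elapsed,
        "slow_queries_per_s": (
            int(curr["Slow_queries"]) - int(prev["Slow_queries"])
        ) / elapsed,
        # Threads_running is a gauge, so the latest value is used as-is.
        "threads_running": int(curr["Threads_running"]),
    }


# Two made-up samples taken 60 seconds apart.
prev_sample = status_to_dict([
    ("Uptime", "1000"), ("Questions", "50000"),
    ("Slow_queries", "10"), ("Threads_running", "2"),
])
curr_sample = status_to_dict([
    ("Uptime", "1060"), ("Questions", "56000"),
    ("Slow_queries", "13"), ("Threads_running", "4"),
])
rates = derive_rates(prev_sample, curr_sample)
print(rates)
```

Using Uptime for the elapsed time also gives a cheap restart check: if it goes backwards, the counters have been reset and the sample pair should be discarded.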
Resource Monitoring
Observe the runtime environments (physical machines, VMs, and containers) by tracking CPU, memory, disk, and network usage, plus less obvious signals such as NTP clock offset, conntrack table usage, and vmstat counters.
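On Linux, many of these resource metrics come from text files under /proc. As a small sketch of what a collector does, the code below parses /proc/meminfo-style text and derives a memory usage percentage from MemTotal and MemAvailable. The sample text is invented so the example runs anywhere; on a real host the text would be read from /proc/meminfo.

```python
def parse_meminfo(text):
    """Parse 'Key:   value kB' lines into a dict of integers (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo):
    """Used memory as a percentage, based on MemAvailable rather than MemFree."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


SAMPLE_MEMINFO = """\
MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
HugePages_Total:       0
"""

meminfo = parse_meminfo(SAMPLE_MEMINFO)
used_pct = memory_used_percent(meminfo)
print(f"memory used: {used_pct:.1f}%")
```

MemAvailable is used instead of MemFree because page cache that the kernel can reclaim would otherwise make a healthy machine look almost full.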
Network Monitoring
Cover network devices, links, and external egress. Use tools such as pingmesh or eBPF to collect connectivity and quality data; internet‑facing services also need outbound and regional probing.
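A pingmesh-style check boils down to measuring reachability and latency between many pairs of endpoints. The sketch below is a heavily simplified, single-source version that times a TCP connect to each target; it is not the pingmesh implementation, and the function names and timeout are assumptions. A local listener is used as the target so the demo does not depend on any external host.

```python
import socket
import time


def tcp_probe(host, port, timeout_s=1.0):
    """Return TCP connect latency in milliseconds, or None on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None  # refused, unreachable, or timed out


def probe_targets(targets, timeout_s=1.0):
    """Probe each (host, port) pair and key the results by 'host:port'."""
    return {
        f"{host}:{port}": tcp_probe(host, port, timeout_s)
        for host, port in targets
    }


# Demo against a throwaway listener on the loopback interface.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen()
port = listener.getsockname()[1]

results = probe_targets([("127.0.0.1", port)])
listener.close()
print(results)
```

Running the same probe from every node toward every other node, and from several regions toward public endpoints, gives the mesh and regional views described above.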
Endpoint Monitoring
Collect client‑side data from apps, web pages, H5, or mini‑programs via instrumentation or SDKs, measuring page load time, interaction latency, and error rates.
Capability Completeness
Building a complete monitoring solution involves several technical layers.
Data Collection
Use agents and exporters such as Telegraf, Categraf, Grafana Agent, the Datadog Agent, Filebeat, Fluent Bit, or iLogtail to gather metrics and logs. Data that already lives in MySQL, Oracle, ClickHouse, or PostgreSQL does not need to be collected again; the alert engine can query it directly.
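Whatever agent is chosen, its job is to turn a measurement into a line in some wire format. As one example, the sketch below renders a sample in the Prometheus text exposition format, which several of these agents can scrape or emit. The metric name and labels are hypothetical, and real exporters would normally use a client library instead of formatting lines by hand.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline in a label value."""
    return (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )


def format_sample(name, labels, value):
    """Render one sample as: name{label="value",...} value"""
    if labels:
        body = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        return f"{name}{{{body}}} {value}"
    return f"{name} {value}"


line = format_sample(
    "orders_created_total", {"region": "eu", "channel": "web"}, 42
)
print(line)
```

Labels are sorted only to make the output deterministic; the format itself does not require any particular label order.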
Data Storage
For metrics, VictoriaMetrics is recommended; Prometheus is also viable, but it runs as a single node with no built-in clustering. For logs, Elasticsearch is the default choice; large volumes may require ClickHouse, while cost-sensitive setups can use Loki or OpenObserve with S3 back-ends.
Alert Engine
Open-source options include Grafana (strong visualization), Nightingale (good Prometheus compatibility), ElastAlert for alerting on Elasticsearch data, and ClickVisual for alerting on ClickHouse data.
Event Distribution
After alerts fire, handling steps such as deduplication, noise reduction, routing, on‑call scheduling, and escalation often rely on tools like PagerDuty or Opsgenie.
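Two of those steps, deduplication and routing, can be sketched in a few lines. The code below is an illustration only and does not use the PagerDuty or Opsgenie APIs; the alert fields, the (name, service) fingerprint, the 5-minute suppression window, and the team names are all assumptions.

```python
class Dispatcher:
    """Suppress repeated alerts and route the rest to an on-call team."""

    def __init__(self, routes, default_team, dedup_window_s=300):
        self.routes = routes              # service name -> team
        self.default_team = default_team
        self.dedup_window_s = dedup_window_s
        self._last_sent = {}              # fingerprint -> last notify time

    def dispatch(self, alert, now):
        """Return the team to notify, or None if the alert is a duplicate."""
        fingerprint = (alert["name"], alert["service"])
        last = self._last_sent.get(fingerprint)
        if last is not None and now - last < self.dedup_window_s:
            return None  # same alert was already sent within the window
        self._last_sent[fingerprint] = now
        return self.routes.get(alert["service"], self.default_team)


dispatcher = Dispatcher(
    routes={"checkout": "payments-oncall"}, default_team="sre-oncall"
)
alert = {"name": "HighErrorRate", "service": "checkout"}

first = dispatcher.dispatch(alert, now=0)     # routed
repeat = dispatcher.dispatch(alert, now=60)   # suppressed as a duplicate
later = dispatcher.dispatch(alert, now=400)   # window expired, routed again
other = dispatcher.dispatch({"name": "DiskFull", "service": "search"}, now=60)
print(first, repeat, later, other)
```

Escalation and on-call scheduling build on the same idea: the dispatcher would also track whether a notification was acknowledged and, if not, pick the next person in the rotation.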
Next Steps
Fill any missing data sources to achieve full coverage.
Integrate and correlate data across the monitoring stack.
Organize data per scenario to turn raw metrics into actionable insights.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
