Mastering Effective Monitoring: From Basics to the USE Method
This article explains the fundamentals of monitoring, distinguishes traditional OPS from SRE perspectives, defines monitoring objects and metrics, introduces quantitative thinking with SLI/SLO, and presents the USE method with a MySQL example to help engineers detect and prevent failures efficiently.
Purpose
Monitoring collects key hardware, software, and (optionally) user‑experience data, exposes it to operators, and generates alerts when values deviate from expected ranges. The goal is to enable rapid manual or automated response so that failures are detected and mitigated before they affect end users.
What to Monitor
A monitoring object is any resource whose state other components depend on. Typical categories are:
Hardware – servers, network devices, storage appliances.
Software – applications, infrastructure services.
For a server the concrete monitoring items usually include:
CPU usage and load
Memory consumption
Network interface traffic and errors
Disk I/O and storage capacity
Controller health (e.g., RAID, power)
Service Level Indicators (SLI) and Objectives (SLO)
A Service Level Indicator (SLI) is a quantitative measure of a specific aspect of service quality, computed from the metrics collected for a monitoring object. Common SLIs are:
User‑facing services : availability, latency, throughput.
Storage systems : latency, throughput, durability.
A Service Level Objective (SLO) defines an acceptable numeric range for an SLI. For example, an 8‑core CPU might have an SLO of 0.0 to 6.0 for average load; a web API could target 99.9% availability over a month. SLOs are tuned per resource type and business context.
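The relationship between an SLI and its SLO can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the `SLO` class, the `availability` helper, and the request counts are hypothetical, and treating a period with no traffic as fully available is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """Acceptable numeric range for one SLI (both bounds inclusive)."""
    name: str
    lower: float
    upper: float

    def is_met(self, sli_value: float) -> bool:
        return self.lower <= sli_value <= self.upper


def availability(successful: int, total: int) -> float:
    """Availability SLI as the percentage of successful requests."""
    if total == 0:
        return 100.0  # assumption: no traffic counts as fully available
    return 100.0 * successful / total


# The two example objectives from the text.
cpu_load_slo = SLO("average load, 8-core CPU", 0.0, 6.0)
api_slo = SLO("monthly API availability (%)", 99.9, 100.0)

print(cpu_load_slo.is_met(7.2))                          # False
print(api_slo.is_met(availability(999_500, 1_000_000)))  # True
```

Keeping the objective as data separate from the measurement makes it straightforward to tune thresholds per resource type without touching the collection code.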
Quantitative Fault Description
Instead of vague statements such as “the site is down,” use precise numbers that can be acted upon, e.g.:
Response time = 30 s (unusable)
Active threads = 200, reaching the configured max_threads
CPU load = 7.2 (above the SLO of 6.0)
Quantitative descriptions reduce information entropy and make root‑cause analysis faster.
USE Method for Fault Identification
The USE method (Utilization, Saturation, Error) provides a three‑dimensional checklist for each resource.
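Utilization – how much of the resource's capacity is in use (e.g., percentage of time the CPU is busy, percentage of memory or disk I/O bandwidth consumed).
Saturation – the degree to which the resource has more work than it can service, usually visible as queued or waiting work (e.g., CPU run-queue length, disk queue length, swapping when memory is exhausted).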
Error – count of error events (e.g., failed connections, I/O errors).
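One way to keep the checklist honest is to record which dimensions a resource already covers and which are still missing. The sketch below assumes a hypothetical `UseChecklist` structure; the metric labels are illustrative, not names from any particular collector.

```python
from dataclasses import dataclass, field
from enum import Enum


class Dimension(Enum):
    UTILIZATION = "utilization"
    SATURATION = "saturation"
    ERROR = "error"


@dataclass
class UseChecklist:
    resource: str
    # Maps each Dimension to the metric names chosen for it.
    metrics: dict = field(default_factory=dict)

    def add(self, dimension: Dimension, metric_name: str) -> None:
        self.metrics.setdefault(dimension, []).append(metric_name)

    def missing(self) -> list:
        """Dimensions that have no metric yet, in U, S, E order."""
        return [d for d in Dimension if not self.metrics.get(d)]


disk = UseChecklist("disk")
disk.add(Dimension.UTILIZATION, "percent of time busy")
disk.add(Dimension.ERROR, "I/O error count")

print([d.value for d in disk.missing()])  # ['saturation']
```

A resource with a non-empty `missing()` result has a blind spot: here the disk could be queueing requests without any metric showing it.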
Applying USE to a concrete system yields a set of metric‑to‑dimension mappings. The following MySQL HA example illustrates this mapping.
MySQL Monitoring Example
Business
Questions – total statements executed (Throughput)
Slow_queries – number of slow queries (Error)
Com_select – SELECT statements count (Throughput)
Com_insert – INSERT statements count (Throughput)
Com_update – UPDATE statements count (Throughput)
Threads & Connections
Threads_connected – current connections (Utilization)
Threads_running – active threads (Utilization)
Aborted_connects – failed connection attempts (Error)
Connection_errors_max_connections – connections rejected due to max‑connections limit (Error)
Buffer
Innodb_buffer_pool_pages_total – pages allocated in the buffer pool (Utilization)
Innodb_buffer_pool_read_requests – total read requests to the buffer pool (Utilization)
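Counters such as Questions and Aborted_connects are cumulative, so a monitoring system usually turns two successive samples into a rate before comparing against a threshold. The sketch below assumes the samples were already collected (for example from `SHOW GLOBAL STATUS`) into plain dictionaries; the sample values, the 60-second interval, and the helper names are hypothetical. It also relates Threads_connected to the `max_connections` server setting to express connection usage as a percentage.

```python
def per_second_rate(previous: int, current: int, interval_s: float) -> float:
    """Rate of a cumulative counter between two samples."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    delta = current - previous
    if delta < 0:
        # Counter went backwards: assume the server restarted and the
        # counter began again from zero.
        delta = current
    return delta / interval_s


def connection_utilization(threads_connected: int, max_connections: int) -> float:
    """Current connections as a percentage of the configured limit."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return 100.0 * threads_connected / max_connections


# Hypothetical samples taken 60 seconds apart.
sample_t0 = {"Questions": 1_200_000, "Aborted_connects": 40}
sample_t1 = {"Questions": 1_206_000, "Aborted_connects": 43}

qps = per_second_rate(sample_t0["Questions"], sample_t1["Questions"], 60)
aborted = per_second_rate(sample_t0["Aborted_connects"],
                          sample_t1["Aborted_connects"], 60)

print(qps)                               # 100.0
print(aborted)                           # 0.05
print(connection_utilization(150, 200))  # 75.0
```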
Practical Workflow
Identify the concrete resources to monitor (servers, databases, network devices, etc.).
Select a minimal set of metrics that cover Utilization, Saturation, and Error for each resource.
Define SLIs for each metric and set SLO thresholds that reflect business‑level reliability goals.
Instrument the metrics using a monitoring system (e.g., Prometheus, Zabbix, or any collector that can expose the chosen counters).
Configure alerts that fire when a metric crosses its SLO boundary, including severity levels for Utilization vs. Error conditions.
Continuously review alert noise, adjust thresholds, and add new metrics as the system evolves.
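The alerting step of this workflow can be sketched as a small rule evaluator that returns a quantified message rather than a bare "alarm". Everything here is illustrative: the `AlertRule` structure, the metric labels, and all thresholds except the 6.0 load limit (taken from the SLO example earlier) are assumptions, and a real monitoring system would express the same idea in its own rule syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AlertRule:
    metric: str
    dimension: str   # "utilization", "saturation", or "error"
    warning: float   # value at which a warning fires
    critical: float  # value at which a critical alert fires


def evaluate(rule: AlertRule, value: float) -> Optional[str]:
    """Return a quantified alert message, or None if the value is in range."""
    if value >= rule.critical:
        severity, limit = "CRITICAL", rule.critical
    elif value >= rule.warning:
        severity, limit = "WARNING", rule.warning
    else:
        return None
    return f"{severity}: {rule.metric} = {value} ({rule.dimension}), threshold {limit}"


# Hypothetical rules: load on an 8-core host, and rejected connections per minute.
cpu_rule = AlertRule("cpu_load", "saturation", warning=6.0, critical=8.0)
rejected_rule = AlertRule("rejected_connections_per_min", "error",
                          warning=1, critical=10)

print(evaluate(cpu_rule, 7.2))  # WARNING: cpu_load = 7.2 (saturation), threshold 6.0
print(evaluate(cpu_rule, 3.1))  # None
```

Error rules in this sketch warn at the first occurrence, while the load rule tolerates values up to its SLO boundary; that asymmetry is one simple way to give error conditions a higher effective severity.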
Key Takeaways
By defining clear monitoring objects, quantifying health with SLIs/SLOs, and applying the USE checklist, engineers can shift from reactive alarm handling to proactive fault prevention, lengthening mean time to failure (MTTF) and shortening mean time to recovery (MTTR).
ITPUB
