
Mastering Effective Monitoring: From Basics to the USE Method

This article explains the fundamentals of monitoring, distinguishes traditional OPS from SRE perspectives, defines monitoring objects and metrics, introduces quantitative thinking with SLI/SLO, and presents the USE method with a MySQL example to help engineers detect and prevent failures efficiently.

ITPUB

Purpose

Monitoring collects key hardware, software, and (optionally) user‑experience data, exposes it to operators, and generates alerts when values deviate from expected ranges. The goal is to enable rapid manual or automated response so that failures are detected and mitigated before they affect end users.

What to Monitor

A monitoring object is any resource that provides attributes required by other components. Typical categories are:

Hardware – servers, network devices, storage appliances.

Software – applications, infrastructure services.

For a server the concrete monitoring items usually include:

CPU usage and load

Memory consumption

Network interface traffic and errors

Disk I/O and storage capacity

Controller health (e.g., RAID, power)

Service Level Indicators (SLI) and Objectives (SLO)

A Service Level Indicator (SLI) is a quantitative measure of a specific aspect of service quality. Each monitored metric either is an SLI or feeds into one. Common SLIs are:

User‑facing services : availability, latency, throughput.

Storage systems : latency, throughput, durability.

A Service Level Objective (SLO) defines an acceptable numeric range for an SLI. For example, an 8‑core server might have an SLO of 0.0‑6.0 for average load; a web API could target 99.9% availability over a month. SLOs are tuned per resource type and business context.
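The two examples above can be sketched in a few lines of Python. This is only an illustration: `RangeSLO` and `allowed_downtime_minutes` are hypothetical helper names, and the month is assumed to be 30 days long.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeSLO:
    """An acceptable numeric range for one SLI (hypothetical helper type)."""
    name: str
    low: float
    high: float

    def is_met(self, value: float) -> bool:
        # The SLO holds when the measured SLI lies inside the closed range.
        return self.low <= value <= self.high


def allowed_downtime_minutes(availability_target: float, window_days: int) -> float:
    """Downtime budget implied by an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - availability_target)


cpu_load_slo = RangeSLO(name="cpu_load_8_core", low=0.0, high=6.0)
print(cpu_load_slo.is_met(4.5))   # True
print(cpu_load_slo.is_met(7.2))   # False
# 99.9% over an assumed 30-day month leaves about 43.2 minutes of downtime.
print(round(allowed_downtime_minutes(0.999, 30), 1))  # 43.2
```

Expressing the availability target as a downtime budget makes the SLO easier to reason about: 99.9% sounds abstract, while roughly 43 minutes per month is something a team can plan maintenance around.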

Quantitative Fault Description

Instead of vague statements such as “the site is down,” use precise numbers that can be acted upon, e.g.:

Response time = 30 s (unusable)

Active threads = 200, reaching the configured max_threads

CPU load = 7.2 (above the SLO of 6.0)

Quantitative descriptions reduce information entropy and make root‑cause analysis faster.
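As a rough sketch of how such statements could be produced automatically, the function below compares observed values with their limits and emits one precise line per breach. The function name, the metric keys, and the 2-second response-time limit are assumptions made for illustration; only the thread and CPU load figures come from the examples above.

```python
def describe_breaches(observed: dict[str, float], limits: dict[str, float]) -> list[str]:
    """Return one precise statement per metric at or above its limit."""
    report = []
    for metric, value in observed.items():
        limit = limits.get(metric)
        if limit is not None and value >= limit:
            report.append(f"{metric} = {value:g} (limit {limit:g})")
    return report


observed = {"response_time_s": 30, "active_threads": 200, "cpu_load": 7.2}
# The response-time limit of 2 s is an assumed value, not taken from the article.
limits = {"response_time_s": 2, "active_threads": 200, "cpu_load": 6.0}

for line in describe_breaches(observed, limits):
    print(line)
```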

USE Method for Fault Identification

The USE method (Utilization, Saturation, Error) provides a three‑dimensional checklist for each resource.

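Utilization – the share of the resource's capacity that is in use, or the fraction of time it is busy (e.g., CPU busy percent, disk busy time, memory in use).

Saturation – the amount of work the resource cannot serve yet and has to queue (e.g., CPU run‑queue length, disk queue length, swapping under memory pressure).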

Error – count of error events (e.g., failed connections, I/O errors).
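One way to use the checklist is to record, per resource, which metric covers each dimension and then look for gaps. The sketch below assumes that approach; the metric names are placeholders, not the names used by any particular monitoring tool.

```python
from enum import Enum


class Dimension(Enum):
    UTILIZATION = "utilization"
    SATURATION = "saturation"
    ERROR = "error"


def missing_dimensions(checklist: dict[str, dict[Dimension, str]]) -> dict[str, list[str]]:
    """For each resource, list the USE dimensions that have no metric yet."""
    gaps = {}
    for resource, metrics in checklist.items():
        absent = [d.value for d in Dimension if d not in metrics]
        if absent:
            gaps[resource] = absent
    return gaps


# Placeholder metric names chosen for illustration only.
checklist = {
    "cpu": {
        Dimension.UTILIZATION: "cpu_busy_percent",
        Dimension.SATURATION: "run_queue_length",
        Dimension.ERROR: "machine_check_events",
    },
    "disk": {
        Dimension.UTILIZATION: "io_busy_percent",
        Dimension.SATURATION: "io_queue_length",
    },
}
print(missing_dimensions(checklist))  # {'disk': ['error']}
```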

Applying USE to a concrete system yields a set of metric‑to‑dimension mappings. The following MySQL HA example illustrates this mapping.

MySQL Monitoring Example

Business

Questions – total statements executed (Throughput)

Slow_queries – number of slow queries (Error)

Com_select – SELECT statements count (Throughput)

Com_insert – INSERT statements count (Throughput)

Com_update – UPDATE statements count (Throughput)

Threads & Connections

Threads_connected – current connections (Utilization)

Threads_running – active threads (Utilization)

Aborted_connects – failed connection attempts (Error)

Connection_errors_max_connections – connections rejected due to max‑connections limit (Error)

Buffer

Innodb_buffer_pool_pages_total – pages allocated in the buffer pool (Utilization)

Innodb_buffer_pool_read_requests – total read requests to the buffer pool (Utilization)
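Several of the variables above (Questions, Slow_queries, the Com_* family, Aborted_connects, Connection_errors_max_connections, Innodb_buffer_pool_read_requests) are cumulative counters, so a single reading says little; what matters is the change between two samples. The following sketch, with a hypothetical `per_second_rates` helper and made-up sample values, shows the conversion. It assumes the snapshots were taken from `SHOW GLOBAL STATUS` some known number of seconds apart and does not connect to a database itself.

```python
# Status variables from the example that only ever increase (cumulative counters).
COUNTERS = (
    "Questions", "Slow_queries", "Com_select", "Com_insert", "Com_update",
    "Aborted_connects", "Connection_errors_max_connections",
    "Innodb_buffer_pool_read_requests",
)


def per_second_rates(earlier: dict[str, int], later: dict[str, int],
                     interval_s: float) -> dict[str, float]:
    """Convert two counter snapshots into per-second rates."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    rates = {}
    for name in COUNTERS:
        if name in earlier and name in later:
            delta = later[name] - earlier[name]
            # A negative delta means the counter was reset (e.g. a restart); skip it.
            if delta >= 0:
                rates[name] = delta / interval_s
    return rates


# Made-up snapshot values, as if sampled 60 seconds apart.
t0 = {"Questions": 120_000, "Slow_queries": 40, "Threads_running": 12}
t1 = {"Questions": 126_000, "Slow_queries": 43, "Threads_running": 15}
print(per_second_rates(t0, t1, 60))  # {'Questions': 100.0, 'Slow_queries': 0.05}
```

Gauges such as Threads_connected and Threads_running are deliberately left out of `COUNTERS`: they already describe the current state and can be compared with a threshold directly.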

Practical Workflow

Identify the concrete resources to monitor (servers, databases, network devices, etc.).

Select a minimal set of metrics that cover Utilization, Saturation, and Error for each resource.

Define SLIs for each metric and set SLO thresholds that reflect business‑level reliability goals.

Instrument the metrics using a monitoring system (e.g., Prometheus, Zabbix, or any collector that can expose the chosen counters).

Configure alerts that fire when a metric crosses its SLO boundary, including severity levels for Utilization vs. Error conditions.

Continuously review alert noise, adjust thresholds, and add new metrics as the system evolves.
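The alerting step can be sketched as a small rule evaluator. Everything here is illustrative: `AlertRule`, the severity policy (errors are critical, utilization and saturation only warn), the rate metric name, and the threshold values other than the CPU load limit of 6.0 are assumptions, and a real deployment would express the same idea in the alerting language of the chosen monitoring system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    metric: str
    dimension: str      # "utilization", "saturation", or "error"
    threshold: float    # alert fires when the value exceeds this SLO boundary


# Assumed policy: error conditions are critical, the other dimensions only warn.
SEVERITY_BY_DIMENSION = {
    "utilization": "warning",
    "saturation": "warning",
    "error": "critical",
}


def evaluate(rules: list[AlertRule], values: dict[str, float]) -> list[tuple[str, str]]:
    """Return (severity, metric) for every rule whose threshold is exceeded."""
    fired = []
    for rule in rules:
        value = values.get(rule.metric)
        if value is not None and value > rule.threshold:
            fired.append((SEVERITY_BY_DIMENSION[rule.dimension], rule.metric))
    return fired


rules = [
    AlertRule("cpu_load", "utilization", 6.0),
    AlertRule("Aborted_connects_per_s", "error", 1.0),
    AlertRule("Threads_connected", "utilization", 180),
]
values = {"cpu_load": 7.2, "Aborted_connects_per_s": 0.2, "Threads_connected": 200}
print(evaluate(rules, values))
# [('warning', 'cpu_load'), ('warning', 'Threads_connected')]
```

Keeping the severity policy in one table makes the review step easier: when alerts turn out to be too noisy, the thresholds or the policy can be adjusted without touching the evaluation logic.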

Key Takeaways

By defining clear monitoring objects, quantifying health with SLIs/SLOs, and applying the USE checklist, engineers can shift from reactive alarm handling to proactive fault prevention, increasing mean time to failure (MTTF) and reducing mean time to recovery (MTTR).

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
