Essential Linux Operations Metrics for Effective Monitoring
This guide lists the key Linux system metrics (CPU, memory, disk, I/O, network, kernel parameters, RAID, SMART, NTP, and process information) that the open-falcon agent collects every minute for operations monitoring and timely issue detection.
1. Linux Operations Basic Collection Items
Effective monitoring requires collecting as many relevant metrics as possible; the following categories are commonly used by engineers.
CPU
Load
Memory
Disk
IO
Network
Kernel parameters
ss statistics
Port collection
Core service process health
Critical business process resource consumption
NTP offset
DNS resolution
All these metrics are directly supported by the open-falcon agent, which collects them every 60 seconds.
2. CPU Metrics
cpu.idle – percentage of idle time without outstanding disk I/O.
cpu.busy – 100 minus cpu.idle.
cpu.guest – time spent running virtual processors.
cpu.iowait – idle time while disk I/O is pending.
cpu.irq – time servicing hardware interrupts.
cpu.softirq – time servicing software interrupts.
cpu.nice – user-mode CPU time spent on processes running at a lowered (niced) priority.
cpu.steal – involuntary wait time for virtual CPUs.
cpu.system – CPU usage in kernel mode.
cpu.user – CPU usage in user mode.
cpu.cnt – number of CPU cores.
cpu.switches – number of context switches.
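On Linux, percentages like these are usually computed from the difference between two readings of the aggregate cpu line in /proc/stat. The following is a minimal sketch of that calculation, not the agent's own code. The function names are hypothetical, and leaving guest time out of the total (the kernel already counts it inside user time) is an assumption that may differ from how the agent sums the columns.

```python
# Column order of the aggregate "cpu" line in /proc/stat (see proc(5)).
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest")


def parse_cpu_line(line):
    """Turn a 'cpu  ...' line into a dict of cumulative tick counters."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    # Older kernels print fewer columns; treat missing ones as zero.
    values += [0] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))


def cpu_percentages(prev, curr):
    """Percent of elapsed ticks spent in each state between two samples."""
    deltas = {name: curr[name] - prev[name] for name in FIELDS}
    # Assumption: guest ticks are already part of user ticks, so skip them here.
    total = sum(v for name, v in deltas.items() if name != "guest")
    if total <= 0:
        return {}
    result = {"cpu." + name: 100.0 * delta / total
              for name, delta in deltas.items()}
    result["cpu.busy"] = 100.0 - result["cpu.idle"]
    return result


def read_cpu_sample(path="/proc/stat"):
    """Read the current cumulative counters from the first line of /proc/stat."""
    with open(path) as f:
        return parse_cpu_line(f.readline())
```

Calling read_cpu_sample() twice, one collection interval apart, and passing both results to cpu_percentages() yields values for that interval.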
3. Disk Metrics
Metrics are derived from /proc/mounts and syscall.Statfs_t, with tags such as mount and fstype.
df.bytes.free – free bytes (int64).
df.bytes.free.percent – free space percentage (float64).
df.bytes.total – total bytes (int64).
df.bytes.used – used bytes (int64).
df.bytes.used.percent – used space percentage (float64).
df.inodes.total – total inodes (int64).
df.inodes.free – free inodes (int64).
df.inodes.free.percent – free inode percentage (float64).
df.inodes.used – used inodes (int64).
df.inodes.used.percent – used inode percentage (float64).
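The relationship between these values can be illustrated with Python's os.statvfs, which exposes the same statfs fields. This is a sketch under stated assumptions: the function names are hypothetical, and computing the used percentage as used / (used + available), the way df(1) does, is an assumption about the agent's formula.

```python
import os


def _percent(part, whole):
    return 100.0 * part / whole if whole else 0.0


def df_metrics(frsize, blocks, bfree, bavail, files, ffree):
    """Derive df.* values from raw statfs counters."""
    used_blocks = blocks - bfree
    # bavail excludes blocks reserved for root, so base the ratio on what
    # unprivileged users can actually see (assumed, mirrors df(1)).
    used_pct = _percent(used_blocks, used_blocks + bavail)
    inodes_used = files - ffree
    inodes_used_pct = _percent(inodes_used, files)
    return {
        "df.bytes.total": blocks * frsize,
        "df.bytes.used": used_blocks * frsize,
        "df.bytes.free": bavail * frsize,
        "df.bytes.used.percent": used_pct,
        "df.bytes.free.percent": 100.0 - used_pct,
        "df.inodes.total": files,
        "df.inodes.used": inodes_used,
        "df.inodes.free": ffree,
        "df.inodes.used.percent": inodes_used_pct,
        "df.inodes.free.percent": 100.0 - inodes_used_pct,
    }


def df_metrics_for(mount_point):
    """Collect the values for one mount point, e.g. an entry from /proc/mounts."""
    st = os.statvfs(mount_point)
    return df_metrics(st.f_frsize, st.f_blocks, st.f_bfree,
                      st.f_bavail, st.f_files, st.f_ffree)
```

A collector would call df_metrics_for() once per mount point and attach the mount and fstype tags to each value.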
4. megacli RAID Metrics
sys.disk.lsiraid.pd.Media_Error_Count
sys.disk.lsiraid.pd.Other_Error_Count
sys.disk.lsiraid.pd.Predictive_Failure_Count
sys.disk.lsiraid.pd.Drive_Temperature
sys.disk.lsiraid.pd.Firmware_state – non-zero indicates a problem.
sys.disk.lsiraid.vd.cache_policy – non-zero indicates cache mismatch.
sys.disk.lsiraid.vd.state – non-zero indicates logical disk issue.
5. SMART Metrics
sys.disk.smart.Reallocated_Sector_Ct
sys.disk.smart.Spin_Retry_Count
sys.disk.smart.Reallocated_Event_Count
sys.disk.smart.Current_Pending_Sector
sys.disk.smart.Offline_Uncorrectable
sys.disk.smart.Temperature_Celsius
6. Partition Read/Write Monitoring
sys.disk.rw – non-zero indicates read/write issue on the partition.
7. IO Metrics
disk.io.ios_in_progress – current I/O requests.
disk.io.msec_read – total milliseconds spent reading.
disk.io.msec_write – total milliseconds spent writing.
disk.io.msec_total – time with at least one I/O in flight.
disk.io.msec_weighted_total – weighted I/O time.
disk.io.read_merged, disk.io.write_merged – merged requests.
disk.io.read_requests, disk.io.write_requests – number of reads/writes.
disk.io.read_sectors, disk.io.write_sectors – sectors transferred.
disk.io.read_bytes, disk.io.write_bytes – bytes transferred.
disk.io.avgrq_sz, disk.io.avgqu-sz, disk.io.await, disk.io.svctm, disk.io.util – standard iostat metrics.
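The raw counters above correspond to the columns of /proc/diskstats, and the iostat-style values are derived from the change in those counters over one interval. The sketch below shows one common way to do that arithmetic. The function names are hypothetical and the formulas follow the usual iostat definitions, which is an assumption about what the agent computes.

```python
# Counters that follow the device name in /proc/diskstats, in order.
COUNTERS = ("read_requests", "read_merged", "read_sectors", "msec_read",
            "write_requests", "write_merged", "write_sectors", "msec_write",
            "ios_in_progress", "msec_total", "msec_weighted_total")
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors


def parse_diskstats_line(line):
    """Return (device_name, counters) for one /proc/diskstats line."""
    fields = line.split()
    counters = [int(v) for v in fields[3:3 + len(COUNTERS)]]
    return fields[2], dict(zip(COUNTERS, counters))


def io_rates(prev, curr, elapsed_ms):
    """Derive interval values from two counter snapshots of the same device."""
    # ios_in_progress is a gauge, not a cumulative counter, so skip it.
    d = {k: curr[k] - prev[k] for k in COUNTERS if k != "ios_in_progress"}
    ios = d["read_requests"] + d["write_requests"]
    sectors = d["read_sectors"] + d["write_sectors"]

    def per_io(value):
        return value / ios if ios else 0.0

    return {
        "disk.io.read_bytes": d["read_sectors"] * SECTOR_BYTES,
        "disk.io.write_bytes": d["write_sectors"] * SECTOR_BYTES,
        "disk.io.avgrq_sz": per_io(sectors),                      # sectors per request
        "disk.io.avgqu-sz": d["msec_weighted_total"] / elapsed_ms,
        "disk.io.await": per_io(d["msec_read"] + d["msec_write"]),  # ms per request
        "disk.io.svctm": per_io(d["msec_total"]),                 # ms per request
        "disk.io.util": 100.0 * d["msec_total"] / elapsed_ms,     # percent busy
    }
```

Here the byte values are the amounts transferred during the interval, not running totals.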
8. Load Metrics
load.1min
load.5min
load.15min
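The three load averages are the first three fields of /proc/loadavg. A minimal sketch of the mapping, with a hypothetical function name:

```python
def parse_loadavg(text):
    """Map the first three fields of /proc/loadavg to the load.* metrics."""
    one, five, fifteen = (float(v) for v in text.split()[:3])
    return {"load.1min": one, "load.5min": five, "load.15min": fifteen}


def read_loadavg(path="/proc/loadavg"):
    with open(path) as f:
        return parse_loadavg(f.read())
```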
9. Memory Metrics
mem.memtotal – total memory.
mem.memused – used memory.
mem.memused.percent – used memory percentage.
mem.memfree – free memory.
mem.memfree.percent – free memory percentage.
mem.swaptotal, mem.swapused, mem.swapused.percent, mem.swapfree, mem.swapfree.percent – swap statistics.
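Values like these are typically read from /proc/meminfo. The sketch below shows one plausible derivation. The function names are hypothetical, and counting Buffers and Cached as free memory (because the kernel can reclaim them) is an assumption; an agent could just as well rely on MemAvailable or on MemFree alone.

```python
def parse_meminfo(text):
    """Return /proc/meminfo values keyed by field name, converted to bytes."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        # Most entries are in kB; a few (such as HugePages_Total) are plain counts.
        scale = 1024 if len(fields) > 1 and fields[1] == "kB" else 1
        values[name.strip()] = int(fields[0]) * scale
    return values


def _percent(part, whole):
    return 100.0 * part / whole if whole else 0.0


def mem_metrics(info):
    """Derive mem.* values from a parsed /proc/meminfo dict."""
    total = info["MemTotal"]
    # Assumption: buffers and page cache are reclaimable, so treat them as free.
    free = info["MemFree"] + info.get("Buffers", 0) + info.get("Cached", 0)
    used = total - free
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    swap_used = swap_total - swap_free
    return {
        "mem.memtotal": total,
        "mem.memused": used,
        "mem.memfree": free,
        "mem.memused.percent": _percent(used, total),
        "mem.memfree.percent": _percent(free, total),
        "mem.swaptotal": swap_total,
        "mem.swapused": swap_used,
        "mem.swapfree": swap_free,
        "mem.swapused.percent": _percent(swap_used, swap_total),
        "mem.swapfree.percent": _percent(swap_free, swap_total),
    }
```

The percent helper returns 0.0 when the total is zero, which keeps hosts without swap from raising a division error.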
10. Network Metrics
Collected from /proc/net/dev with iface tag.
net.if.in.bytes, net.if.out.bytes, and net.if.total.bytes report traffic volume per interface.
The agent also exposes error, drop, packet, and compressed counters for inbound and outbound traffic.
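As an illustration of where these counters come from, the sketch below parses the text of /proc/net/dev into per-interface values. The function name is hypothetical. Only the three bytes metrics are named in this article; the packets, errors, and dropped names follow the same pattern and are assumptions.

```python
# First four columns of both the receive and the transmit half of /proc/net/dev.
KINDS = ("bytes", "packets", "errors", "dropped")


def parse_net_dev(text):
    """Return {iface: {metric: value}} from the contents of /proc/net/dev."""
    stats = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines carry no "iface:" prefix
        iface, _, rest = line.partition(":")
        nums = [int(v) for v in rest.split()]
        rx, tx = nums[0:4], nums[8:12]  # 8 receive columns, then transmit
        metrics = {}
        for kind, rx_value, tx_value in zip(KINDS, rx, tx):
            metrics["net.if.in." + kind] = rx_value
            metrics["net.if.out." + kind] = tx_value
            metrics["net.if.total." + kind] = rx_value + tx_value
        stats[iface.strip()] = metrics
    return stats
```

These are cumulative counters since boot, so a collector would normally report the difference between two readings.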
11. Port Monitoring
net.port.listen – 1 if the port is listening, 0 otherwise.
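One way to produce a 1/0 value like this is to look for sockets in the LISTEN state (hex code 0A) in /proc/net/tcp. The sketch below takes that approach; it is not necessarily how the agent detects listening ports, and the function names are hypothetical. IPv6 listeners appear in /proc/net/tcp6, which uses the same column layout.

```python
TCP_LISTEN = "0A"  # state code for LISTEN in /proc/net/tcp


def listening_ports(proc_net_tcp_text):
    """Return the set of local TCP ports in LISTEN state."""
    ports = set()
    for line in proc_net_tcp_text.splitlines():
        fields = line.split()
        # Data rows start with a slot number such as "0:"; the header does not.
        if len(fields) < 4 or not fields[0].endswith(":"):
            continue
        local_address, state = fields[1], fields[3]
        if state == TCP_LISTEN:
            # The address is "HEXIP:HEXPORT"; the port is the part after the colon.
            ports.add(int(local_address.rsplit(":", 1)[1], 16))
    return ports


def net_port_listen(port, ports):
    """1 if the port is listening, 0 otherwise."""
    return 1 if port in ports else 0
```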
12. Kernel Configuration
kernel.maxfiles – /proc/sys/fs/file-max.
kernel.files.allocated – first field of /proc/sys/fs/file-nr.
kernel.files.left – maxfiles minus allocated.
kernel.maxproc – /proc/sys/kernel/pid_max.
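The file-handle arithmetic can be sketched as follows, using the two /proc files named above. The function names are hypothetical.

```python
def kernel_file_metrics(file_max_text, file_nr_text):
    """Derive the kernel file-handle metrics from the two /proc files' contents."""
    max_files = int(file_max_text.split()[0])
    # file-nr holds three numbers: allocated handles, unused handles, maximum.
    allocated = int(file_nr_text.split()[0])
    return {
        "kernel.maxfiles": max_files,
        "kernel.files.allocated": allocated,
        "kernel.files.left": max_files - allocated,
    }


def read_kernel_file_metrics():
    with open("/proc/sys/fs/file-max") as f_max, open("/proc/sys/fs/file-nr") as f_nr:
        return kernel_file_metrics(f_max.read(), f_nr.read())
```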
13. NTP Offset
sys.ntp.offset – offset in ms; large or zero values indicate anomalies.
14. Process Monitoring
proc.num – number of processes matching name or cmdline.
15. Process Resource Metrics
process.cpu.all, process.cpu.sys, process.cpu.user – CPU usage in jiffies.
process.swap – swap usage in pages.
process.fd – file descriptor count.
process.mem – memory usage in bytes.
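Per-process CPU time in jiffies is available from /proc/[pid]/stat, where utime and stime are fields 14 and 15. The sketch below extracts them; the function names are hypothetical, and treating process.cpu.all as the sum of user and system time is an assumption.

```python
def parse_proc_stat_cpu(stat_text):
    """Extract cumulative CPU jiffies from the contents of /proc/[pid]/stat."""
    # The command name sits in parentheses and may itself contain spaces or
    # parentheses, so split after the last ')' before tokenizing.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state), so fields 14 and 15 are rest[11] and rest[12].
    utime, stime = int(rest[11]), int(rest[12])
    return {
        "process.cpu.user": utime,
        "process.cpu.sys": stime,
        "process.cpu.all": utime + stime,  # assumed to be user + system
    }


def read_proc_stat_cpu(pid):
    with open("/proc/%d/stat" % pid) as f:
        return parse_proc_stat_cpu(f.read())
```

These are totals since the process started, so the usage during an interval is the difference between two readings.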
16. ss Command Output
ss.orphaned, ss.closed, ss.timewait, ss.slabinfo.timewait, ss.synrecv, ss.estab – TCP socket counts by state, as reported in the summary output of ss -s.
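These names line up with the TCP line of the summary printed by ss -s. The sketch below parses that line. It assumes the older iproute2 layout, in which timewait is printed as two numbers separated by a slash and the second is the slab count; newer releases may omit some of these fields. The function name is hypothetical.

```python
import re


def parse_ss_summary(text):
    """Extract ss.* counters from the 'TCP:' line of `ss -s` output."""
    for line in text.splitlines():
        if line.startswith("TCP:"):
            break
    else:
        raise ValueError("no 'TCP:' line found")
    inner = re.search(r"\((.*)\)", line)
    if inner is None:
        raise ValueError("unexpected 'TCP:' line layout")
    metrics = {}
    # Each entry looks like "estab 3" or, for timewait, "timewait 7/5".
    for name, count, slab in re.findall(r"(\w+) (\d+)(?:/(\d+))?", inner.group(1)):
        metrics["ss." + name] = int(count)
        if name == "timewait" and slab:
            metrics["ss.slabinfo.timewait"] = int(slab)
    return metrics
```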
MaGe Linux Operations
