Essential Linux Monitoring Metrics for Open‑Falcon: A Complete Guide
This article enumerates the core Linux system metrics collected by the Open‑Falcon agent—including CPU, disk, memory, network, kernel, and process statistics—explaining how each metric is derived from /proc or other system tools and why it matters for reliable operations monitoring.
1. Basic Linux Monitoring Items
Effective operations rely on a robust monitoring system that captures as many relevant metrics as possible; the following list reflects practical experience from seasoned engineers.
2. CPU Metrics
cpu.idle – Percentage of time the CPU(s) were idle without outstanding disk I/O.
cpu.busy – 100 minus cpu.idle.
cpu.guest – Percentage of time spent running a virtual processor.
cpu.iowait – Percentage of time the CPU(s) were idle while the system had an outstanding disk I/O request.
cpu.irq – Percentage of time servicing hardware interrupts.
cpu.softirq – Percentage of time servicing software interrupts.
cpu.nice – Percentage of CPU utilization at user level with nice priority.
cpu.steal – Percentage of involuntary wait time for virtual CPUs.
cpu.system – Percentage of CPU utilization at the kernel level.
cpu.user – Percentage of CPU utilization at the application level.
cpu.cnt – Number of CPU cores.
cpu.switches – Number of context switches (counter).
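As a rough illustration of how the percentage metrics above can be derived, the following Python sketch compares two samples of the aggregate cpu line in /proc/stat. It is not the agent's own code (the agent is written in Go); the function names, the sample values, and the decision to leave guest out of the total are assumptions made for this example.

```python
# Order of the first nine counters on the aggregate "cpu" line of /proc/stat.
FIELDS = ("user", "nice", "system", "idle", "iowait",
          "irq", "softirq", "steal", "guest")

def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line into a dict of cumulative jiffies."""
    parts = line.split()
    if parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(v) for v in parts[1:1 + len(FIELDS)])))

def cpu_percentages(prev, curr):
    """Return cpu.* percentages for the interval between two samples."""
    deltas = {k: curr[k] - prev[k] for k in FIELDS}
    # Assumption: Linux already counts guest time inside user time,
    # so guest is excluded from the total to avoid counting it twice.
    total = sum(v for k, v in deltas.items() if k != "guest")
    if total <= 0:
        return {}
    pct = {"cpu." + k: 100.0 * v / total for k, v in deltas.items()}
    pct["cpu.busy"] = 100.0 - pct["cpu.idle"]
    return pct

# Hypothetical samples taken some interval apart.
prev = parse_cpu_line("cpu  100 0 50 800 50 0 0 0 0 0")
curr = parse_cpu_line("cpu  150 0 70 1160 70 0 0 0 0 0")
pct = cpu_percentages(prev, curr)
print(round(pct["cpu.idle"], 2), round(pct["cpu.busy"], 2))
```

Because the counters are cumulative since boot, only the difference between two samples is meaningful; a single read of /proc/stat gives the average since boot rather than current utilization.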
3. Disk Metrics
Metrics are derived by reading /proc/mounts for mount points and using syscall.Statfs_t to obtain block and inode usage; each metric includes tags such as mount=$mount and fstype=$fstype.
df.bytes.free – Free disk space (int64).
df.bytes.free.percent – Free space as a percentage (float64).
df.bytes.total – Total disk size (int64).
df.bytes.used – Used disk space (int64).
df.bytes.used.percent – Used space as a percentage (float64).
df.inodes.total – Total inode count (int64).
df.inodes.free – Free inode count (int64).
df.inodes.free.percent – Free inode percentage (float64).
df.inodes.used – Used inode count (int64).
df.inodes.used.percent – Used inode percentage (float64).
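The block and inode arithmetic behind these metrics can be sketched in Python with the fields returned by os.statvfs, which mirror the Statfs_t structure mentioned above. This is an illustration only: the function name df_metrics, the sample values, and the df-style percentage (used divided by used plus space available to unprivileged users) are assumptions, not confirmed agent behavior.

```python
from types import SimpleNamespace

def df_metrics(st):
    """Derive df.* style values from an os.statvfs()-like result."""
    used_blocks = st.f_blocks - st.f_bfree
    # Assumption: percentages follow the df convention and ignore
    # blocks reserved for root (f_bfree minus f_bavail).
    usable_blocks = used_blocks + st.f_bavail
    used_pct = 100.0 * used_blocks / usable_blocks if usable_blocks else 0.0
    used_inodes = st.f_files - st.f_ffree
    inode_used_pct = 100.0 * used_inodes / st.f_files if st.f_files else 0.0
    return {
        "df.bytes.total": st.f_frsize * st.f_blocks,
        "df.bytes.used": st.f_frsize * used_blocks,
        "df.bytes.free": st.f_frsize * st.f_bavail,
        "df.bytes.used.percent": used_pct,
        "df.bytes.free.percent": 100.0 - used_pct,
        "df.inodes.total": st.f_files,
        "df.inodes.used": used_inodes,
        "df.inodes.free": st.f_ffree,
        "df.inodes.used.percent": inode_used_pct,
        "df.inodes.free.percent": 100.0 - inode_used_pct,
    }

# Hypothetical filesystem: 1000 blocks of 4 KiB, 400 free, 300 available.
sample = SimpleNamespace(f_frsize=4096, f_blocks=1000, f_bfree=400,
                         f_bavail=300, f_files=200, f_ffree=150)
m = df_metrics(sample)
print(m["df.bytes.total"], round(m["df.bytes.used.percent"], 2))
```

On a live Linux host the same function could be called as df_metrics(os.statvfs("/")) after importing os, once per mount point read from /proc/mounts.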
4. megacli RAID Metrics
Metrics obtained via the megacli tool include tags like PD=Enclosure_ID:SLOT_ID or VD=0 to identify physical or virtual disks.
sys.disk.lsiraid.pd.Media_Error_Count – Indicates increased risk of disk failure.
sys.disk.lsiraid.pd.Other_Error_Count
sys.disk.lsiraid.pd.Predictive_Failure_Count
sys.disk.lsiraid.pd.Drive_Temperature
sys.disk.lsiraid.pd.Firmware_state – Non‑zero value signals a problem.
sys.disk.lsiraid.vd.cache_policy – Non‑zero value indicates cache policy mismatch.
sys.disk.lsiraid.vd.state – Non‑zero value signals a problem with the logical disk.
5. SMART Disk Metrics
Collected with smartctl; each metric is tagged with the device name (e.g., device=/dev/sda).
sys.disk.smart.Reallocated_Sector_Ct
sys.disk.smart.Spin_Retry_Count
sys.disk.smart.Reallocated_Event_Count
sys.disk.smart.Current_Pending_Sector
sys.disk.smart.Offline_Uncorrectable
sys.disk.smart.Temperature_Celsius
6. Partition Read/Write Monitoring
sys.disk.rw – Non‑zero value indicates read/write issues on the partition (tagged with mount=$mount).
7. IO Metrics
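Collected every second from /proc/diskstats. Most of the raw fields are cumulative counters (ios_in_progress is an instantaneous value), and the iostat-style metrics at the end of the list are computed from the change between consecutive samples.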
disk.io.ios_in_progress – Number of I/O requests currently in flight.
disk.io.msec_read – Total milliseconds spent on reads.
disk.io.msec_total – Time during which ios_in_progress >= 1.
disk.io.msec_weighted_total – Weighted I/O time.
disk.io.msec_write – Total milliseconds spent on writes.
disk.io.read_merged – Number of merged read requests.
disk.io.read_requests – Total successful reads.
disk.io.read_sectors – Total sectors read.
disk.io.write_merged – Number of merged write requests.
disk.io.write_requests – Total successful writes.
disk.io.write_sectors – Total sectors written.
disk.io.read_bytes – Bytes read.
disk.io.write_bytes – Bytes written.
disk.io.avgrq_sz – Average request size (as shown by iostat -x 1).
disk.io.avgqu-sz – Average queue length.
disk.io.await – Average wait time.
disk.io.svctm – Service time.
disk.io.util – Utilization percentage (e.g., 56.43%).
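To make the relationship between the raw /proc/diskstats fields and the iostat-style values more concrete, here is a hedged Python sketch that parses two samples of one device line and derives the values over the interval. The function names, the sample numbers, and the exact formulas (which follow the classic iostat definitions) are assumptions for illustration and may differ from what the agent computes.

```python
# Names for the 11 counters that follow "major minor device" on each line.
NAMES = ("read_requests", "read_merged", "read_sectors", "msec_read",
         "write_requests", "write_merged", "write_sectors", "msec_write",
         "ios_in_progress", "msec_total", "msec_weighted_total")

def parse_diskstats_line(line):
    """Return (device, counters) for one line of /proc/diskstats."""
    parts = line.split()
    values = (int(v) for v in parts[3:3 + len(NAMES)])
    return parts[2], dict(zip(NAMES, values))

def derived_io(prev, curr, interval_ms):
    """Compute iostat-like values from two samples interval_ms apart."""
    d = {k: curr[k] - prev[k] for k in NAMES if k != "ios_in_progress"}
    ios = d["read_requests"] + d["write_requests"]
    sectors = d["read_sectors"] + d["write_sectors"]
    return {
        # /proc/diskstats always counts sectors in 512-byte units.
        "disk.io.read_bytes": d["read_sectors"] * 512,
        "disk.io.write_bytes": d["write_sectors"] * 512,
        "disk.io.avgrq_sz": sectors / ios if ios else 0.0,
        "disk.io.avgqu-sz": d["msec_weighted_total"] / interval_ms,
        "disk.io.await": (d["msec_read"] + d["msec_write"]) / ios if ios else 0.0,
        "disk.io.svctm": d["msec_total"] / ios if ios else 0.0,
        "disk.io.util": min(100.0, 100.0 * d["msec_total"] / interval_ms),
    }

# Hypothetical samples taken one second apart.
dev, prev = parse_diskstats_line(
    "   8       0 sda 1000 10 80000 500 2000 20 160000 1500 0 1200 2000")
_, curr = parse_diskstats_line(
    "   8       0 sda 1100 12 88000 560 2100 25 164000 1640 2 1350 2200")
io = derived_io(prev, curr, interval_ms=1000)
print(dev, io["disk.io.await"], io["disk.io.util"])
```

In this sketch read_bytes and write_bytes are per-interval amounts rather than running totals, which is one possible reading of the metric names.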
8. Load Average Metrics
load.1min – 1‑minute load average.
load.5min – 5‑minute load average.
load.15min – 15‑minute load average.
9. Memory Metrics
Derived from /proc/meminfo. Note that mem.memfree is not the raw MemFree field: it equals MemFree + Buffers + Cached.
mem.memtotal – Total memory.
mem.memused – Used memory.
mem.memused.percent – Used memory percentage.
mem.memfree – Free memory.
mem.memfree.percent – Free memory percentage.
mem.swaptotal – Total swap.
mem.swapused – Used swap.
mem.swapused.percent – Used swap percentage.
mem.swapfree – Free swap.
mem.swapfree.percent – Free swap percentage.
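The memory calculation described above (free memory counted as MemFree + Buffers + Cached) can be sketched in Python as follows. The function names and sample values are hypothetical, and reporting the results in bytes is an assumption of this example.

```python
def parse_meminfo(text):
    """Map /proc/meminfo field names to values, converting kB to bytes."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        fields = rest.split()
        if not fields:
            continue
        value = int(fields[0])
        # Lines without a "kB" unit (e.g. HugePages_Total) are plain counts.
        if len(fields) > 1 and fields[1] == "kB":
            value *= 1024
        info[name.strip()] = value
    return info

def mem_metrics(info):
    """Compute mem.* values using free = MemFree + Buffers + Cached."""
    total = info["MemTotal"]
    free = info["MemFree"] + info["Buffers"] + info["Cached"]
    swap_total, swap_free = info["SwapTotal"], info["SwapFree"]
    swap_used = swap_total - swap_free

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "mem.memtotal": total,
        "mem.memfree": free,
        "mem.memused": total - free,
        "mem.memfree.percent": pct(free, total),
        "mem.memused.percent": pct(total - free, total),
        "mem.swaptotal": swap_total,
        "mem.swapfree": swap_free,
        "mem.swapused": swap_used,
        "mem.swapfree.percent": pct(swap_free, swap_total),
        "mem.swapused.percent": pct(swap_used, swap_total),
    }

# Hypothetical excerpt of /proc/meminfo.
SAMPLE = """MemTotal:        8000000 kB
MemFree:         1000000 kB
Buffers:          500000 kB
Cached:          2500000 kB
SwapTotal:       2000000 kB
SwapFree:        1500000 kB
"""
mem = mem_metrics(parse_meminfo(SAMPLE))
print(mem["mem.memfree.percent"], mem["mem.swapused.percent"])
```

Counting buffers and page cache as free reflects that the kernel can usually reclaim that memory under pressure, which is why mem.memfree is typically much larger than the raw MemFree value.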
10. Network Metrics
Collected from /proc/net/dev; each metric is tagged with iface=$iface (e.g., eth0). Metrics with “in” refer to inbound traffic, “out” to outbound, and “total” to the sum.
net.if.in.bytes, net.if.in.compressed, net.if.in.dropped, net.if.in.errors, net.if.in.fifo.errs, net.if.in.frame.errs, net.if.in.multicast, net.if.in.packets
net.if.out.bytes, net.if.out.carrier.errs, net.if.out.collisions, net.if.out.compressed, net.if.out.dropped, net.if.out.errors, net.if.out.fifo.errs, net.if.out.packets
net.if.total.bytes, net.if.total.dropped, net.if.total.errors, net.if.total.packets
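The mapping from the columns of /proc/net/dev to the metric names above can be sketched like this in Python. The column order matches the standard layout of that file; the function name and the sample counters are made up for illustration, and the values are left as raw cumulative counters.

```python
# Column order of the receive and transmit halves of /proc/net/dev.
IN_FIELDS = ("bytes", "packets", "errors", "dropped",
             "fifo.errs", "frame.errs", "compressed", "multicast")
OUT_FIELDS = ("bytes", "packets", "errors", "dropped",
              "fifo.errs", "collisions", "carrier.errs", "compressed")

def parse_net_dev(text):
    """Return {iface: {metric: counter}} from /proc/net/dev content."""
    result = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        iface, _, rest = line.partition(":")
        nums = [int(v) for v in rest.split()]
        if len(nums) < 16:
            continue
        metrics = {}
        for name, value in zip(IN_FIELDS, nums[:8]):
            metrics["net.if.in." + name] = value
        for name, value in zip(OUT_FIELDS, nums[8:16]):
            metrics["net.if.out." + name] = value
        # "total" metrics are the sum of the inbound and outbound counters.
        for name in ("bytes", "packets", "errors", "dropped"):
            metrics["net.if.total." + name] = (
                metrics["net.if.in." + name] + metrics["net.if.out." + name])
        result[iface.strip()] = metrics
    return result

# Hypothetical file content with a single interface.
SAMPLE = """Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000 10 1 2 0 0 0 3 2000 20 4 5 0 6 7 0
"""
net = parse_net_dev(SAMPLE)
print(net["eth0"]["net.if.total.bytes"])
```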
11. Port Monitoring
Uses ss -ln to determine if a port is listening (1) or not (0); tagged with port=$port.
net.port.listen
12. Kernel Configuration
kernel.maxfiles – Value from /proc/sys/fs/file-max.
kernel.files.allocated – First field of /proc/sys/fs/file-nr.
kernel.files.left – Calculated as kernel.maxfiles - kernel.files.allocated.
kernel.maxproc – Value from /proc/sys/kernel/pid_max.
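The file-handle arithmetic above is simple enough to show in a few lines of Python. The function takes the raw text of the two /proc files as arguments so it can be tried without a live system; its name and the sample values are assumptions of this sketch.

```python
def file_handle_metrics(file_max_text, file_nr_text):
    """Compute kernel.* file-handle metrics from the raw /proc file text."""
    maxfiles = int(file_max_text.strip())
    # /proc/sys/fs/file-nr holds three numbers; the first is the
    # count of allocated file handles.
    allocated = int(file_nr_text.split()[0])
    return {
        "kernel.maxfiles": maxfiles,
        "kernel.files.allocated": allocated,
        "kernel.files.left": maxfiles - allocated,
    }

# Hypothetical contents of /proc/sys/fs/file-max and /proc/sys/fs/file-nr.
k = file_handle_metrics("1000000\n", "2048\t0\t1000000\n")
print(k["kernel.files.left"])
```

A shrinking kernel.files.left is the value to watch, since processes start failing to open files or sockets once the system-wide limit is reached.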
13. NTP Offset
Obtained with ntpq -pn.
sys.ntp.offset – Offset of the local clock in milliseconds; an unusually large offset, or an offset of exactly zero, may indicate a problem.
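One plausible way to extract the offset from ntpq -pn output is to read the offset column of the peer marked with an asterisk, which ntpq uses to flag the currently selected time source. The Python sketch below does that on captured text; whether the agent picks the same row is an assumption, and the function name and sample output are hypothetical.

```python
def ntp_offset_ms(ntpq_output):
    """Return the offset (ms) of the selected peer, or None if none is selected."""
    for line in ntpq_output.splitlines():
        if not line.startswith("*"):
            continue  # assumption: only the selected (system) peer is used
        fields = line.split()
        # Columns: remote refid st t when poll reach delay offset jitter
        if len(fields) >= 10:
            return float(fields[8])
    return None

# Hypothetical output of "ntpq -pn".
SAMPLE = """     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
*10.0.0.1        .GPS.            1 u   33   64  377    0.512   -1.234   0.045
+10.0.0.2        10.0.0.1         2 u   12   64  377    0.800    0.321   0.100
"""
offset = ntp_offset_ms(SAMPLE)
print(offset)
```

Returning None when no peer is selected keeps "not synchronized" distinguishable from a genuine zero offset.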
14. Process Count Monitoring
proc.num – Counts processes either by name (e.g., name=sshd) or by full command line (e.g., cmdline=./falcon_agent-c./cfg.ini).
15. Process Resource Metrics
process.cpu.all – CPU time (sys + user) for a process and its children, in jiffies.
process.cpu.sys – System CPU time for a process and its children, in jiffies.
process.cpu.user – User CPU time for a process and its children, in jiffies.
process.swap – Swap usage for a process and its children, in pages.
process.fd – Number of file descriptors used.
process.mem – Memory usage of the process, in bytes.
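For the CPU time metrics, a hedged Python sketch of reading jiffies from the content of /proc/<pid>/stat is shown below. It assumes that "a process and its children" corresponds to adding the cutime and cstime fields (time of waited-for children) to utime and stime; that interpretation, the function names, and the sample line are assumptions of this example.

```python
def process_cpu_jiffies(stat_text):
    """Extract process.cpu.* values (in jiffies) from /proc/<pid>/stat content."""
    # The command name is wrapped in parentheses and may contain spaces,
    # so split only what follows the last closing parenthesis.
    rest = stat_text[stat_text.rindex(")") + 1:].split()
    # rest[0] is field 3 (state); utime, stime, cutime, cstime are fields 14-17.
    utime, stime, cutime, cstime = (int(v) for v in rest[11:15])
    user = utime + cutime
    system = stime + cstime
    return {
        "process.cpu.user": user,
        "process.cpu.sys": system,
        "process.cpu.all": user + system,
    }

def jiffies_to_seconds(jiffies, ticks_per_second):
    """Convert jiffies to seconds given the clock tick rate."""
    return jiffies / ticks_per_second

# Hypothetical stat line for a process whose name contains a space.
SAMPLE = ("1234 (my proc) S 1 1234 1234 0 -1 4194304 100 200 0 0 "
          "300 150 20 10 20 0 1 0 5000 1000000 250")
cpu = process_cpu_jiffies(SAMPLE)
print(cpu["process.cpu.all"], jiffies_to_seconds(cpu["process.cpu.all"], 100))
```

On a real host the tick rate can be obtained with os.sysconf("SC_CLK_TCK"), which is commonly 100 on Linux.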
16. ss Command Output Metrics
ss.orphaned
ss.closed
ss.timewait
ss.slabinfo.timewait
ss.synrecv
ss.estab
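These names line up with the TCP summary line printed by ss -s on older iproute2 releases, where timewait is shown as two numbers separated by a slash. Assuming that is where the values come from (the text above does not say so explicitly), a parsing sketch in Python could look like this; the function name and sample output are hypothetical, and newer ss versions may omit some of the fields.

```python
import re

def parse_ss_summary(text):
    """Extract ss.* counters from the 'TCP:' line of 'ss -s' style output."""
    for line in text.splitlines():
        if not line.startswith("TCP:"):
            continue
        inside = re.search(r"\((.*?)\)", line)
        if not inside:
            return {}
        metrics = {}
        for item in inside.group(1).split(","):
            name, value = item.split()
            if name == "timewait" and "/" in value:
                # Assumption: the number after the slash is the slab count.
                tw, slab = value.split("/")
                metrics["ss.timewait"] = int(tw)
                metrics["ss.slabinfo.timewait"] = int(slab)
            else:
                metrics["ss." + name] = int(value)
        return metrics
    return {}

# Hypothetical excerpt of "ss -s" output in the older format.
SAMPLE = """Total: 180 (kernel 0)
TCP:   25 (estab 10, closed 5, orphaned 1, synrecv 0, timewait 4/0), ports 0
"""
ss = parse_ss_summary(SAMPLE)
print(ss["ss.estab"], ss["ss.timewait"])
```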