Operations 24 min read

Essential 100‑Item IT Operations Checklist for Servers, Networks, Storage and More

A comprehensive, step‑by‑step checklist of 100 critical daily maintenance items covering server hardware, network devices, storage systems, operating systems, databases, virtualization, backup, security, data‑center infrastructure and documentation to help IT operations teams ensure reliability and business continuity.

dbaplus Community
dbaplus Community
dbaplus Community
Essential 100‑Item IT Operations Checklist for Servers, Networks, Storage and More

1. Server hardware maintenance (15 items)

Visually inspect chassis for physical damage, deformation, and indicator‑light status (power, disks, fans, fault LEDs).

Verify that all redundant power modules show a green LED, power cables are firmly seated, undamaged, and PDU LEDs, voltage and current readings are within normal ranges.

Confirm fan operation, noise level and clean fan grills; monitor inlet and outlet temperatures via management controller or sensor.

Check CPU utilization top / htop (Linux) or Task Manager (Windows) and ensure sustained load stays below 80 %.

Validate that reported memory matches the physical configuration; review memory usage and swap; query ECC errors with dmidecode or vendor tools.

Inspect disk health through RAID controller utilities ( MegaCLI, storcli, hpssacli) and OS tools; confirm RAID level is optimal and no disks are degraded or failed.

Verify status of PCIe devices (HBAs, NICs, GPUs) in the OS and device manager.

Test out‑of‑band management interfaces (iLO, iDRAC, iBMC) for network connectivity and successful login.

Periodically record BIOS/UEFI, BMC, RAID card and NIC firmware versions; schedule upgrades only when required.

Ensure all data cables are securely connected, clearly labeled and not excessively bent.

Review hardware event logs ( dmesg, journalctl, Windows Event Viewer) for critical or warning entries.

Maintain an inventory of spare parts (PSUs, fans, disks) and verify availability weekly or monthly.

Cross‑check asset records (location, tags, CPU, memory, disks) against the CMDB.

Keep server surfaces clean and free of dust.

Confirm all chassis screws are present and tightened.

2. Network device maintenance (15 items)

Visually inspect switches, routers and firewalls for damage and indicator‑light status.

Check redundant power modules, power‑cable connections and PDU status.

Verify fan operation and acceptable noise levels.

Monitor CPU and memory utilization via CLI or web UI; keep usage below 70 % during peak periods.

Inspect each port for up/up state, correct speed/duplex and error counters (input, output, CRC, giants, runts).

Confirm link‑aggregation groups are up and member ports are consistent.

Validate STP/RSTP/MSTP root bridge placement and port roles; watch for unexpected topology changes.

Review routing‑protocol neighbor status (BGP, OSPF, EIGRP) and ensure routing tables have converged.

Audit ACLs, policy routes and QoS policies for correct application.

Test out‑of‑band management ports and console access for connectivity and login.

Verify that running and startup configuration files are backed up after any change.

Search system logs for hardware errors, link changes or protocol flaps.

Check OS/firmware versions (IOS, NX‑OS, Junos, EOS, VRP) and plan upgrades.

Inspect physical cabling and labeling of copper and fiber connections.

Confirm rack mounting is secure, cables are tidy and airflow is adequate.

3. Storage system maintenance (14 items)

Check controller status – all controllers should be Online with no Failed or Degraded flags.

Verify redundant power supplies and fan modules are healthy.

Inspect enclosure and each disk for Online state and any Predictive Failure warnings.

Validate storage pool/LUN/volume health – status Normal / Online and no degraded components.

Confirm RAID groups are Optimal ; if a rebuild is in progress, monitor its progress and impact on performance.

Check cache subsystem status, battery or FC capacitor health (OK, Charged) and ensure no alerts.

Verify front‑end ports (FC, iSCSI, NFS, CIFS) are Online and error‑free.

Inspect back‑end SAS/FC connections to disk enclosures.

Monitor performance metrics – IOPS, throughput (MB/s) and latency (ms) – against established baselines.

Check snapshot and replication health; ensure no failed or pending jobs.

Assess capacity usage; trigger capacity expansion when utilization exceeds 80 %.

Review management interfaces and system logs for warnings or hardware events.

Periodically audit firmware versions of controllers, disks and enclosures; plan upgrades.

Inspect the physical environment – temperature, cable organization and labeling.

4. Operating system maintenance (15 items)

Check overall system load (e.g., uptime, w) and verify that critical services are running ( systemctl status on Linux, Service Manager on Windows).

Monitor CPU usage with top, htop, vmstat, mpstat (Linux) or Task Manager/Performance Monitor (Windows); look for sustained >80 % usage.

Review memory consumption, swap usage and buffer/cache statistics ( free -m, vmstat on Linux; Resource Monitor on Windows).

Inspect disk usage on all mount points ( df -h or Resource Monitor) and identify large or rapidly growing files ( du -sh *, ncdu).

Track disk I/O performance with iostat -dx 1 (Linux) or Performance Monitor (Windows) to spot bottlenecks.

Verify network interface status and IP configuration ( ip addr / ifconfig on Linux, ipconfig on Windows).

Monitor network traffic ( iftop, nload, vnstat on Linux; Resource Monitor on Windows) and TCP connection states ( netstat -anp, ss).

Audit recent user logins ( who, w on Linux; query user on Windows) and review login history ( last, Windows Event Viewer).

Detect abnormal accounts or privilege‑escalation events.

Review scheduled tasks ( crontab -l, /etc/cron* on Linux; Task Scheduler on Windows) and verify successful execution.

Run file‑system integrity checks ( fsck on Linux, chkdsk on Windows) during maintenance windows.

Check for available package updates ( yum check-update, apt list --upgradable on Linux; Windows Update) and plan patch installation following change‑management procedures.

Ensure NTP service is synchronized ( ntpq -p, timedatectl on Linux; w32tm /query /status on Windows).

Audit security configurations – SSH daemon settings, password policies, firewall rules.

Validate resource consumption of critical processes (databases, middleware, applications) for abnormal CPU, memory or handle usage.

5. Database maintenance (14 items)

Confirm database instance status (Oracle, MySQL, SQL Server) via native commands (e.g., SELECT status FROM v$instance; for Oracle).

Check listener/service availability ( lsnrctl status for Oracle, SHOW PROCESSLIST; for MySQL, SQL Server Configuration Manager).

Monitor tablespace/filegroup usage; ensure sufficient free space ( DBA_FREE_SPACE, information_schema.FILES, sp_helpdb).

Track performance metrics – active sessions, logical/physical reads, cache‑hit rate, lock waits.

Identify slow queries using AWR/ASH (Oracle), slow‑query log (MySQL) or sp_whoisactive (SQL Server).

Verify that recent full and incremental backups succeeded; review backup size and logs.

Perform periodic restore tests to confirm backup integrity.

Review alert and error logs for critical messages.

Audit transaction‑log health and usage to avoid log‑full conditions.

Check job‑scheduler status for backup, statistics‑collection and archiving jobs.

Validate that table and index statistics are up‑to‑date.

Monitor connection/session counts; flag abnormal or idle sessions.

Inspect replication status (MySQL Replication, SQL Server AlwaysOn, Oracle Data Guard) and latency.

Conduct regular security audits of user permissions and audit logs.

6. Virtualization platform maintenance (8 items)

Check cluster health (vCenter, SCVMM, Proxmox VE) for isolated hosts or error states.

Verify ESXi/Hyper‑V/KVM host connectivity and health – CPU, memory, storage, network alerts and patch level.

Ensure all virtual machines are powered on as expected and responsive.

Inspect datastore/LUN/storage‑pool status, capacity usage and performance (latency, IOPS); watch for APD/PDL conditions.

Monitor resource‑pool and host CPU/memory utilization for contention.

Validate HA, FT and DRS features are operational.

Confirm VM backup jobs complete successfully and that backup files are verified.

Check management nodes (vCenter Server, SCVMM, Proxmox) for performance, log health and firmware/driver versions.

7. Backup system maintenance (7 items)

Verify scheduled backup jobs (full, incremental, differential) finish without errors; review job logs for warnings.

Periodically validate backup data integrity and conduct recovery drills.

Monitor backup storage capacity (disk libraries, tape libraries, cloud) to ensure sufficient free space.

If tape is used, inspect tape‑drive health and media condition.

Review backup policies (RPO, RTO, retention) for continued business relevance.

Check that backup agents on all servers/applications are online.

Validate backup software/server health, performance and license validity.

Confirm off‑site replication or cloud‑backup tasks are synchronized.

8. Security devices and policy maintenance (12 items)

Check firewall engine and HA status, interface health and session counts.

Audit activation of critical security policies (ACL, NAT, IPS/IDS, application control).

Review IPS/IDS alerts and firewall deny logs for suspicious activity.

Verify VPN tunnel status and active user connections.

Confirm antivirus definitions are up‑to‑date and scheduled scans run without major alerts.

Examine latest vulnerability‑scan reports; remediate high‑severity findings promptly.

Audit centralized log platform or device logs for security events (failed logins, privilege changes, policy modifications).

Review ACLs on firewalls, routers and servers; remove stale rules.

Check SSL/VPN/HTTPS certificate expiration dates.

Backup security‑device configurations after any change.

Update OS, IPS signatures and virus definitions on security appliances.

Ensure firmware/feature‑library versions are current and schedule updates.

9. Data‑center infrastructure maintenance (10 items)

Continuously monitor temperature (22‑24 °C) and humidity (40‑60 % RH) and alert on threshold breaches.

Inspect UPS input/output voltage, current, frequency, load percentage and battery health (float voltage, internal resistance, estimated runtime).

Check precision air‑conditioning operation – set points, supply/return temperatures and alarm status.

Verify PDU status, branch‑circuit currents, voltages and indicator lights.

Test leak‑detection sensors for functionality.

Inspect fire‑suppression system (gas, smoke/heat detectors) for normal pressure and green‑zone readings.

Test access‑control card/biometric readers, door sensors and audit logs.

Confirm video‑surveillance cameras cover critical zones and recordings are stored correctly.

Ensure all equipment, cables and switches are clearly labeled and accurately documented.

Maintain cleanliness of floor, rack tops and hot/cold aisles; verify rack doors are closed.

10. Documentation and process

Record each checklist execution (date, executor, results, exception handling).

Define clear incident‑reporting, escalation and remediation workflows for any anomalies.

Conduct periodic (quarterly/bi‑annual) reviews of the checklist itself, updating items based on business changes, technology evolution or incident lessons.

Integrate checklist standards and common issue resolutions into the operations knowledge base.

These structured checks provide a repeatable, standardized approach for IT operations teams to maintain system health, detect issues early and ensure service continuity across the entire technology stack.

IT Operationschecklistserver maintenancenetwork maintenancestorage maintenance
dbaplus Community
Written by

dbaplus Community

Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.