100‑Point IT Operations Checklist: From Server Health to Data Center Safety

A comprehensive 100‑item checklist guides IT operations engineers through daily inspections of servers, network gear, storage, operating systems, databases, virtualization, backup, security devices, and data‑center infrastructure, ensuring reliable performance, proactive issue detection, and adherence to best‑practice standards.

ITPUB

Server Hardware Maintenance

Key checks for each server include:

Physical condition : Verify chassis integrity, indicator lights and any visible damage.

Power : Confirm redundant power modules are green, power cables are secure and PDU status is normal.

Fans and cooling : Ensure fans run without abnormal noise, clean fan grills and monitor inlet/outlet temperatures.

CPU status : Use top, htop (Linux) or Task Manager (Windows) and out‑of‑band tools (iLO/iDRAC/IMM) to check utilization and temperature; alert if utilization stays above 80%.

Memory status : Verify reported memory matches the installed hardware (dmidecode -t memory), check usage and swap, and review ECC error counts via EDAC (for example edac-util) or the BMC event log.

Disk status : Check disk LEDs, and use RAID utilities (MegaCLI, storcli, hpssacli) or OS commands to confirm disks are Online and not in a Predictive Failure state.

Backplane and cables : Ensure SAS/SATA/NVMe connections are firm.

PCIe devices : Verify status of HBAs, NICs, GPUs in OS and device manager.

Management interfaces : Test out‑of‑band interfaces (iLO, iDRAC, iBMC) for connectivity and login.

Firmware versions : Review BIOS/UEFI, BMC, RAID and NIC firmware and plan upgrades.

Physical connections : Confirm all cables are firmly attached, labeled and not bent.

Log inspection : Review hardware logs via dmesg, journalctl or vendor tools for Critical, Error or Warning entries.

Spare parts : Verify inventory of critical components (PSUs, fans, disks).

Asset verification : Match location, tags and configuration (CPU, RAM, disks) with CMDB records.

Cleanliness : Ensure server surfaces and surrounding area are free of dust.

Screws and safety : Confirm chassis covers are sealed and all screws are present.
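
The CPU status item above alerts on sustained load rather than on a single spike. A minimal sketch of that distinction follows; the window of five consecutive samples and the sample values are assumptions for illustration, not part of the checklist.

```python
from collections import deque

def sustained_over(samples, threshold=80.0, window=5):
    """Return True if `window` consecutive samples all exceed `threshold`."""
    recent = deque(maxlen=window)
    for value in samples:
        recent.append(value)
        # Only a full window counts; its minimum must be above the threshold.
        if len(recent) == window and min(recent) > threshold:
            return True
    return False

# Hypothetical utilization samples in percent, e.g. one reading per minute.
spiky = [35, 92, 40, 88, 30, 95, 20]
busy = [60, 85, 91, 88, 90, 93, 70]
print(sustained_over(spiky))  # False
print(sustained_over(busy))   # True
```

Short spikes in the first series do not trigger an alert, while five consecutive readings above 80% in the second series do.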

Network Equipment Maintenance

For switches, routers and firewalls:

Physical condition : Inspect chassis, indicator lights and any damage.

Power : Verify redundant power modules, cable connections and PDU status.

Fans and cooling : Check fan operation and noise levels.

CPU & memory utilization : Check via CLI or web UI and confirm usage stays below 70%.

Port status :

All business ports should be up/up with correct speed/duplex.

Identify and investigate err‑disable ports.

Monitor error counters (input errors, output errors, CRC, giants, runts) for trends.

Link aggregation : Verify port‑channel members are up and no member is dropped.

Spanning‑tree : Check the STP/RSTP/MSTP root bridge and port roles (Root, Designated, Alternate/Blocking), and ensure there are no unexpected topology changes.

Routing protocols : Confirm BGP sessions are Established, OSPF adjacencies are Full, EIGRP neighbors are present in the neighbor table, and routing tables are stable.

ACL & policy : Review critical ACLs, policy‑routing and QoS policies.

Management access : Test out‑of‑band management ports and console access.

Configuration & backup : Verify running and startup configs and back them up.

Log inspection : Scan system logs for ERROR, WARNING, CRIT or FAIL entries.
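
Watching error counters "for trends", as the port status item suggests, usually means comparing two snapshots rather than reading absolute values. The sketch below assumes the counters have already been collected into dictionaries; the port names, counter keys and values are hypothetical.

```python
def growing_error_ports(previous, current, counters=("input_errors", "crc")):
    """Return {port: {counter: delta}} for counters that increased between snapshots."""
    flagged = {}
    for port, now in current.items():
        before = previous.get(port, {})
        deltas = {
            name: now.get(name, 0) - before.get(name, 0)
            for name in counters
            if now.get(name, 0) > before.get(name, 0)
        }
        if deltas:
            flagged[port] = deltas
    return flagged

# Hypothetical snapshots taken on two consecutive inspections.
yesterday = {"Gi1/0/1": {"input_errors": 12, "crc": 3},
             "Gi1/0/2": {"input_errors": 0, "crc": 0}}
today = {"Gi1/0/1": {"input_errors": 12, "crc": 3},
         "Gi1/0/2": {"input_errors": 48, "crc": 41}}
print(growing_error_ports(yesterday, today))
# {'Gi1/0/2': {'input_errors': 48, 'crc': 41}}
```

A port with old but stable errors (Gi1/0/1 here) is not flagged; only counters that are still growing are reported.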

Storage System Maintenance

For SAN/NAS arrays and attached disks:

Controller status : All controllers must be Online and not Failed or Degraded.

Power & fans : Verify redundant power supplies and fan modules.

Disk cabinets & disks :

Check cabinet health and link status.

Ensure each physical disk reports Online, Spare or Normal and not Failed / Predictive Failure.

Inspect disk slot LEDs.

Pool/LUN/volume status : Pools and LUNs should be Normal / Online without degradation.

RAID status : RAID groups must be Optimal; monitor any rebuild progress.

Cache status : Verify read/write cache is enabled and the cache battery or flash‑backed module (BBU/FBWC) reports OK or Charged.

Front‑end ports : FC, iSCSI, NFS, CIFS ports should be Online with no error spikes.

Back‑end ports : Check SAS/FC back‑end ports to disk cabinets.

Performance monitoring : Track IOPS, throughput (MB/s) and latency (ms) against baseline.

Snapshot & replication : Verify local snapshots and remote replication are healthy.

Capacity management :

Monitor total, used and free capacity of pools/filesystems.

Alert if utilization exceeds 80%.

Management interface & logs : Ensure connectivity and review system/event logs.
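
The 80% capacity alert in the list above reduces to a simple ratio check. The following sketch assumes pool figures have been exported as (used, total) pairs in the same unit; the pool names and numbers are made up for illustration.

```python
def pools_over_threshold(pools, threshold_pct=80.0):
    """Return (name, utilization %) for pools above the threshold, fullest first."""
    over = []
    for name, (used, total) in pools.items():
        if total <= 0:
            continue  # skip pools that report no capacity
        pct = round(100.0 * used / total, 1)
        if pct > threshold_pct:
            over.append((name, pct))
    return sorted(over, key=lambda item: item[1], reverse=True)

# Hypothetical pools: (used TB, total TB).
pools = {"pool_a": (41.0, 50.0), "pool_b": (12.5, 50.0), "pool_c": (95.0, 100.0)}
print(pools_over_threshold(pools))
# [('pool_c', 95.0), ('pool_a', 82.0)]
```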

Operating System Maintenance

Applicable to Linux and Windows hosts:

System load & status : Use uptime, w (Linux) or Performance Monitor (Windows) to view average load.

Critical services : Verify essential services with systemctl status (Linux) or the Services console, services.msc (Windows).

CPU utilization : Monitor with top, htop, vmstat 1, mpstat -P ALL 1 (Linux) or Task Manager (Windows); a persistently low %idle indicates high load.

Memory usage :

Run free -m or vmstat to see total, used, free and cache.

Check swap usage with free, swapon -s (Linux) or page‑file usage (Windows).

Disk space : Use df -h (Linux) or Resource Monitor/wmic (Windows) to verify mount point usage.

Large files : Identify growth with du -sh * | sort -h (Linux) or WinDirStat (Windows).

Disk I/O : Monitor with iostat -dx 1 (Linux) or Performance Monitor (Windows) for bottlenecks.

Network connectivity & bandwidth :

Check interface status and IP with ip addr, ifconfig (Linux) or ipconfig (Windows).

Monitor traffic with iftop, nload, vnstat (Linux) or third‑party tools (Windows).

Inspect TCP states with netstat -anp, ss (Linux) or netstat -ano (Windows); watch excessive TIME_WAIT or CLOSE_WAIT.

User & login :

List current users with who, w (Linux) or query user (Windows).

Review recent logins with last (Linux) or Event Viewer (Windows).

Detect abnormal or privileged accounts.

Critical process resource usage : Monitor CPU, memory and handle count of database, middleware and application processes.

System logs : Centralize and filter /var/log/messages, /var/log/syslog, dmesg (Linux) or Event Viewer (Windows) for ERROR, WARNING, CRIT entries.

Scheduled tasks : Verify crontab -l, /etc/cron* (Linux) or Task Scheduler (Windows) execution status.

File system health : Run fsck on unmounted file systems (Linux) or chkdsk (Windows) during maintenance windows.

Package & patch management :

Check available updates with yum check-update, apt list --upgradable (Linux) or Windows Update.

Follow change‑management process for testing and applying patches.

Time synchronization : Verify NTP service with ntpq -p, timedatectl (Linux) or w32tm /query /status (Windows).

Security configuration : Audit SSH config (/etc/ssh/sshd_config), password policies and firewall rules (iptables, nftables, firewalld on Linux; Windows Firewall on Windows).

Backup verification : Ensure critical configuration files are backed up and recoverable.
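
For the TCP state check above, counting states by hand gets tedious on busy hosts. This sketch tallies the first column of `ss -tan` output; the sample text is fabricated for illustration, and note that ss prints these states with hyphens (TIME-WAIT, CLOSE-WAIT).

```python
from collections import Counter

def count_tcp_states(ss_output):
    """Count connection states from `ss -tan` output (first column of each row)."""
    states = Counter()
    for line in ss_output.splitlines():
        fields = line.split()
        if not fields or fields[0] == "State":
            continue  # skip blank lines and the header row
        states[fields[0]] += 1
    return states

# Fabricated sample in the layout printed by `ss -tan`.
sample = """\
State      Recv-Q Send-Q Local Address:Port   Peer Address:Port
ESTAB      0      0      10.0.0.5:22          10.0.0.9:51234
TIME-WAIT  0      0      10.0.0.5:443         10.0.0.7:40112
TIME-WAIT  0      0      10.0.0.5:443         10.0.0.8:40230
CLOSE-WAIT 1      0      10.0.0.5:8080        10.0.0.6:39001
"""
states = count_tcp_states(sample)
print(states["TIME-WAIT"], states["CLOSE-WAIT"])  # 2 1
```

On a live Linux host the input could come from `subprocess.run(["ss", "-tan"], capture_output=True, text=True).stdout`. What counts as "excessive" depends on the workload, so any threshold should come from your own baseline.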

Database Maintenance

Applicable to Oracle, MySQL and SQL Server:

Instance status : Query SELECT status FROM v$instance (Oracle), run mysqladmin ping or SHOW DATABASES (MySQL), or query SELECT name, state_desc FROM sys.databases (SQL Server) to confirm the instance is up and its databases are online.

Listener status : Check Oracle listener with lsnrctl status, MySQL process list with SHOW PROCESSLIST, or SQL Server Configuration Manager for connectivity.

Tablespace / filegroup usage : Verify free space using DBA_FREE_SPACE (Oracle), information_schema.FILES (MySQL) or sp_helpdb / sys.database_files (SQL Server).

Performance monitoring : Track sessions, logical/physical reads, cache hit ratio and lock waits; identify slow queries via AWR/ASH (Oracle), slow‑query log (MySQL) or sp_whoisactive (SQL Server).

Backup status : Confirm latest full and incremental backups succeeded, validate backup size and logs, and perform periodic restore drills.

Log files : Review alert logs (alert_*.log for Oracle) or error logs for ORA‑ and Error messages.

Job & scheduler : Verify critical jobs (backup, statistics collection, archiving) run successfully via Oracle Scheduler, MySQL Event Scheduler or SQL Server Agent.

Statistics : Ensure table and index statistics are up‑to‑date.

Connections & sessions : Monitor active connections and flag abnormal or long‑idle sessions.

Replication status : Check master‑slave replication health (MySQL, SQL Server, Oracle DG) and latency.

Security audit : Review user privileges and audit logs for compliance.
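
Flagging long‑idle sessions, as described above, is a filter over whatever session list the database exposes. The sketch below assumes the rows have already been fetched into dictionaries; the field names (`id`, `user`, `idle_seconds`) and the one‑hour limit are hypothetical and would need mapping to your engine's session view.

```python
def long_idle_sessions(sessions, max_idle_seconds=3600):
    """Return (id, user, idle_seconds) for sessions idle longer than the limit, longest first."""
    idle = [
        (s["id"], s["user"], s["idle_seconds"])
        for s in sessions
        if s["idle_seconds"] > max_idle_seconds
    ]
    return sorted(idle, key=lambda row: row[2], reverse=True)

# Hypothetical rows, as they might look after querying a session view.
sessions = [
    {"id": 101, "user": "app", "idle_seconds": 12},
    {"id": 102, "user": "report", "idle_seconds": 7200},
    {"id": 103, "user": "dba", "idle_seconds": 86400},
]
print(long_idle_sessions(sessions))
# [(103, 'dba', 86400), (102, 'report', 7200)]
```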

Virtualization Platform Maintenance

For vCenter/SCVMM/Proxmox clusters and ESXi/Hyper‑V/KVM hosts:

Cluster health : Verify the management cluster is online, with no isolated hosts or errors.

Host health : Check each hypervisor for CPU, memory, storage and network alerts and confirm patch levels.

VM status : Ensure all virtual machines are powered on as expected and no VM is unresponsive.

Datastore/storage health : Monitor datastore/LUN usage, latency and IOPS; watch for APD/PDL conditions.

Network health : Inspect virtual switches, port groups and physical NIC bindings.

Resource pools & utilization : Track cluster and host CPU/memory usage for contention.

HA/FT/DRS : Confirm high‑availability, fault‑tolerance and distributed resource scheduling are operational.

Backup status : Verify VM backup jobs complete successfully and backup files are validated.

Management nodes : Check health and logs of vCenter/SCVMM/Proxmox management servers.

Firmware & drivers : Review HBA and NIC firmware and driver versions on hosts.
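
The VM status item checks that machines are powered on "as expected", which implies comparing the platform's report against an expected inventory. A minimal sketch, assuming both sides are available as name-to-state mappings; the VM names and state strings are illustrative only.

```python
def power_state_mismatches(expected, actual):
    """Return {vm: (expected_state, actual_state)}; VMs absent from `actual` report 'missing'."""
    mismatches = {}
    for vm, want in expected.items():
        got = actual.get(vm, "missing")
        if got != want:
            mismatches[vm] = (want, got)
    return mismatches

# Hypothetical inventory versus what the platform currently reports.
expected = {"web01": "on", "db01": "on", "test01": "off"}
actual = {"web01": "on", "db01": "off"}
print(power_state_mismatches(expected, actual))
# {'db01': ('on', 'off'), 'test01': ('off', 'missing')}
```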

Backup System Maintenance

For enterprise backup solutions (disk, tape or cloud):

Backup job status : Ensure all scheduled full, incremental and differential jobs finish without errors.

Backup data verification :

Run integrity checks if supported.

Perform regular restore drills for critical data to confirm recoverability.

Backup storage capacity : Monitor usage of disk/tape/cloud targets and ensure sufficient free space.

Media health : For tape systems, check drive status, tape condition and library robot health.

Backup strategy review : Periodically assess RPO/RTO and retention policies against business needs.

Backup client status : Verify agents on all servers/applications are online.

Backup software health : Check backup server performance, logs and license validity.

Off‑site backup : Review remote replication or cloud backup status and synchronization.
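
One way to turn the job status check above into something measurable is to compare each job's last successful finish time with the age you can tolerate. The sketch below assumes those timestamps have been exported from the backup software; the job names, times and the 24‑hour limit are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def stale_backups(last_success, now, max_age=timedelta(hours=24)):
    """Return job names whose last success is older than max_age or that never succeeded."""
    stale = []
    for job, finished in last_success.items():
        if finished is None or now - finished > max_age:
            stale.append(job)
    return sorted(stale)

# Hypothetical export; None means no successful run is on record.
now = datetime(2024, 1, 15, 8, 0, tzinfo=timezone.utc)
last_success = {
    "db_full": datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc),
    "files_incr": datetime(2024, 1, 13, 23, 0, tzinfo=timezone.utc),
    "vm_images": None,
}
print(stale_backups(last_success, now))
# ['files_incr', 'vm_images']
```

The maximum age would normally be derived from the RPO agreed for each data set rather than a single global value.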

Security Devices & Policy Maintenance

For firewalls, IPS/IDS, VPN and endpoint protection:

Firewall status : Check engine health, HA state, interface status and session counts.

Security policy status : Ensure ACLs, NAT, IPS/IDS and application control policies are active.

Threat detection & logs :

Review IPS/IDS alerts for recent threats.

Analyze firewall deny logs for abnormal scans or attacks.

VPN status : Verify tunnels are up and monitor user connections.

Antivirus status : Confirm definitions are up‑to‑date and scans run without large‑scale detections.

Vulnerability scan results : Review latest scan reports and track remediation of high‑severity findings.

Log audit : Examine SIEM or device logs for login failures, privilege changes, policy modifications and high‑risk operations.

ACL audit : Periodically clean expired or invalid ACL entries on firewalls, routers and servers.

Certificate status : Check SSL VPN and HTTPS proxy certificates for expiration.

Configuration backup : After changes, back up firewall, IPS and WAF configuration files.

Firmware & signature updates : Keep OS, IPS signatures and virus definitions current.
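
For the certificate status item, the days remaining can be computed from a certificate's notAfter field. This sketch uses the standard library's `ssl.cert_time_to_seconds`, which parses the date format OpenSSL prints; the date string and the fixed "now" are examples, not real certificate data.

```python
import ssl
from datetime import datetime, timezone

def days_until_expiry(not_after, now=None):
    """Whole days until `not_after` (e.g. 'Jan  1 00:00:00 2030 GMT'); negative if expired."""
    now = now or datetime.now(timezone.utc)
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days

# Example notAfter string with a fixed reference time so the result is reproducible.
reference = datetime(2029, 12, 2, tzinfo=timezone.utc)
print(days_until_expiry("Jan  1 00:00:00 2030 GMT", now=reference))  # 30
```

A routine inspection might warn at some lead time before expiry, such as 30 days; that figure is an assumption and should follow your renewal process.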

Data‑Center Infrastructure Maintenance

Physical environment checks:

Temperature & humidity : Monitor continuously and keep readings within thresholds (22‑24 °C, 40‑60 % RH).

UPS status :

Check input/output voltage, current, frequency and load percentage.

Inspect battery voltage, internal resistance and estimated backup time.

Ensure UPS operates in Normal (online) mode.

Precision air‑conditioning : Verify operation, set points, supply/return temperatures, compressor/fan status and alarms.

Power distribution cabinets : Review total input, branch circuit currents, voltages, switch positions and indicator lights.

Leak detection : Confirm the system is functional, probes are correctly placed and no alarms are active.

Fire suppression : Ensure gas‑based system and smoke/heat detectors are normal; pressure gauges should be in the green zone.

Access control : Test card/biometric readers and door sensors, and retrieve access logs.

Video surveillance : Verify camera clarity, coverage of critical zones and proper recording storage.

Physical environment :

Keep floors and rack tops clean, with no dust accumulation.

Keep hot/cold aisles clear of obstructions.

Ensure rack doors are closed.

Labeling : Confirm all equipment, cables and circuits have clear, accurate labels.
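
The temperature and humidity thresholds listed above (22‑24 °C, 40‑60 % RH) can be checked against sensor readings as follows. How the readings are obtained is left out; the function name and sample values are illustrative.

```python
def environment_alerts(temp_c, humidity_pct,
                       temp_range=(22.0, 24.0), rh_range=(40.0, 60.0)):
    """Return a list of messages for readings outside the given inclusive ranges."""
    alerts = []
    if not temp_range[0] <= temp_c <= temp_range[1]:
        alerts.append(f"temperature {temp_c} C outside {temp_range[0]}-{temp_range[1]} C")
    if not rh_range[0] <= humidity_pct <= rh_range[1]:
        alerts.append(f"humidity {humidity_pct} % outside {rh_range[0]}-{rh_range[1]} %")
    return alerts

print(environment_alerts(23.1, 45.0))  # []
print(environment_alerts(26.5, 35.0))
# ['temperature 26.5 C outside 22.0-24.0 C', 'humidity 35.0 % outside 40.0-60.0 %']
```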

Documentation & Process

Checklist execution record : Log date, executor, results and remediation for each run.

Exception handling workflow : Define reporting, response, escalation and resolution steps for any anomalies.

Periodic review : Review the checklist quarterly or semi‑annually, adding, removing or modifying items based on business changes, technology evolution and lessons from incidents.

Knowledge‑base update : Capture standards and common issue resolutions in the operations knowledge base.
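
The execution record described above (date, executor, results and remediation) maps naturally onto a small structured record that can be stored or searched later. A minimal sketch; the class name, field names, status values and sample data are all assumptions.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class ChecklistRun:
    run_date: str                  # ISO date of the inspection
    executor: str                  # who ran the checklist
    results: dict                  # item name -> status, e.g. "ok" or "warning"
    remediation: list = field(default_factory=list)

    def failed_items(self):
        """Return the names of items whose status is anything other than 'ok'."""
        return sorted(item for item, status in self.results.items() if status != "ok")

# Hypothetical run with one item needing follow-up.
run = ChecklistRun(
    run_date="2024-01-15",
    executor="ops-oncall",
    results={"ups_status": "ok", "disk_space": "warning"},
    remediation=["Expand /var on app01"],
)
print(run.failed_items())                         # ['disk_space']
print(json.dumps(asdict(run), sort_keys=True))    # one JSON line per run
```

Writing one JSON line per run keeps the history easy to append to and to review during the periodic checklist review.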
