
Why Does Server IO Spike at 3 AM? Diagnose RAID Battery and Self‑Test Issues

This guide explains why server IO utilization spikes above 60% during early‑morning hours, covering hardware self‑test, RAID battery failures, cache policy misconfigurations, and step‑by‑step commands for MegaRAID and HP servers, plus BIOS adjustments and best‑practice recommendations to prevent performance degradation.

dbaplus Community

Problem Overview

Monitoring alerts showed IO utilization exceeding 60% and application TP99 timeouts. The issue appeared consistently at 3:00 AM on several servers, with a peak of around 70% IO usage.

Investigation Directions

1. Scheduled Tasks – Check whether backup jobs or other cron tasks run at the same time. Large binlog files or automatic cleanup can increase IO, which can be verified via disk‑space monitoring.

2. High Write Concurrency – Heavy write workloads alone usually do not push IO util above 50%. Review general_log or slow_log and the current thread connections (for example, with SHOW PROCESSLIST) to identify offending SQL statements.

3. Hardware Factors – IO util above 50% often points to a hardware issue, especially a failing RAID battery backup unit (BBU). When the BBU is defective and the write cache is disabled, all writes go directly to the disks, causing a sharp IO increase.
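For the first direction, a quick way to see which cron jobs can fire at the problem hour is to filter crontab entries by their hour field. This is a minimal sketch, assuming standard crontab lines on stdin; the function name jobs_at_hour is hypothetical, and it only understands a plain number or `*` in the hour field (not ranges, lists, or steps).

```shell
# Print crontab entries whose hour field is the given hour or "*".
# Reads crontab-format text on stdin; skips comments and blank lines.
jobs_at_hour() {
  awk -v h="$1" '
    /^[[:space:]]*(#|$)/ { next }      # comment or blank line
    $2 == h || $2 == "*" { print }     # field 2 is the hour
  '
}

# Example (run on the affected host):
#   crontab -l | jobs_at_hour 3
#   cat /etc/crontab /etc/cron.d/* 2>/dev/null | jobs_at_hour 3
```

Entries that match `*` run every hour, so they are listed as candidates but are less likely to explain a spike that appears only at 3:00 AM.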

RAID Battery and Cache Policy (MegaRAID)

Most servers use LSI MegaRAID cards. Use MegaCli64 -PDList -aALL | egrep -i 'error|Firmware' to check for disk errors.

Check current cache policy:

/opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -aALL | grep Policy

Typical output:

Default Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, No Write Cache if Bad BBU

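If the BBU is bad, the No Write Cache if Bad BBU setting makes the controller stop caching writes (effectively Write‑Through), so every write waits on the disks and IO utilization rises.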
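One way to spot this fallback is to compare the write mode on the Default and Current policy lines. The sketch below assumes that a controller in fallback reports a different first keyword on the Current line (typically WriteThrough) than on the Default line; the function name check_cache_fallback is hypothetical, and the line format is taken from the output shown above.

```shell
# Compare the write mode in "Default Cache Policy" and "Current Cache Policy".
# Reads MegaCli64 -cfgdsply output on stdin; prints one line per logical drive.
check_cache_fallback() {
  awk -F': *' '
    /Default Cache Policy/ { split($2, d, ","); def = d[1] }
    /Current Cache Policy/ {
      split($2, c, ","); cur = c[1]
      if (cur == def) print "OK: " cur
      else            print "FALLBACK: default " def ", current " cur
    }
  '
}

# Example:
#   /opt/MegaRAID/MegaCli/MegaCli64 -cfgdsply -aALL | check_cache_fallback
```

A FALLBACK line at 3:00 AM that returns to OK later would be consistent with a learn cycle rather than a dead battery.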

To force cache usage even when the BBU is bad:

/opt/MegaRAID/MegaCli/MegaCli64 -LDSetProp CachedBadBBU -Lall -aALL

After the change the policy becomes:

Default Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
Current Cache Policy: WriteBack, ReadAhead, Direct, Write Cache OK if Bad BBU
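Because forcing the cache on carries a data‑loss risk, it is worth having the reverse command ready for when the battery is replaced. The sketch below only prints the commands instead of running them. NoCachedBadBBU is, to the best of my knowledge, the matching -LDSetProp value, but it does not appear in the original article, so confirm it against your MegaCli64 help output first; the function name bad_bbu_cache_cmd and the MEGACLI variable are hypothetical.

```shell
# Path from the article; override by exporting MEGACLI if yours differs.
MEGACLI=${MEGACLI:-/opt/MegaRAID/MegaCli/MegaCli64}

# Print (do not run) the command that turns "write cache on bad BBU" on or off.
# NoCachedBadBBU is an assumption to verify on your controller.
bad_bbu_cache_cmd() {
  case "$1" in
    on)  prop=CachedBadBBU ;;
    off) prop=NoCachedBadBBU ;;
    *)   echo "usage: bad_bbu_cache_cmd on|off" >&2; return 2 ;;
  esac
  echo "$MEGACLI -LDSetProp $prop -Lall -aALL"
}

# Example: review the command, then paste it into a root shell.
#   bad_bbu_cache_cmd off
```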

HP Server Commands (hpssacli)

Check battery status:

hpssacli ctrl all show status

Check cache policy:

hpssacli ctrl all show detail | grep -i Cache

Enable physical disk write cache:

hpssacli ctrl slot=0 modify drivewritecache=enable

Enable logical drive cache (when BBU is present):

hpssacli ctrl slot=0 logicaldrive 1 modify caching=enable

If the BBU is missing, enable write cache anyway:

hpssacli ctrl slot=0 modify nobatterywritecache=enable

Set the cache read/write ratio (here 10% read, 90% write):

hpssacli ctrl slot=0 modify cacheratio=10/90
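To turn the HP status check into something a monitoring job can call, the output can be filtered for any status that is not OK. This is a sketch under the assumption that hpssacli ctrl all show status prints indented lines of the form "<Component> Status: <value>"; the function name hp_status_problems is hypothetical.

```shell
# Print every "... Status: <value>" line whose value is not OK.
# Reads `hpssacli ctrl all show status` output on stdin.
# Exit status is 1 if at least one problem line was found, else 0.
hp_status_problems() {
  awk -F': *' '
    /Status:/ {
      v = $2; sub(/[[:space:]]+$/, "", v)        # trim trailing whitespace
      if (v != "OK") {
        line = $0; sub(/^[[:space:]]+/, "", line)
        print line
        bad = 1
      }
    }
    END { exit bad + 0 }
  '
}

# Example:
#   hpssacli ctrl all show status | hp_status_problems || echo "check controller"
```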

Scenarios

No RAID battery (or battery damaged) – Without the Write Cache OK if Bad BBU mode, the controller falls back to Write‑Through, causing IO utilization to jump to nearly 100% under heavy load.

Battery in charge/discharge cycle – During the BBU learning cycle the controller disables WriteBack to protect data integrity, temporarily reducing performance.

BBU Charging Cycle

The BBU consists of a lithium‑ion cell and control circuitry. Its capacity degrades over time, and the controller periodically runs an Auto‑Learn cycle (≈1 hour) that fully charges, discharges, and re‑charges the battery. If the cycle is interrupted, calibration stops.

Commands to view BBU properties and status:

/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuProperties -aALL
/opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL | grep "Charge"

Typical output shows relative state of charge (e.g., 50 %) and charger status (Recharging).
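The charge value can be extracted and compared with a threshold so that a learn cycle or weak battery raises an alert before the 3:00 AM spike does. This is a minimal sketch, assuming the status output contains a line of the form "Relative State of Charge: 50 %" as described above; the function names and the default threshold of 80 are illustrative, not values from the article.

```shell
# Extract "Relative State of Charge" as an integer percent.
# Reads `MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL` output on stdin.
bbu_charge_pct() {
  sed -n 's/^[[:space:]]*Relative State of Charge:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeed (exit 0) when the first reported charge is at or above the
# threshold given as $1 (default 80, an arbitrary illustrative value).
bbu_charge_ok() {
  pct=$(bbu_charge_pct | head -n 1)
  [ -n "$pct" ] && [ "$pct" -ge "${1:-80}" ]
}

# Example:
#   /opt/MegaRAID/MegaCli/MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL | bbu_charge_ok 80 \
#     || echo "BBU charge low or unreadable"
```

Note that bbu_charge_ok checks only the first adapter in the output; on hosts with several controllers, run it per adapter.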

Supercapacitor vs. Lithium Battery

Some RAID cards use supercapacitors plus flash memory instead of lithium batteries. Supercapacitors last up to five years at 50 °C and do not require regular charge‑discharge cycles, eliminating the performance dip during learning.

Hardware Self‑Test Impact

Beyond battery issues, the hardware self‑test itself can generate high IO. The controller event log, exported to the file 1.log with

/opt/MegaRAID/MegaCli/MegaCli64 -AdpEventLog -GetEvents -aALL -f 1.log

shows patrol reads and consistency checks starting at 3:00 AM, coinciding with the IO spikes.

Because the self‑test runs on all servers simultaneously, any concurrent transactions experience slow SQL and timeouts.
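The exported event log is long, so it helps to reduce it to the self‑test events and their timestamps. The sketch below assumes each event in 1.log carries a "Time:" line followed later by an "Event Description:" line; that layout is an assumption about the dump format, and the function name self_test_events is hypothetical.

```shell
# Print "<time>  <description>" for patrol-read and consistency-check events.
# Reads a MegaCli event-log dump (for example 1.log) on stdin.
self_test_events() {
  awk '
    /^Time:/ { t = $0; sub(/^Time:[[:space:]]*/, "", t) }
    /^Event Description:/ && tolower($0) ~ /patrol read|consistency check/ {
      d = $0; sub(/^Event Description:[[:space:]]*/, "", d)
      print t "  " d
    }
  '
}

# Example:
#   self_test_events < 1.log
```

Comparing the printed timestamps with the IO utilization graph shows whether the spikes line up with the self‑test start times.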

BIOS Configuration Example (DELL / ThinkServer)

Enter BIOS via ILO F1 and change Boot Mode to UEFI Only.

Set Storage OpROM Policy to UEFI Only under Miscellaneous Boot Settings.

Save (F10) and reboot; verify Adapters and UEFI Drivers appear.

Navigate to Controller Management → Advanced Controller Management → Schedule Consistency Check and adjust the schedule.

Apply changes, then revert Boot Mode back to Legacy if required for OS boot.

After applying the new schedule, monitoring showed the early‑morning IO spikes disappeared.
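Since the self‑test ran on every server at the same hour, one option when choosing the new schedule is to stagger the start hour per host. This is an illustrative sketch rather than the procedure used in the original incident: it hashes the hostname with cksum into a maintenance window, and the function name stagger_hour, the default window (01:00, four hours long), and the idea of hashing are all assumptions.

```shell
# Map a hostname to a start hour inside a maintenance window.
# $1 = hostname
# $2 = first hour of the window (default 1, illustrative)
# $3 = window length in hours (default 4, illustrative)
stagger_hour() {
  crc=$(printf '%s' "$1" | cksum | awk '{ print $1 }')
  echo $(( (${2:-1} + $crc % ${3:-4}) % 24 ))
}

# Example: print the hour to enter in the Schedule Consistency Check menu.
#   stagger_hour "$(hostname)"
```

The same hostname always maps to the same hour, so the value can be recomputed later without keeping a separate record.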

Recommendations

Change hardware self‑test frequency from weekly to monthly, aligning with business cycles.

For major sales events (e.g., 618, Double‑11), verify that self‑test does not overlap with peak traffic.

Set RAID cache policy to WriteBack, ReadAdaptive, Direct, Write Cache OK if Bad BBU for best performance, but be aware of the data‑loss risk: if the BBU has failed and power is lost, writes still held in the controller cache are lost.

Do not disable hardware self‑test; it provides early fault detection.

Keep BBU Auto‑Learn enabled to extend battery life.

For MySQL, set innodb_flush_method=O_DIRECT and avoid setting innodb_flush_log_at_trx_commit=0 or sync_binlog=0 unless absolutely necessary, since both settings relax durability.
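The MySQL recommendation can be checked mechanically against a configuration file. The sketch below is a simple line scanner, assuming my.cnf‑style "key = value" lines on stdin; the function name mysql_durability_warnings is hypothetical, and it does not follow !include directives or read the server's runtime values.

```shell
# Flag the settings named in the recommendation above.
# Reads my.cnf-style text on stdin.
mysql_durability_warnings() {
  awk -F'=' '
    /^[[:space:]]*[#;]/ { next }                 # skip comment lines
    {
      k = $1; v = $2
      gsub(/[[:space:]]/, "", k); gsub(/[[:space:]]/, "", v)
      gsub(/-/, "_", k)                          # my.cnf accepts - or _
      if (k == "innodb_flush_log_at_trx_commit" && v == "0") print "WARN: " k "=0"
      if (k == "sync_binlog" && v == "0")                    print "WARN: " k "=0"
      if (k == "innodb_flush_method" && v != "O_DIRECT")     print "NOTE: " k "=" v
    }
  '
}

# Example (the path is an assumption; adjust for your installation):
#   mysql_durability_warnings < /etc/my.cnf
```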

Following these steps mitigates unexpected IO spikes, improves database response times, and reduces the risk of data loss during hardware failures.
