
What Triggered These Real‑World System Crashes? 13 Post‑Mortem Lessons

The article compiles thirteen post‑mortem case studies of severe system outages—from AIX NTP misconfiguration and backup appliance driver issues to PowerHA node ID conflicts and hardware failures—detailing symptoms, root‑cause analysis, and practical remediation steps for each incident.

Efficient Ops

01. AIX NTP misconfiguration caused multiple cluster crashes

A friend reported that three Oracle RAC clusters on AIX machines rebooted simultaneously after a hardware relocation. Investigation revealed that all clusters shared the same NTP server, but one used xntpd while the others used ntpdate via cron. The ntpdate jobs caused large time jumps, which made the cssd process trigger a system reboot. Lesson: Prefer the xntpd service for time synchronization instead of periodic ntpdate calls.
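To make the idea of a "time jump" concrete, here is a minimal Python sketch that estimates how far the wall clock was stepped between two samples by comparing it with a monotonic clock. The function names, the 0.5-second tolerance, and the sample values are illustrative assumptions; this is not how cssd itself detects time changes.

```python
import time

def wall_clock_step(wall_before, mono_before, wall_after, mono_after):
    """Return how many seconds the wall clock moved beyond real elapsed time.

    A value near 0 means the clock only ticked normally; a large positive
    or negative value means it was stepped forward or backward.
    """
    elapsed_real = mono_after - mono_before
    elapsed_wall = wall_after - wall_before
    return elapsed_wall - elapsed_real

def is_step(step_seconds, tolerance=0.5):
    # tolerance is an arbitrary illustrative threshold, not a cssd setting
    return abs(step_seconds) > tolerance

if __name__ == "__main__":
    w0, m0 = time.time(), time.monotonic()
    time.sleep(0.1)
    w1, m1 = time.time(), time.monotonic()
    step = wall_clock_step(w0, m0, w1, m1)
    print(f"wall clock moved {step:+.3f}s beyond real elapsed time; stepped={is_step(step)}")
```

A periodic step correction shows up here as a large offset on the sample that spans the correction, whereas a daemon that adjusts the clock gradually keeps the offset small.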

02. Backup appliance CDP driver caused a crash

During testing of an AIX backup appliance, the CDP driver was left installed after the client was removed. Upon reboot the system failed to start. The vendor confirmed that the CDP driver must be removed before uninstalling the client.

03. LVM mirror expansion error led to data loss

In a dual‑node, dual‑storage HA setup, a filesystem was expanded by adding disks directly to the VG, which left data unevenly distributed across the two storage arrays. When one storage array failed, the system lost data integrity. Lesson: When using LVM mirrors, expand the logical volume first, then the filesystem.
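One way to catch this kind of uneven layout before a failure is to check that every logical partition has copies on both arrays. The sketch below works on a hypothetical, hand-built mapping; the data structure, disk names, and array names are assumptions for illustration, and gathering the real layout from LVM is left out.

```python
def unmirrored_partitions(lp_map, disk_to_array):
    """Return logical partitions whose copies do not span at least two arrays.

    lp_map: dict of logical partition number -> list of disks holding a copy.
    disk_to_array: dict of disk name -> storage array name.
    """
    bad = []
    for lp, disks in sorted(lp_map.items()):
        arrays = {disk_to_array[d] for d in disks}
        if len(arrays) < 2:
            bad.append(lp)
    return bad

if __name__ == "__main__":
    disk_to_array = {"hdisk1": "arrayA", "hdisk2": "arrayB", "hdisk3": "arrayA"}
    lp_map = {
        1: ["hdisk1", "hdisk2"],  # mirrored across both arrays
        2: ["hdisk1", "hdisk3"],  # both copies on arrayA: exposed
        3: ["hdisk3"],            # single copy: exposed
    }
    print("partitions at risk:", unmirrored_partitions(lp_map, disk_to_array))
```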

04. HACMP node‑ID duplication caused cluster halt

Three PowerHA XD clusters shared identical RSCT node UUIDs after an alt_disk_copy without the -B -C -O options. The duplicate IDs caused quorum loss and a complete halt. The fix involved stopping HA services, reinstalling the RSCT node configuration, and rebooting all nodes.
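A simple inventory check can flag cloned nodes before they join a cluster. The following Python sketch finds node IDs that appear on more than one host; the hostnames and ID values are made up, and how the IDs are collected from each node is outside the scope of the example.

```python
from collections import defaultdict

def find_duplicate_node_ids(node_ids):
    """Map each node ID that appears on more than one host to those hosts.

    node_ids: dict of hostname -> node ID string (hypothetical inventory data).
    IDs are compared case-insensitively after stripping whitespace.
    """
    hosts_by_id = defaultdict(list)
    for host, node_id in node_ids.items():
        hosts_by_id[node_id.strip().lower()].append(host)
    return {nid: sorted(hosts) for nid, hosts in hosts_by_id.items() if len(hosts) > 1}

if __name__ == "__main__":
    inventory = {  # made-up hostnames and IDs
        "ha1-node-a": "aaaa1111bbbb2222",
        "ha2-node-a": "AAAA1111BBBB2222",  # cloned image, same ID
        "ha3-node-a": "cccc3333dddd4444",
    }
    for nid, hosts in find_duplicate_node_ids(inventory).items():
        print(f"node ID {nid} is shared by: {', '.join(hosts)}")
```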

05. Power 570/595 crash due to improper CDP driver removal

After uninstalling the backup client but leaving the CDP driver, the Power 595 failed to boot. The vendor required the CDP driver to be removed first.

06. ERP backup triggered HACMP crash

During a backup window, the haemd daemon repeatedly restarted, causing the Oracle database to stop. The issue stemmed from excessive I/O and insufficient filesystem cache, and was mitigated by adjusting the maxpout and minpout I/O pacing parameters.
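The two parameters act as high and low water marks on pending writes. The Python sketch below is a simplified model of that behaviour, not AIX's implementation: a writer is suspended once pending writes reach maxpout and resumes only when they drain to minpout. The event encoding and the numbers are assumptions chosen for illustration.

```python
def simulate_io_pacing(events, maxpout, minpout):
    """Replay write/complete events and report when the writer is suspended.

    events: iterable of "w" (a write is issued) or "c" (a pending write completes).
    Returns a list of (pending_count, suspended) tuples, one per event.
    Writes issued while suspended are ignored, since a suspended writer cannot issue them.
    """
    if not 0 <= minpout < maxpout:
        raise ValueError("expected 0 <= minpout < maxpout")
    pending, suspended, trace = 0, False, []
    for ev in events:
        if ev == "w" and not suspended:
            pending += 1
            if pending >= maxpout:   # high water mark reached
                suspended = True
        elif ev == "c" and pending > 0:
            pending -= 1
            if suspended and pending <= minpout:  # drained to low water mark
                suspended = False
        trace.append((pending, suspended))
    return trace

if __name__ == "__main__":
    for pending, suspended in simulate_io_pacing("wwwwccw", maxpout=3, minpout=1):
        print(f"pending={pending} suspended={suspended}")
```

The gap between the two marks is what keeps a heavy writer, such as a backup job, from monopolising I/O while still letting it resume in batches.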

07. WebLogic memory‑leak crash investigation

Repeated out‑of‑memory errors were traced to non‑heap memory exhaustion. Adjusting PermSize in setDomainEnv.sh had no effect because JAVA_VENDOR was set to N/A. The final fix set a proper JAVA_VENDOR and added explicit memory arguments (-Xms2048m -Xmx2048m -XX:PermSize=1024m).
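Because the setting silently failed to apply in this case, it helps to verify the flags on the running server's command line rather than trusting the script. The Python sketch below parses a command-line string and reports the memory flags it finds; the sample command line is invented, and the sketch assumes that when a flag is repeated the last occurrence takes effect.

```python
import re

_UNITS = {"": 1, "k": 1024, "m": 1024**2, "g": 1024**3}

def parse_size(text):
    """Convert a JVM size such as '2048m' or '1g' to bytes."""
    match = re.fullmatch(r"(\d+)([kKmMgG]?)", text)
    if not match:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _UNITS[match.group(2).lower()]

def effective_memory_args(command_line):
    """Return the last value seen for each memory flag, in bytes."""
    patterns = {
        "Xms": r"-Xms(\S+)",
        "Xmx": r"-Xmx(\S+)",
        "PermSize": r"-XX:PermSize=(\S+)",
        "MaxPermSize": r"-XX:MaxPermSize=(\S+)",
    }
    result = {}
    for token in command_line.split():
        for name, pattern in patterns.items():
            m = re.fullmatch(pattern, token)
            if m:
                result[name] = parse_size(m.group(1))  # later tokens overwrite earlier ones
    return result

if __name__ == "__main__":
    # Invented command line for illustration
    cmd = "java -Xms2048m -Xmx2048m -XX:PermSize=1024m weblogic.Server"
    for flag, size in effective_memory_args(cmd).items():
        print(f"{flag}: {size // 1024**2} MiB")
```

If PermSize is missing from the result, the edit to setDomainEnv.sh did not reach the JVM.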

08. P550/P570 HA crash and data loss

A power failure left both UPS units partially powered, causing both P550 nodes to shut down. After hardware replacement and manual IP aliasing, the HA cluster was restored, though some /orafile data was lost and later recovered from backup.

09. AIX 6100‑06‑06 bug causing kernel panic

The netstat -f unix command triggered a kernel panic due to a file‑lock bug (IV09793). The recommended fix is to apply the bos.mp64 patch or upgrade to level 6100‑06‑12‑1339 (SP12).
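To screen a fleet for hosts below the recommended level, the level strings can be compared field by field. The Python sketch below does this under the assumption that levels are hyphen-separated numeric fields, as in the strings quoted above; the function names are hypothetical, and a level comparison alone does not prove that a particular fix is installed.

```python
def parse_oslevel(level):
    """Split a level string such as '6100-06-12-1339' into integer fields.

    Non-breaking hyphens (common in text copied from web pages) are normalised.
    """
    normalised = level.strip().replace("\u2011", "-")
    return tuple(int(part) for part in normalised.split("-"))

def is_at_least(current, required):
    """True if `current` is at or above `required`.

    Only the fields present in both strings are compared, so a three-field
    level can be checked against a four-field one.
    """
    cur, req = parse_oslevel(current), parse_oslevel(required)
    n = min(len(cur), len(req))
    return cur[:n] >= req[:n]

if __name__ == "__main__":
    required = "6100-06-12-1339"
    for host_level in ("6100-06-06", "6100-06-12-1339", "6100-07-01"):
        status = "ok" if is_at_least(host_level, required) else "below recommended level"
        print(f"{host_level}: {status}")
```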

10. PowerHA node‑ID conflict during IP switch

When all IP networks were lost but a non‑IP network remained, PowerHA 6 dumped core (IV55293). Upgrading the rsct fileset resolved the issue.

11. Power595 crash caused by I/O cabinet power loss

During a routine I/O cabinet power‑swap, an unexpected power drop caused the Power595 to crash. Replacing the I/O DCA resolved the problem.

12. X86 server crash due to faulty optical drive

An IBM X3650 running SUSE 9 hung because a defective CD/DVD drive caused kernel panics. Replacing the drive restored stability.

13. Miscellaneous hardware‑related crashes

Additional incidents include UPS failures, firmware errors, and component replacements that led to temporary outages but were resolved through hardware swaps and firmware updates.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
