What Triggered These Real‑World System Crashes? 13 Post‑Mortem Lessons

This article compiles thirteen post-mortem case studies of severe system outages, ranging from AIX NTP misconfiguration and backup appliance driver issues to PowerHA node ID conflicts and hardware failures. Each case covers the symptoms, the root-cause analysis, and practical remediation steps.

Efficient Ops

Jan 2, 2018

01. AIX NTP misconfiguration caused multiple cluster crashes

A friend reported that three Oracle RAC clusters on AIX machines rebooted simultaneously after a hardware relocation. Investigation showed that all clusters used the same NTP server, but one synchronized through the xntpd daemon while the others ran ntpdate from cron. The ntpdate jobs produced large, abrupt time jumps, which made the Oracle cssd process trigger a system reboot. Lesson: use the xntpd service for time synchronization rather than periodic ntpdate calls.
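One way to audit hosts for the pattern described above is to scan crontab text for active ntpdate entries. The following is a minimal sketch, not a complete audit tool: the function name find_ntpdate_jobs, the sample crontab, and the host names in it are all hypothetical.

```python
import re

# Matches "ntpdate" as a command word, with or without a leading path.
NTPDATE_RE = re.compile(r"(^|[\s/;&|])ntpdate(\s|$)")


def find_ntpdate_jobs(crontab_text):
    """Return the active (non-comment) crontab lines that invoke ntpdate."""
    hits = []
    for raw in crontab_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and commented-out entries
        if NTPDATE_RE.search(line):
            hits.append(line)
    return hits


# Hypothetical crontab content, for illustration only.
sample = """\
# time sync
0 * * * * /usr/sbin/ntpdate ntp.example.com >/dev/null 2>&1
#15 * * * * /usr/sbin/ntpdate old.example.com
30 2 * * * /usr/local/bin/backup.sh
"""

for entry in find_ntpdate_jobs(sample):
    print("periodic ntpdate found:", entry)
```

Any line this reports is a candidate for replacement with a continuously running time daemon, as the lesson recommends.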

02. Backup appliance CDP driver caused a crash

During testing of an AIX backup appliance, the backup client was uninstalled while its CDP driver was left in place. On the next reboot the system failed to start. The vendor confirmed that the CDP driver must be removed before the client is uninstalled.

03. LVM mirror expansion error led to data loss

In a dual-node, dual-storage HA setup, a filesystem was expanded by adding disks directly to the volume group (VG). As a result, data was distributed unevenly across the two storage arrays. When one array failed, the system lost data integrity. Lesson: when using LVM mirrors, expand the logical volume first, then the filesystem.
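The risk in this case can be expressed as a simple check: every logical partition should have a copy on each storage array. The sketch below assumes the partition-to-disk mapping has already been collected by some other means; the function name unprotected_partitions, the disk names, and the array labels are hypothetical.

```python
def unprotected_partitions(lp_copies, pv_to_array):
    """Return logical partitions whose copies all sit on a single array.

    lp_copies:   {logical partition number: [physical volume of each copy]}
    pv_to_array: {physical volume name: storage array name}
    """
    at_risk = []
    for lp, pvs in sorted(lp_copies.items()):
        arrays = {pv_to_array[pv] for pv in pvs}
        if len(arrays) < 2:
            at_risk.append(lp)
    return at_risk


# Hypothetical layout: hdisk1/hdisk3 live on array A, hdisk2/hdisk4 on array B.
pv_to_array = {"hdisk1": "A", "hdisk2": "B", "hdisk3": "A", "hdisk4": "B"}
lp_copies = {
    1: ["hdisk1", "hdisk2"],  # mirrored across A and B: survives an array loss
    2: ["hdisk1", "hdisk3"],  # both copies on A: lost if A fails
    3: ["hdisk4"],            # single copy on B: not mirrored at all
}

print(unprotected_partitions(lp_copies, pv_to_array))
```

Running a check like this after every expansion would flag partitions that a single array failure could take out.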

04. HACMP node‑ID duplication caused cluster halt

Three PowerHA XD clusters shared identical RSCT node UUIDs after an alt_disk_copy without the -B -C -O options. The duplicate IDs caused quorum loss and a complete halt. The fix involved stopping HA services, reinstalling the RSCT node configuration, and rebooting all nodes.
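Duplicate node IDs of this kind are easy to detect once the ID of each host has been gathered. The following is a rough sketch under that assumption; the function name duplicate_node_ids, the host names, and the ID values are hypothetical, and how the IDs are collected is left out.

```python
from collections import defaultdict


def duplicate_node_ids(node_ids):
    """Group hosts by node ID and return only the IDs used by more than one host."""
    hosts_by_id = defaultdict(list)
    for host, node_id in node_ids.items():
        # Normalize so that case or stray whitespace does not hide a match.
        hosts_by_id[node_id.strip().lower()].append(host)
    return {nid: sorted(hosts) for nid, hosts in hosts_by_id.items() if len(hosts) > 1}


# Hypothetical inventory: two hosts were cloned from the same disk image.
inventory = {
    "nodeA": "5f3a9c0e11aa22bb",
    "nodeB": "5F3A9C0E11AA22BB ",
    "nodeC": "0123456789abcdef",
}

for nid, hosts in duplicate_node_ids(inventory).items():
    print(f"node ID {nid} is shared by: {', '.join(hosts)}")
```

A check like this could be run after any disk cloning, before cluster services are started.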

05. Power 570/595 crash due to improper CDP driver removal

After the backup client was uninstalled with the CDP driver still in place, the Power 595 failed to boot, the same failure as in case 02. The vendor required the CDP driver to be removed first.

06. ERP backup triggered HACMP crash

During a backup window, the haemd daemon restarted repeatedly, which caused the Oracle database to stop. The issue stemmed from excessive I/O and insufficient filesystem cache, and was mitigated by tuning the AIX I/O pacing parameters maxpout and minpout.
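On AIX, maxpout and minpout are the high- and low-water marks for I/O pacing: a writer is held once its pending writes reach the high-water mark and released when they drain to the low-water mark. The toy model below only illustrates that hysteresis. It is not how the kernel implements it, and the function name writer_blocked_states and the tiny threshold values are hypothetical.

```python
def writer_blocked_states(events, minpout, maxpout):
    """Replay queue events and report whether the writer is held after each one.

    events: sequence of +1 (write queued) or -1 (write completed).
    """
    pending = 0
    blocked = False
    states = []
    for delta in events:
        pending = max(0, pending + delta)
        if pending >= maxpout:
            blocked = True   # high-water mark reached: hold the writer
        elif pending <= minpout:
            blocked = False  # drained to the low-water mark: release it
        states.append(blocked)
    return states


# Four writes queue up, then three complete (illustrative thresholds only).
print(writer_blocked_states([1, 1, 1, 1, -1, -1, -1], minpout=1, maxpout=4))
```

The gap between the two marks is what keeps a heavy writer, such as a backup job, from monopolizing I/O while avoiding constant stop-and-start behavior.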

07. WebLogic memory‑leak crash investigation

Repeated out-of-memory errors were traced to non-heap memory exhaustion. Adjusting PermSize in setDomainEnv.sh had no effect because JAVA_VENDOR was set to N/A. The final fix was to set a proper JAVA_VENDOR and add explicit memory arguments (-Xms2048m -Xmx2048m -XX:PermSize=1024m).
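The reason a PermSize edit can be silently ignored is easier to see in a simplified model of vendor-keyed configuration. The sketch below is not the logic of setDomainEnv.sh itself: the table VENDOR_MEM_ARGS, its values, and the function resolve_mem_args are hypothetical and only illustrate how an unrecognized vendor string can fall through every branch.

```python
# Hypothetical per-vendor defaults; the real script's values and branches may differ.
VENDOR_MEM_ARGS = {
    "Sun": ["-Xms512m", "-Xmx512m", "-XX:PermSize=256m"],
    "Oracle": ["-Xms512m", "-Xmx512m"],
}


def resolve_mem_args(java_vendor, explicit_args=None):
    """Pick JVM memory arguments the way a vendor-keyed script might."""
    if explicit_args:
        return list(explicit_args)  # explicit settings win regardless of vendor
    # An unrecognized vendor such as "N/A" matches no branch and gets nothing.
    return list(VENDOR_MEM_ARGS.get(java_vendor, []))


print(resolve_mem_args("N/A"))
print(resolve_mem_args("Sun"))
print(resolve_mem_args("N/A", ["-Xms2048m", "-Xmx2048m", "-XX:PermSize=1024m"]))
```

In this model, edits made inside a vendor branch never take effect while the vendor is "N/A", which is consistent with the symptom described in the case.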

08. P550/P570 HA crash and data loss

A power failure left both UPS units only partially powered, which caused both P550 nodes to shut down. After hardware replacement and manual IP aliasing, the HA cluster was restored. Some /orafile data was lost and was later recovered from backup.

09. AIX 6100‑06‑06 bug causing kernel panic

On AIX 6100-06-06, running the netstat -f unix command triggered a kernel panic due to a file-lock bug (APAR IV09793). The recommended fix is to apply the bos.mp64 patch or upgrade to level 6100-06-12-1339 (SP12).
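Checking a fleet against the fixed level is a matter of comparing level strings of the form release, technology level, service pack, and build date. The sketch below assumes the level string has already been read from each host; the function names parse_level and below_fixed_sp are hypothetical, and the comparison is deliberately limited to hosts on the same technology level as the fix.

```python
FIXED_LEVEL = "6100-06-12-1339"  # level named in the text above


def parse_level(level):
    """Split 'RRRR-TL-SP-YYWW' into a tuple of ints for ordering."""
    # Also accept non-breaking hyphens, which appear when copying from web pages.
    parts = level.strip().replace("\u2011", "-").split("-")
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        raise ValueError(f"unexpected level format: {level!r}")
    return tuple(int(p) for p in parts)


def below_fixed_sp(current, fixed=FIXED_LEVEL):
    """True when current is on the same release and TL as fixed but an older SP.

    Hosts on a different release or technology level return False, because
    this sketch makes no claim about whether they contain the fix.
    """
    cur, fix = parse_level(current), parse_level(fixed)
    return cur[:2] == fix[:2] and cur[2] < fix[2]


# The build-date field of the first value is a made-up example.
for level in ("6100-06-06-1140", "6100-06-12-1339"):
    print(level, "needs the update" if below_fixed_sp(level) else "no action from this check")
```

Comparing parsed integers rather than raw strings avoids ordering mistakes if a field is ever written without zero padding.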

10. PowerHA node‑ID conflict during IP switch

When all IP networks were lost but a non‑IP network remained, PowerHA 6 dumped core (IV55293). Upgrading the rsct fileset resolved the issue.

11. Power 595 crash caused by I/O cabinet power loss

During a routine power swap on an I/O cabinet, an unexpected power drop caused the Power 595 to crash. Replacing the I/O DCA resolved the problem.

12. x86 server crash due to faulty optical drive

An IBM x3650 running SUSE 9 hung because a defective CD/DVD drive caused kernel panics. Replacing the drive restored stability.

13. Miscellaneous hardware‑related crashes

Additional incidents include UPS failures, firmware errors, and component replacements that led to temporary outages but were resolved through hardware swaps and firmware updates.

