What Triggered These Real‑World System Crashes? 13 Post‑Mortem Lessons
This article compiles thirteen post‑mortem case studies of severe system outages, ranging from AIX NTP misconfiguration and backup appliance driver issues to PowerHA node ID conflicts and hardware failures. Each case covers the symptoms, the root‑cause analysis, and practical remediation steps.
01. AIX NTP misconfiguration caused multiple cluster crashes
A friend reported that three Oracle RAC clusters on AIX machines rebooted simultaneously after a hardware relocation. Investigation revealed all clusters shared the same NTP server, but one used xntpd while the others used ntpdate via cron. The ntpdate jobs caused large time jumps, which made the cssd process trigger a system reboot. Lesson: Prefer the xntpd service for time synchronization instead of periodic ntpdate calls.
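The risk described above comes from a one-shot sync stepping the clock instead of adjusting it gradually. The following is a minimal sketch of how one might flag nodes whose next cron-driven sync would jump the clock. The function name, the node names, the offsets, and the 0.5-second threshold are all assumptions for illustration; check the ntpdate documentation for your AIX level before relying on any threshold.

```python
def classify_adjustment(offset_seconds, step_threshold=0.5):
    """Return 'step' if a one-shot sync would jump the clock, else 'slew'.

    step_threshold is an illustrative assumption, not a value taken
    from the incident.
    """
    return "step" if abs(offset_seconds) > step_threshold else "slew"


# Hypothetical clock offsets (seconds) measured on three nodes before a sync.
offsets = {"rac1": 0.02, "rac2": -3.7, "rac3": 41.0}

# Nodes whose clocks would be stepped, and so could trip cluster fencing.
at_risk = sorted(node for node, off in offsets.items()
                 if classify_adjustment(off) == "step")
print(at_risk)
```

A check like this could run before a planned relocation or maintenance window, so that large offsets are corrected while the cluster stack is down rather than while it is running.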
02. Backup appliance CDP driver caused a crash
During testing of an AIX backup appliance, the CDP driver was left installed after the client was removed. Upon reboot the system failed to start. The vendor confirmed that the CDP driver must be removed before uninstalling the client.
03. LVM mirror expansion error led to data loss
In a dual‑node, dual‑storage HA setup, expanding a filesystem by adding disks directly to the VG left the data unevenly distributed across the two storage arrays, so the mirror copies no longer sat on separate arrays. When one array failed, the system lost data integrity. Lesson: When using LVM mirrors, expand the logical volume first, then the filesystem.
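A placement check can catch this kind of layout before an array fails. The sketch below assumes you have already collected, for each logical partition, the physical volume holding each mirror copy (for example from `lslv -m` output) and a mapping from physical volume to storage array. The disk names, array labels, and function name are hypothetical.

```python
def copies_sharing_an_array(lp_copies, pv_to_array):
    """Return logical partitions with two or more copies on the same array.

    lp_copies:   {lp_number: [pv_name, ...]}, one PV per mirror copy.
    pv_to_array: {pv_name: array_label}.
    """
    bad = []
    for lp, pvs in sorted(lp_copies.items()):
        arrays = {pv_to_array[pv] for pv in pvs}
        # Fewer distinct arrays than copies means at least two copies collide.
        if len(arrays) < len(pvs):
            bad.append(lp)
    return bad


# Hypothetical layout: hdisk2, hdisk4 and hdisk5 live on array A, hdisk3 on B.
pv_to_array = {"hdisk2": "A", "hdisk3": "B", "hdisk4": "A", "hdisk5": "A"}
lp_copies = {
    1: ["hdisk2", "hdisk3"],  # copies split across A and B
    2: ["hdisk4", "hdisk5"],  # both copies on A
}
print(copies_sharing_an_array(lp_copies, pv_to_array))
```

Any partition reported here would lose all of its copies if that one array went offline, which matches the failure mode in this case.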
04. HACMP node‑ID duplication caused cluster halt
Three PowerHA XD clusters shared identical RSCT node UUIDs after an alt_disk_copy without the -B -C -O options. The duplicate IDs caused quorum loss and a complete halt. The fix involved stopping HA services, reinstalling the RSCT node configuration, and rebooting all nodes.
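Duplicate identifiers of this kind are easy to detect once the IDs from every node are collected in one place. The sketch below assumes the IDs have already been gathered by some means (for example by reading each node's RSCT node ID file). The host names, ID values, and function name are hypothetical.

```python
from collections import defaultdict


def find_duplicate_node_ids(node_ids):
    """Map each node ID that appears on more than one host to those hosts.

    node_ids: {hostname: node_id_string}
    """
    hosts_by_id = defaultdict(list)
    for host, node_id in node_ids.items():
        # Normalise case and whitespace so cosmetic differences do not hide a clash.
        hosts_by_id[node_id.strip().lower()].append(host)
    return {nid: sorted(hosts)
            for nid, hosts in hosts_by_id.items() if len(hosts) > 1}


# Hypothetical IDs: nodeB was cloned from nodeA and kept its identity.
collected = {
    "nodeA": "8f3c2a1b9d4e5f60",
    "nodeB": "8F3C2A1B9D4E5F60\n",
    "nodeC": "1a2b3c4d5e6f7081",
}
print(find_duplicate_node_ids(collected))
```

Running a check like this after cloning a system image, and before starting cluster services, would surface the conflict while it is still cheap to fix.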
05. Power 570/595 crash due to improper CDP driver removal
After uninstalling the backup client but leaving the CDP driver, the Power 595 failed to boot. The vendor required the CDP driver to be removed first.
06. ERP backup triggered HACMP crash
During a backup window, the haemd daemon repeatedly restarted, causing the Oracle database to stop. The issue stemmed from excessive I/O and insufficient filesystem cache, and was mitigated by adjusting the AIX I/O pacing parameters maxpout and minpout, the high and low water marks for pending writes.
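The two pacing parameters act as a pair of water marks with hysteresis. The following is a simplified model, not the AIX implementation: a writer is suspended once its pending writes reach the high mark and resumes only after they drain to the low mark. The function name and the sample values are assumptions chosen for illustration.

```python
def pacing_states(pending_counts, maxpout, minpout):
    """Return, per sample, whether a writer would be suspended.

    Simplified model: suspend at pending >= maxpout, resume at
    pending <= minpout. Illustrative only.
    """
    suspended = False
    states = []
    for pending in pending_counts:
        if not suspended and pending >= maxpout:
            suspended = True   # high water mark reached
        elif suspended and pending <= minpout:
            suspended = False  # drained to low water mark
        states.append(suspended)
    return states


# Hypothetical pending-write samples during a backup burst.
samples = [10, 33, 30, 25, 24, 10]
print(pacing_states(samples, maxpout=33, minpout=24))
```

The gap between the two marks is what keeps a heavy writer such as a backup job from monopolising I/O while avoiding constant suspend and resume cycles.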
07. WebLogic memory‑leak crash investigation
Repeated out‑of‑memory errors were traced to non‑heap memory exhaustion. Adjusting PermSize in setDomainEnv.sh had no effect because JAVA_VENDOR was set to N/A. The final fix set a proper JAVA_VENDOR and added explicit memory arguments (-Xms2048m -Xmx2048m -XX:PermSize=1024m).
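Because the setting was silently ignored in this case, it helps to verify that the memory arguments actually reached the JVM rather than trusting the script. The sketch below parses an option string, such as one copied from the running server's command line, and reports which memory flags are present. The function name and the second sample string are hypothetical.

```python
import shlex


def memory_flags(java_options):
    """Extract -Xms, -Xmx and -XX:PermSize values from a JVM option string."""
    flags = {}
    for token in shlex.split(java_options):
        if token.startswith("-Xms"):
            flags["Xms"] = token[len("-Xms"):]
        elif token.startswith("-Xmx"):
            flags["Xmx"] = token[len("-Xmx"):]
        elif token.startswith("-XX:PermSize="):
            flags["PermSize"] = token.split("=", 1)[1]
    return flags


# The arguments from the final fix.
fixed = memory_flags("-Xms2048m -Xmx2048m -XX:PermSize=1024m")

# Hypothetical option string where the PermSize setting never took effect.
broken = memory_flags("-Xms512m -Xmx512m -Dweblogic.Name=AdminServer")

print(fixed)
print("PermSize" in broken)
```

Comparing the parsed flags with the intended values after each restart makes a skipped configuration branch visible immediately.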
08. P550/P570 HA crash and data loss
A power failure left both UPS units partially powered, causing both P550 nodes to shut down. After hardware replacement and manual IP aliasing, the HA cluster was restored, though some /orafile data was lost and later recovered from backup.
09. AIX 6100‑06‑06 bug causing kernel panic
The netstat -f unix command triggered a kernel panic due to a file‑lock bug (IV09793). The recommended fix is to apply the bos.mp64 patch or upgrade to level 6100‑06‑12‑1339 (SP12).
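A fleet-wide level check can identify which hosts still need the fix. The sketch below assumes each host reports its level in the four-field form printed by `oslevel -s` and compares it with the fixed level named above. The function names and the sample levels are hypothetical, and hosts on other technology levels are deliberately left out of scope because this case says nothing about them.

```python
def parse_level(level):
    """Split a level string such as '6100-06-06-1140' into integer fields."""
    return tuple(int(part) for part in level.strip().split("-"))


def needs_update(level, fixed_level="6100-06-12-1339"):
    """True if level is on the same release and TL but an older service pack.

    Other technology levels return False here; they need a separate check.
    """
    current, fixed = parse_level(level), parse_level(fixed_level)
    same_release_and_tl = current[:2] == fixed[:2]
    return same_release_and_tl and current[2] < fixed[2]


# Hypothetical inventory of reported levels.
inventory = {
    "hostA": "6100-06-06-1140",
    "hostB": "6100-06-12-1339",
    "hostC": "6100-07-01-1141",
}
print(sorted(h for h, lvl in inventory.items() if needs_update(lvl)))
```

Until the affected hosts are patched, avoiding the triggering command on them is the obvious interim precaution.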
10. PowerHA node‑ID conflict during IP switch
When all IP networks were lost but a non‑IP network remained, PowerHA 6 dumped core (IV55293). Upgrading the rsct fileset resolved the issue.
11. Power595 crash caused by I/O cabinet power loss
During a routine I/O cabinet power‑swap, an unexpected power drop caused the Power595 to crash. Replacing the I/O DCA resolved the problem.
12. x86 server crash due to faulty optical drive
An IBM x3650 running SUSE 9 hung because a defective CD/DVD drive caused kernel panics. Replacing the drive restored stability.
13. Miscellaneous hardware‑related crashes
Additional incidents include UPS failures, firmware errors, and component replacements that led to temporary outages but were resolved through hardware swaps and firmware updates.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
