Why Do Oracle RAC Nodes Crash? Uncovering Memory Leak Bugs in ocssd.bin
Four identical Oracle RAC clusters suffered repeated node crashes, each traced to an ocssd.bin failure. The cause was a memory leak in the CRSD/OCSSD processes (Bug 11704113). This article walks through the logs and kernel parameters that were examined, and the patches and configuration changes recommended to prevent recurrence.
Background
A telecom business system in a province was deployed on four identical Oracle RAC (Real Application Clusters) configurations, each with the same host version, CRS version, and database version. The system ran stably for over three years, but starting on 24 April, nodes began crashing one after another.
Observed Failures
xx1db01 – 04‑24
xx2db01 – 07‑30
xx2db02 – 07‑23
xx3db01 – 08‑19
xx3db02 – 07‑27
xx4db01 – (no crash recorded)
xx4db02 – (no crash recorded)
Common Symptom
Each crash was triggered by an ocssd.bin process failure. The log consistently contained the message:
clssscExit: CSSD signal 11 in thread GMClientListener
Signal 11 (SIGSEGV) is a segmentation fault: the process accessed memory it was not allowed to use. In these crashes it is consistent with a failed memory allocation.
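Since the same message recurs across nodes, a small scanner can make it easier to confirm the pattern in each ocssd.log. The following is a minimal sketch, not the author's tooling: the regular expression is built only from the message quoted above, and the function name `find_cssd_signals` and the sample lines are hypothetical.

```python
import re

# Pattern derived from the message quoted in the article. The signal number
# and thread name are captured because other values may appear in practice.
CSSD_SIGNAL_RE = re.compile(r"clssscExit: CSSD signal (\d+) in thread (\S+)")


def find_cssd_signals(lines):
    """Return (line_number, signal, thread) for each clssscExit signal line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = CSSD_SIGNAL_RE.search(line)
        if match:
            hits.append((lineno, int(match.group(1)), match.group(2)))
    return hits


if __name__ == "__main__":
    # Hypothetical sample input; a real run would pass the lines of ocssd.log.
    sample = [
        "an unrelated log line",
        "clssscExit: CSSD signal 11 in thread GMClientListener",
    ]
    for lineno, signal, thread in find_cssd_signals(sample):
        print(f"line {lineno}: signal {signal} in thread {thread}")
```

Running the same scan against every node's log gives a quick way to check whether the failing thread is always GMClientListener.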
System Configuration
Host OS: HP‑UX B.11.31 U ia64
CRS version: 10.2.0.5.0 (no PSU applied)
Database version: 10.2.0.5.8 (PSU 13923855)
Storage: Veritas Storage Foundation with VCS cluster file system
Analysis of Individual Crashes
1. 24 April – xx1db01
No OSWatch tool was installed, so OS resource data was unavailable. System logs showed no errors; the database alert log was clean. The ocssd.log contained the above CSSD signal message. Oracle MOS identified a possible bug (Bug 9132429) that could cause deadlocks in the GM client listener.
2. 23 July – xx2db02
Analysis matched the April case: only the ocssd.log provided information, pointing again to the GMClientListener issue. MOS suggested possible host resource problems, but without OSWatch data no concrete cause was identified.
3. 27 July – xx3db02
The ocssd.log showed an “Authentication OSD error”, suggesting a failed authentication between nodes. A checklist ruled out firewall, authentication tools, directory permission changes, file system full, missing .oracle directory, and network packet issues.
4. 30 July – xx2db01
Collected ulimit settings:
root@xx2db01:/# ulimit -a
time(seconds) unlimited
file(blocks) unlimited
data(kbytes) 4194300
stack(kbytes) 392192
memory(kbytes) unlimited
coredump(blocks) 4194303
nofiles(descriptors) 20480
Kernel parameters related to file handling:
fcache_seqlimit_file 100
filecache_max 130470211584
filecache_min 13047017472
max_acct_file_size 2560000
maxfiles 20480
maxfiles_lim 20480
Kernel nproc values were normal. VMSTAT during the crash showed ~11 GB free memory on a 256 GB host, with occasional paging. The ocssd.bin process consumed about 8 GB of memory.
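The raw kernel parameter values are hard to read at a glance, so it can help to convert the file cache settings into GiB. The sketch below assumes that filecache_max and filecache_min are expressed in bytes; the helper names `parse_params` and `bytes_to_gib` are hypothetical.

```python
# Values copied from the kernel parameter listing in the article.
KERNEL_PARAMS = """\
filecache_max 130470211584
filecache_min 13047017472
"""


def parse_params(text):
    """Parse 'name value' lines into a dict of integers, skipping other lines."""
    params = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            params[parts[0]] = int(parts[1])
    return params


def bytes_to_gib(value):
    """Convert a byte count to GiB (assumes the parameter is in bytes)."""
    return value / 2**30


if __name__ == "__main__":
    for name, value in parse_params(KERNEL_PARAMS).items():
        print(f"{name}: {bytes_to_gib(value):.1f} GiB")
```

Under that assumption, filecache_max works out to roughly 121.5 GiB (about 130 GB in decimal units), which is a large share of the host's 256 GB.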
5. 19 August – xx3db01
Again, the CSSD signal 11 message appeared. VMSTAT indicated ~80 GB free memory, ruling out host-level memory exhaustion. OSWatch revealed ocssd.bin memory usage of ~8 GB at crash time.
Root Cause
Investigation uncovered a memory leak in the crsd.bin process (Bug 11704113) that gradually increases memory consumption of both crsd.bin and ocssd.bin after upgrading to Oracle 10.2.0.5. The leak eventually hits the HP‑UX kernel limit maxdsiz_64bit (≈8 GB), causing the ocssd.bin process to fail with signal 11.
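Because the leak grows gradually toward a fixed ceiling, periodic size samples can give a rough estimate of when a process will reach the limit. The following is a minimal sketch using a least-squares straight line. It assumes roughly linear growth, which the article does not confirm, and the sample values and the function name `days_until_limit` are hypothetical.

```python
def days_until_limit(samples, limit_gib):
    """Estimate days from the last sample until the size reaches limit_gib.

    samples: list of (day, size_gib) pairs in increasing day order.
    Returns None when the fitted trend is flat or shrinking.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one day")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in samples)
    slope = sxy / sxx  # fitted growth in GiB per day
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    day_at_limit = (limit_gib - intercept) / slope
    return max(0.0, day_at_limit - samples[-1][0])


if __name__ == "__main__":
    # Hypothetical readings, not measurements from the incident.
    readings = [(0, 2.0), (10, 3.0), (20, 4.0)]
    print(days_until_limit(readings, limit_gib=8.0))
```

An estimate like this is only as good as the linearity assumption, so it is better treated as a prompt to schedule a restart than as a prediction.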
Recommendations
Apply the CRS patch for Bug 11704113.
Monitor ocssd.bin memory usage and perform a manual CRS restart before the process reaches the kernel limit.
Reduce filecache_max from ~130 GB to around 10 GB to prevent excessive file system cache consumption.
Increase the CSSD log level to 4 for more detailed diagnostics: crsctl debug log css CSSD:4
Set ulimit for data and stack to unlimited for the Oracle user.
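The monitoring recommendation can be reduced to a simple threshold check on the process size relative to the kernel limit. This is a sketch under stated assumptions: the 75% and 90% thresholds are illustrative choices, not values from the article, and the function name `classify_usage` is hypothetical. How the current size is sampled (OSWatch, ps, or another tool) is left to the caller.

```python
KB_PER_GIB = 1024 * 1024


def classify_usage(used_kb, limit_kb, warn=0.75, critical=0.90):
    """Classify memory usage against a limit as 'ok', 'warn', or 'critical'.

    warn and critical are fractions of the limit; both are assumed defaults.
    """
    if limit_kb <= 0:
        raise ValueError("limit_kb must be positive")
    if not 0 < warn < critical <= 1:
        raise ValueError("require 0 < warn < critical <= 1")
    ratio = used_kb / limit_kb
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warn"
    return "ok"


if __name__ == "__main__":
    limit_kb = 8 * KB_PER_GIB  # roughly the ~8 GB ceiling described above
    for used_gib in (4, 6, 7.5):
        status = classify_usage(int(used_gib * KB_PER_GIB), limit_kb)
        print(f"{used_gib} GiB used: {status}")
```

A "warn" result would be the point to plan a manual CRS restart in a maintenance window, before the process gets close to the limit.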
Author
Pei Zhengfeng – Senior Oracle DBA at Beijing Haitai Qidian, member of the second‑line support team, responsible for on‑site maintenance, performance analysis, and issue resolution.
dbaplus Community
Enterprise-level professional community for Database, BigData, and AIOps. Daily original articles, weekly online tech talks, monthly offline salons, and quarterly XCOPS&DAMS conferences—delivered by industry experts.
