
Why Do Oracle RAC Nodes Crash? Uncovering Memory Leak Bugs in ocssd.bin

Four identical Oracle RAC clusters suffered repeated node crashes, each traced to an ocssd.bin failure caused by a memory leak in the CRSD/OCSSD processes (Bug 11704113). This article walks through the log analysis and the kernel parameters involved, then lists the patches and configuration changes recommended to prevent recurrence.

dbaplus Community

Background

A telecom business system in a province was deployed on four identical Oracle RAC (Real Application Clusters) configurations, each with the same host version, CRS version, and database version. The system ran stably for over three years, but starting on 24 April, nodes began crashing one after another.

Observed Failures

xx1db01 – 04‑24

xx2db01 – 07‑30

xx2db02 – 07‑23

xx3db01 – 08‑19

xx3db02 – 07‑27

xx4db01 – (no crash recorded)

xx4db02 – (no crash recorded)

Common Symptom

Each crash was triggered by an ocssd.bin process failure. The log consistently contained the message:

clssscExit: CSSD signal 11 in thread GMClientListener

Signal 11 (SIGSEGV) indicates a segmentation fault, that is, an invalid memory access. In these crashes it was consistent with a failed memory allocation inside the process.
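
When several nodes need triage, scanning each ocssd.log for this message can be done mechanically. The following Python sketch assumes only that the message text quoted above appears verbatim somewhere on a log line. The function name and the prefix on the sample lines are illustrative and are not taken from the actual logs.

```python
import re

# Matches the message quoted in the article. Whatever precedes it on a log
# line (timestamps, component tags) is not assumed to have a fixed shape.
CSSD_SIGNAL_RE = re.compile(r"clssscExit: CSSD signal (\d+) in thread (\w+)")

def find_cssd_signals(lines):
    """Return (line_number, signal, thread) for every matching log line."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        match = CSSD_SIGNAL_RE.search(line)
        if match:
            hits.append((lineno, int(match.group(1)), match.group(2)))
    return hits

# Hypothetical sample: only the message text itself comes from the article.
sample = [
    "[    CSSD] some unrelated line",
    "[    CSSD] clssscExit: CSSD signal 11 in thread GMClientListener",
]
print(find_cssd_signals(sample))
```

Returning the thread name alongside the signal makes it easy to confirm that every node failed in the same thread.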

System Configuration

Host OS: HP‑UX B.11.31 U ia64

CRS version: 10.2.0.5.0 (no PSU applied)

Database version: 10.2.0.5.8 (PSU 13923855)

Storage: Veritas Storage Foundation with VCS cluster file system

Analysis of Individual Crashes

1. 24 April – xx1db01

No OSWatch tool was installed, so OS resource data was unavailable. System logs showed no errors, and the database alert log was clean. The ocssd.log contained the CSSD signal 11 message described above. A search of My Oracle Support (MOS) identified a possible bug (Bug 9132429) that could cause deadlocks in the GM client listener.

2. 23 July – xx2db02

Analysis matched the April case: only the ocssd.log provided information, pointing again to the GMClientListener issue. MOS suggested possible host resource problems, but without OSWatch data no concrete cause was identified.

3. 27 July – xx3db02

The ocssd.log showed an “Authentication OSD error”, suggesting a failed authentication between nodes. A checklist ruled out firewall, authentication tools, directory permission changes, file system full, missing .oracle directory, and network packet issues.

4. 30 July – xx2db01

Collected ulimit settings:

root@xx2db01:/# ulimit -a
 time(seconds) unlimited
 file(blocks) unlimited
 data(kbytes) 4194300
 stack(kbytes) 392192
 memory(kbytes) unlimited
 coredump(blocks) 4194303
 nofiles(descriptors) 20480

Kernel parameters related to file handling:

fcache_seqlimit_file 100
 filecache_max 130470211584
 filecache_min 13047017472
 max_acct_file_size 2560000
 maxfiles 20480
 maxfiles_lim 20480
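
These raw values are easier to compare after conversion to a common unit. The sketch below converts the figures listed above to GiB. It assumes the ulimit values are in kbytes, as labeled in the output, and that the filecache_* values are in bytes, which matches the "~130 GB" figure quoted later in the article. The helper names are illustrative.

```python
KIB = 1024
GIB = 1024 ** 3

def kbytes_to_gib(kbytes):
    """Convert a ulimit value reported in kbytes to GiB."""
    return kbytes * KIB / GIB

def bytes_to_gib(n_bytes):
    """Convert a byte count to GiB."""
    return n_bytes / GIB

# Values copied from the output above.
ulimit_kbytes = {"data": 4194300, "stack": 392192}
# Assumed to be expressed in bytes.
kernel_bytes = {"filecache_max": 130470211584, "filecache_min": 13047017472}

for name, value in ulimit_kbytes.items():
    print(f"ulimit {name}: {kbytes_to_gib(value):.2f} GiB")
for name, value in kernel_bytes.items():
    print(f"{name}: {bytes_to_gib(value):.2f} GiB")
```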

Kernel nproc values were normal. The vmstat output captured during the crash showed ~11 GB of free memory on a 256 GB host, with occasional paging. The ocssd.bin process was consuming about 8 GB of memory.

5. 19 August – xx3db01

Again, the CSSD signal 11 message appeared. The vmstat output showed ~80 GB of free memory, ruling out host-wide memory exhaustion. OSWatch showed ocssd.bin using ~8 GB of memory at crash time.

Root Cause

Investigation uncovered a memory leak in the crsd.bin process (Bug 11704113) that gradually increases memory consumption of both crsd.bin and ocssd.bin after upgrading to Oracle 10.2.0.5. The leak eventually hits the HP‑UX kernel limit maxdsiz_64bit (≈8 GB), causing the ocssd.bin process to fail with signal 11.
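
Because the leak grows gradually, periodic memory samples can be extrapolated to estimate when a process would reach the limit. The following is a minimal sketch that assumes roughly linear growth and fits a least-squares line. The sample values are invented for illustration, and the 8 GB limit is the approximate figure quoted above.

```python
def hours_until_limit(samples, limit_gb):
    """Estimate hours from the latest sample until memory reaches limit_gb.

    samples: list of (hours_since_start, memory_gb) pairs, at least two.
    Returns None when the fitted trend is flat or decreasing.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one point in time")
    # Least-squares slope (GB per hour) and intercept of the fitted line.
    slope = sum((t - mean_t) * (m - mean_m) for t, m in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_m - slope * mean_t
    last_t = max(t for t, _ in samples)
    t_at_limit = (limit_gb - intercept) / slope
    return max(0.0, t_at_limit - last_t)

# Invented samples: memory grows by 0.5 GB every 24 hours.
samples = [(0, 5.0), (24, 5.5), (48, 6.0), (72, 6.5)]
print(hours_until_limit(samples, limit_gb=8.0))
```

A real leak may not grow linearly, so an estimate like this is best treated as a rough planning aid for scheduling a restart window.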

Recommendations

Apply the CRS patch for Bug 11704113.

Monitor ocssd.bin memory usage and perform a manual CRS restart before the process reaches the kernel limit.

Reduce filecache_max from ~130 GB to around 10 GB to prevent excessive file system cache consumption.

Increase the CSSD log level to 4 for more detailed diagnostics:

crsctl debug log css CSSD:4

Set ulimit for data and stack to unlimited for the Oracle user.
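
The monitoring recommendation can be sketched as a threshold check over process-listing output. The sketch assumes a listing with three whitespace-separated columns (virtual size in KB, PID, command name), such as an XPG4-style ps -e -o vsz,pid,comm might produce. The column layout, the 80 percent warning level, and the 8 GB limit are all assumptions to adapt to the actual environment.

```python
def find_large_processes(ps_output, name, limit_kb, warn_fraction=0.8):
    """Return (pid, vsz_kb) for processes named `name` at or above the warning level."""
    threshold_kb = limit_kb * warn_fraction
    flagged = []
    for line in ps_output.splitlines():
        fields = line.split(None, 2)
        if len(fields) < 3 or not fields[0].isdigit() or not fields[1].isdigit():
            continue  # skip the header and malformed lines
        vsz_kb, pid, command = int(fields[0]), int(fields[1]), fields[2].strip()
        # Compare on the base name so a full path to the binary also matches.
        if command.split("/")[-1] == name and vsz_kb >= threshold_kb:
            flagged.append((pid, vsz_kb))
    return flagged

# Hypothetical listing: PIDs and sizes are invented for illustration.
sample_output = """\
    VSZ   PID COMMAND
7340032  4321 ocssd.bin
 524288  4400 crsd.bin
"""
LIMIT_KB = 8 * 1024 * 1024  # approximate 8 GB limit cited in the article
print(find_large_processes(sample_output, "ocssd.bin", LIMIT_KB))
```

Run on a schedule, a check like this gives advance warning so that the CRS restart can be planned rather than forced by a crash.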

Author

Pei Zhengfeng – Senior Oracle DBA at Beijing Haitai Qidian, member of the second‑line support team, responsible for on‑site maintenance, performance analysis, and issue resolution.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
