
Why Did My OceanBase Cluster Crash? Uncovering OOM and Disk Errors

This article walks through a real‑world OceanBase cluster outage: the environment, the emergency restart steps, the log‑driven analysis that uncovered a disk I/O failure and a Linux OOM‑killer kill, and concrete mitigation measures to prevent similar incidents.

Aikesheng Open Source Community

Background

A client reported that an OceanBase business cluster (version 4.2.1) became unavailable at around 19:00. The cluster consists of three nodes:

OBServer1: 88.88.88.32

OBServer2: 88.88.88.33

OBServer3: 88.88.88.34

Emergency Handling Process

To restore service quickly, the operators deferred root‑cause analysis and first brought the exited observer processes back up on the two abnormal nodes.

Start OBServer2 (node 33)

# Switch to the admin user
su - admin

# Verify the observer process is not running
ps -ef | grep observer

# Start the observer from the install directory
# (it reads its config relative to the working directory)
cd /home/admin/oceanbase/
/home/admin/oceanbase/bin/observer

# Confirm the process is running
ps -ef | grep observer

Start OBServer3 (node 34)

# Switch to the admin user
su - admin

# Verify the observer process is not running
ps -ef | grep observer

# Start the observer from the install directory
cd /home/admin/oceanbase/
/home/admin/oceanbase/bin/observer

# Confirm the process is running
ps -ef | grep observer

Verify Cluster Status

After all three OBServer processes were up, the cluster status was checked via the metadata view:

# Connect to the sys tenant
obclient -h88.88.88.32 -P2883 -uroot@sys -pxxxxxxxxxxx -A

# Query server status
select gmt_modified, svr_ip, status, stop_time, start_service_time from __all_server;

The result should show status = 'active', stop_time = 0, and a non‑zero start_service_time for each node.
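On 4.x clusters, the data dictionary view DBA_OB_SERVERS exposes the same information in a friendlier form; a minimal alternative check, assuming the sys‑tenant session above:

# Equivalent check via the 4.x data dictionary (run in the obclient session)
select svr_ip, status, stop_time, start_service_time from oceanbase.DBA_OB_SERVERS;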

Fault Cause Investigation

Node 33 (OBServer2) Failure

OBServer Log (/home/admin/oceanbase/log/observer.log) shows that logging stopped at 2025‑04‑21 12:23:00 and contains the message:

clog disk may be hang or something error has happen!

This indicates a problem with the redo‑log disk.

OS Message Log reveals I/O errors on /dev/sda and a SCSI reset triggered by the smartpqi driver:

Apr 21 12:23:00 ob2 kernel: smartpqi 0000:5e:00.0: resetting scsi 15:1:0:0
Apr 21 12:23:00 ob2 kernel: print_req_error: I/O error, dev sda, sector 632512664
Apr 21 12:23:00 ob2 kernel: print_req_error: I/O error, dev sda, sector 805802552

Interpretation: the storage device experienced a bad block or hang, causing the redo‑log disk to become unavailable and the OBServer process to exit.
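For triage on similar disk incidents, the device can also be checked from the OS side; a minimal sketch, assuming smartmontools is installed and /dev/sda is the suspect device:

# Scan kernel messages for recent I/O errors and SCSI resets
dmesg -T | grep -iE 'i/o error|resetting scsi'

# Overall SMART health verdict for the suspect device
smartctl -H /dev/sda

# Reallocated and pending sector counters often reveal developing bad blocks
smartctl -a /dev/sda | grep -iE 'reallocated|pending|uncorrect'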

Node 34 (OBServer3) Failure

OBServer Log stopped at 2025‑05‑07 04:55:42, indicating the process had exited.

OS Message Log shows the process was killed by the Linux OOM‑killer:

May 7 04:55:42 ob3 kernel: Killed process 3391142 (OBServer) total-vm:216443928kB, anon-rss:187603704kB, file-rss:87512kB, shmem-rss:0kB

The node has 251 GB of total memory, and the OBServer memory limit is configured at 80% (≈200 GB). At the time of the OOM event, OBServer's resident set was about 178 GB (anon-rss: 187603704 kB), within the configured limit, but the kernel could not reclaim dirty page cache fast enough to satisfy a large allocation request, so it invoked the OOM‑killer.
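To confirm where the 80% ceiling comes from, the cluster memory parameters can be read from the sys tenant; a minimal sketch, assuming the limit is governed by the standard memory_limit_percentage parameter (default 80, which applies when the absolute memory_limit is 0):

# Show the cluster memory ceiling from the obclient session above
show parameters like 'memory_limit%';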

Key points about the OOM mechanism:

Page cache is reclaimable only after dirty pages are flushed to disk.

If dirty pages accumulate faster than the flush rate, the kernel may run out of instantly available memory even though total cache size is large.

When a process requests memory and the kernel cannot free enough pages, the OOM‑killer selects a victim (here, OBServer) and terminates it; the sketch below shows how to watch these counters.
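To see this pressure building before the kernel acts, the dirty and writeback counters can be watched directly; a minimal sketch using standard Linux interfaces:

# Snapshot of dirty and writeback volumes (values in kB)
grep -E '^(Dirty|Writeback):' /proc/meminfo

# Watch I/O and memory over time: persistently high 'bo' (blocks written out)
# alongside shrinking free/cache suggests flushing cannot keep up
vmstat 5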

Conclusions and Mitigation

Why the Cluster Became Unavailable

Node 33 crashed on 2025‑04‑21 12:23 due to a disk I/O error (bad block) on its redo‑log disk.

Node 34 crashed on 2025‑05‑07 04:55 because the OOM‑killer terminated OBServer after memory pressure and slow dirty‑page reclamation.

Node 33 had in fact been down since 2025‑04‑21 without being detected. Once node 34 also failed, two of the three nodes were down, the surviving node could not form the Paxos majority (two of three) required to elect leaders and commit logs, and the cluster stopped serving requests.

Mitigation Measures

Adjust Linux dirty‑page thresholds to limit the amount of dirty data that can occupy memory, reducing the risk of OOM during heavy write workloads:

# Add to /etc/sysctl.conf
vm.dirty_background_ratio = 10   # Start background writeback when dirty pages reach 10% of memory
vm.dirty_ratio = 20              # Block writers and flush synchronously once dirty pages reach 20%

# Apply the changes
sysctl -p
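To confirm the running values after applying:

# Print the active thresholds
sysctl vm.dirty_background_ratio vm.dirty_ratio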

Monitoring should include:

Regular checks of __all_server to ensure all nodes report status = active, stop_time = 0, and a valid start_service_time.

OS-level alerts for I/O errors and SCSI resets on storage devices.

Memory usage and dirty‑page statistics (e.g., vmstat, cat /proc/meminfo) to detect approaching OOM conditions before the kernel kills critical processes; a combined probe sketch follows below.
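These checks can be combined into a simple periodic probe; a minimal sketch, reusing the sys‑tenant connection from earlier (the dirty‑page threshold and alert delivery are placeholders to adapt):

#!/bin/bash
# Hypothetical health probe; adjust host, credentials, and alerting as needed.

# 1. Flag any server that is not active or has a non-zero stop_time.
obclient -h88.88.88.32 -P2883 -uroot@sys -pxxxxxxxxxxx -N -e \
  "select svr_ip, status, stop_time from oceanbase.__all_server" |
  awk '$2 != "active" || $3 != "0" { print "ALERT: server " $1 " status " $2 }'

# 2. Surface recent kernel I/O errors or SCSI resets on local disks.
dmesg -T | grep -iE 'i/o error|resetting scsi' | tail -n 5

# 3. Flag unusually high dirty-page volume as an early OOM warning
#    (the 8 GB threshold here is only an example).
dirty_kb=$(awk '/^Dirty:/ {print $2}' /proc/meminfo)
[ "$dirty_kb" -gt 8388608 ] && echo "ALERT: ${dirty_kb} kB of dirty pages"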

Tags: Linux, Cluster, OOM, OceanBase, Disk Failure
Written by: Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (the annual "1024" release), and continuously operates and maintains them.