OceanBase Timeout During Merge: Diagnosis, Emergency Handling, and Optimization
This article details a timeout incident in an OceanBase cluster during a merge operation, explains the emergency suspension and resumption steps, analyzes log and metric data to identify queue backlog and disk I/O saturation as root causes, and offers practical optimization recommendations.
1 Problem Background
At around 04:25, the business application reported java.sql.SQLException: Timeout errors against the OceanBase cluster, and OCP raised a large number of easy_connection_on_timeout_conn alerts. Batch SQL tasks were scheduled for this window, and the cluster was running a merge at the same time.
2 Emergency Plan
Because the batch tasks had higher priority, the merge was suspended at around 05:50, after which the batch jobs ran normally.
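As a rough sketch of that sequence (suspend, run the batch, resume), the two statements can be wrapped so that the resume is issued even if the batch fails. Here `execute` is a hypothetical callable that runs one statement in the sys tenant; it is an assumption for illustration, not part of OceanBase or of the original procedure. The statements it sends are the same two listed below.

```python
from contextlib import contextmanager
from typing import Callable, Iterator

SUSPEND_SQL = "ALTER SYSTEM SUSPEND MERGE"
RESUME_SQL = "ALTER SYSTEM RESUME MERGE"


@contextmanager
def merge_suspended(execute: Callable[[str], None]) -> Iterator[None]:
    """Suspend the merge for the duration of the with-block, then resume it.

    `execute` is a hypothetical helper that runs a single SQL statement
    in the sys tenant (for example through an existing DB connection).
    """
    execute(SUSPEND_SQL)
    try:
        yield
    finally:
        # Resume even when the batch raised, so the merge is never left suspended.
        execute(RESUME_SQL)


if __name__ == "__main__":
    issued = []  # stand-in for a real connection: just record the statements
    with merge_suspended(issued.append):
        issued.append("-- batch jobs run here")
    print(issued)
```

The try/finally is the point of the sketch: a manual procedure makes it easy to forget the resume step after a failed batch run.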
-- sys tenant execution
ALTER SYSTEM SUSPEND MERGE;
After the batch completed, the merge was resumed.
-- sys tenant execution
ALTER SYSTEM RESUME MERGE;
3 Problem Investigation
After the emergency actions, the root cause was investigated.
1. Check observer.log
Filtered the observer log for the relevant time window:
grep -i "sending error packet" observer.log
The log contained entries with err=-4012 and err=-6224, which indicate transaction timeout and rollback.
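When the grep returns many lines, tallying the error codes makes the dominant failure mode easier to see. The following is a minimal sketch that assumes each matching line contains the phrase "sending error packet" and an `err=<code>` field; the rest of the sample line layout is made up for illustration and may differ from real observer.log output.

```python
import re
from collections import Counter
from typing import Iterable

# Matches the err=<code> field; the surrounding line layout is an assumption.
ERR_RE = re.compile(r"\berr=(-?\d+)")


def count_error_codes(lines: Iterable[str]) -> Counter:
    """Count error codes on lines that mention 'sending error packet'."""
    counts: Counter = Counter()
    for line in lines:
        if "sending error packet" not in line.lower():
            continue
        match = ERR_RE.search(line)
        if match:
            counts[int(match.group(1))] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample lines, not copied from a real log.
    sample = [
        "WARN sending error packet(err=-4012, sess_id=1)",
        "WARN sending error packet(err=-6224, sess_id=2)",
        "WARN sending error packet(err=-4012, sess_id=3)",
        "INFO unrelated line err=-1",
    ]
    for code, n in count_error_codes(sample).most_common():
        print(code, n)
```

With a real log, the lines can be streamed from the file, for example `count_error_codes(open("observer.log", errors="replace"))`.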
2. Confirm Queue Backlog
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645
# optional clearer view
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645 | sed 's/,/,\n/g' | grep req_queue
grep ' dump tenant info(tenant={id:1001' observer.log.20241010042645 | sed 's/,/,\n/g' | grep multi_level_queue
The key metrics examined were req_queue:total_size, multi_level_queue:total_size, group_id = *, and queue_size; non-zero values indicate a backlog.
Conclusion: The direct cause of the SQL timeout was tenant queue backlog.
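The backlog check above can also be scripted, which helps when scanning many "dump tenant info" lines across a time window. This sketch assumes the two counters appear in the log as `req_queue:total_size=<n>` and `multi_level_queue:total_size=<n>`; the `=<n>` form and the sample line are assumptions, so adjust the pattern to match the actual log output.

```python
import re
from typing import Dict

# Assumed field format: "<queue name>:total_size=<integer>".
QUEUE_RE = re.compile(r"\b(req_queue|multi_level_queue):total_size=(\d+)")


def queue_backlog(line: str) -> Dict[str, int]:
    """Extract the total_size of each known queue from one log line."""
    return {name: int(size) for name, size in QUEUE_RE.findall(line)}


def has_backlog(line: str) -> bool:
    """True when any extracted queue size is non-zero."""
    return any(size > 0 for size in queue_backlog(line).values())


if __name__ == "__main__":
    # Hypothetical sample line, not copied from a real log.
    sample = (
        "dump tenant info(tenant={id:1001, "
        "req_queue:total_size=128 queue[0]=0, "
        "multi_level_queue:total_size=3 queue[0]=0})"
    )
    print(queue_backlog(sample), has_backlog(sample))
```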
3. Check tsar Logs
tsar -d 20241010 -i 1
The network retransmission rate on the alerting node exceeded 0.2, which contributed to the large number of easy_connection_on_timeout_conn alerts.
Disk sdb (the OB data disk) usage reached 100% between 04:20‑04:30, causing I/O saturation and queue buildup.
Conclusion
During the merge window, disk I/O was fully occupied. Concurrent batch jobs added further pressure, leading to queue accumulation. OceanBase’s RPC ack_timeout is set to 10 seconds; connections exceeding this are dropped, manifesting as SQL response timeouts.
4 Optimization Suggestions
Adjust the daily merge schedule so that it does not overlap with batch jobs. Merges increase disk I/O and batch tasks also consume resources, so running them together creates a performance bottleneck; schedule them in separate windows.
Reduce batch concurrency; run tasks sequentially to lower system load.
Consider business segmentation to isolate heavy workloads such as batch, merge, and backup.
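For the first suggestion, a simple overlap check can flag a schedule that puts the merge and the batch in the same window. This is a minimal sketch; the window times in the example are hypothetical (the article does not state the actual schedules), and windows that cross midnight are not handled.

```python
from datetime import time


def windows_overlap(a_start: time, a_end: time, b_start: time, b_end: time) -> bool:
    """Return True when two same-day windows [start, end) share any time.

    Assumes start < end for both windows, i.e. neither crosses midnight.
    """
    return a_start < b_end and b_start < a_end


if __name__ == "__main__":
    # Hypothetical schedules for illustration only.
    merge = (time(2, 0), time(6, 0))
    batch = (time(4, 0), time(5, 30))
    print(windows_overlap(*merge, *batch))

    # Moving the merge earlier so that it ends before the batch starts.
    moved_merge = (time(0, 0), time(3, 30))
    print(windows_overlap(*moved_merge, *batch))
```

Because the windows are half-open, a merge that ends exactly when the batch starts does not count as an overlap.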
References
[1] Queue field information: https://www.oceanbase.com/docs/common-oceanbase-database-cn-1000000000819396
Aikesheng Open Source Community