
Troubleshooting OceanBase No-Leader Alerts Caused by Network Bandwidth Saturation

This article walks step by step through an investigation of daily OceanBase no-leader alerts that turned out to be caused by network bandwidth saturation. It covers log analysis, clock synchronization warnings, and RPC backlog, and closes with practical fixes such as bandwidth expansion and backup throttling to restore cluster stability.

Aikesheng Open Source Community

1 Problem Description

A production OceanBase cluster generates a "no leader" alarm around 07:00 each day, accompanied by brief service timeouts. The OCP alarm keyword is "eg leader lease is expired".

2 Analysis Process

2.1 Common Checks for No‑Leader Situation

Following the official no-leader troubleshooting guide [1], first confirm that the replicas are indeed in a no-leader state, then work through the following possible causes:

observer.log contains obvious error messages.

Clock drift.

Deleted tenant, table, or partition.

Majority of replicas down.

Network issues.

Clog module fails to recover logs.

High load.

Clog disk full.

2.2 Check RS Logs

Search the Root Service log for the keyword "clock between rs and server not sync":

grep "clock between rs and server not sync" rootservice.log.20240613072655

The result shows a warning indicating a clock mismatch between the RS node and the server.

2.3 Check observer.log

Verify the clock desynchronization in observer.log:

grep -i "clock diff time is too large" observer.log.20240613070304

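The warning confirms a large clock difference. Because such a difference can be a symptom of delayed network messages rather than of real clock drift, the next step is to check network bandwidth pressure during the alarm period.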
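To see whether the warnings cluster around the 07:00 alarm window, the matches from the two grep commands above can be grouped by minute. The following is a minimal sketch; the helper name hits_per_minute is hypothetical, and it assumes each log line begins with a bracketed timestamp such as "[2024-06-13 07:03:04.123456]".

```shell
# Count lines containing a keyword, grouped by minute.
# Assumption: every line starts with "[YYYY-MM-DD HH:MM:SS.ffffff]".
hits_per_minute() {
    # $1: keyword to search for (case-insensitive); log lines come from stdin.
    # cut -c2-17 keeps "YYYY-MM-DD HH:MM" from the bracketed timestamp.
    grep -i -- "$1" \
        | cut -c2-17 \
        | sort \
        | uniq -c \
        | awk '{print $2, $3, $1}'
}
```

Example usage: hits_per_minute "clock diff time is too large" < observer.log.20240613070304. Each output line holds the date, the minute, and the number of matching lines in that minute.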

2.4 Examine tsar Logs

Run:

tsar -d 20240613 -i 1

The output shows outbound network traffic roughly ten times higher than normal.

2.5 Check RPC Message Backlog

Search for large "request doing" values in observer.log to detect RPC backlog:

grep 'RPC EASY STAT' observer.log.20240613070304 | awk -F 'request doing=' '{print $2}'

Some values reach the thousands, indicating significant RPC message accumulation.
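The awk command above prints the rest of each line after "request doing=", which still has to be scanned by eye. A minimal sketch that extracts only the numbers and summarizes them is shown below. The helper names are hypothetical, and the sketch assumes the value directly after "request doing=" is a plain integer.

```shell
# Print the largest "request doing" value found in RPC EASY STAT lines on stdin.
max_request_doing() {
    grep 'RPC EASY STAT' \
        | grep -o 'request doing=[0-9][0-9]*' \
        | cut -d= -f2 \
        | sort -n \
        | tail -n 1
}

# Count RPC EASY STAT lines whose "request doing" value exceeds a threshold.
count_backlog_over() {
    # $1: threshold, for example 1000
    grep 'RPC EASY STAT' \
        | grep -o 'request doing=[0-9][0-9]*' \
        | cut -d= -f2 \
        | awk -v t="$1" '$1 > t { n++ } END { print n + 0 }'
}
```

Example usage: max_request_doing < observer.log.20240613070304, or count_backlog_over 1000 < observer.log.20240613070304.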

2.6 Verify Network Interfaces

Use ip link or ifconfig to list interfaces and confirm that bond0 and bond1 are independent NICs.
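One way to check that two bonds do not share a physical port is to compare the "Slave Interface" entries the Linux bonding driver exposes under /proc/net/bonding. The sketch below is an illustration only; the helper names are hypothetical, and whether this matches the check performed in the original investigation is an assumption.

```shell
# List the member interfaces of a bond from its status file.
bond_slaves() {
    # $1: bonding status file, for example /proc/net/bonding/bond0
    awk -F': ' '/^Slave Interface:/ { print $2 }' "$1" | sort
}

# Print interfaces that appear in both bonds; no output means no overlap.
shared_slaves() {
    # $1, $2: bonding status files of the two bonds
    awk -F': ' '/^Slave Interface:/ {
        if (FNR == NR) seen[$2] = 1      # first file: remember its members
        else if ($2 in seen) print $2    # second file: report overlaps
    }' "$1" "$2"
}
```

Example usage: shared_slaves /proc/net/bonding/bond0 /proc/net/bonding/bond1.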

2.7 Check NIC Speed

Run:

ethtool bond0

The speed is reported as 10000Mb/s, confirming a 10 Gbps NIC.
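With the link speed known, a traffic figure can be turned into a utilization percentage to judge how close the link is to saturation. The sketch below assumes the traffic figure is in bytes per second and the speed is in Mb/s as printed by ethtool; the helper name link_utilization_pct is hypothetical.

```shell
# Convert a byte rate into a percentage of link capacity.
link_utilization_pct() {
    # $1: traffic in bytes per second
    # $2: link speed in Mb/s, for example 10000 for a 10 Gbps link
    awk -v bytes="$1" -v mbps="$2" \
        'BEGIN { printf "%.1f\n", bytes * 8 / (mbps * 1000000) * 100 }'
}
```

For example, link_utilization_pct 1000000000 10000 prints 80.0: 1,000,000,000 bytes per second is 8 Gbit/s, or 80% of a 10 Gbps link.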

2.8 Verify Routing

Execute:

ip route

The routing table is correct, which rules out misrouted traffic as the cause.

Conclusion: The daily backup saturates the network bandwidth and RPC messages pile up. This causes clock synchronization failures, which lead to lease expiration and the no-leader condition.

3 Solution

3.1 Expand Bandwidth

Increase network bandwidth to alleviate backup‑time pressure.

3.2 Backup Rate Limiting

Adjust backup_net_limit (0 means no limit) and backup_concurrency (default 10). Setting backup_concurrency to 1 prolongs the backup, so it takes longer to finish, but it mitigates the no-leader issue.

Do not modify sys_bkgd_net_percentage as it throttles all observer traffic.
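The two parameters could be changed with ALTER SYSTEM SET statements. The sketch below only prints the statements so they can be reviewed before being run. The helper name backup_throttle_sql and the 500M value are hypothetical, and the accepted value format for backup_net_limit is an assumption that should be checked against the documentation for the OceanBase version in use.

```shell
# Print the SQL statements for throttling backups; nothing is executed.
backup_throttle_sql() {
    # $1: value for backup_net_limit (500M below is only an illustration)
    # $2: value for backup_concurrency
    printf "ALTER SYSTEM SET backup_net_limit = '%s';\n" "$1"
    printf "ALTER SYSTEM SET backup_concurrency = %s;\n" "$2"
}
```

Example usage: backup_throttle_sql 500M 1 prints both statements, which can then be run from a SQL client session in the sys tenant.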

Reference

[1] No‑leader troubleshooting: https://www.oceanbase.com/docs/enterprise-oceanbase-database-cn-10000000000360700
