Troubleshooting Compaction Stuck Issue in OceanBase: Diagnosis and Resolution
This article details a step‑by‑step investigation of a compaction‑stuck problem in OceanBase, covering background, environment setup, view and log analysis, root‑cause identification related to clock drift, and the corrective actions taken to restore normal merging.
1 Background
At a customer site, an OBServer follower node was restarted and its clock service failed to start, so the node's clock fell more than 60 seconds behind the time source. During the next cluster merge, zone3 stayed in the COMPACTING state and the merge could not complete. The issue can be reproduced on both the Community and Enterprise editions of OceanBase.
2 Environment Information
OceanBase: 4.2.1.4
Architecture: 1-1-1
zone1: 10.186.64.161 (RS Leader)
zone2: 10.186.64.162
zone3: 10.186.64.163
Clock source: 10.186.64.160
3 View Inspection
Check Cluster Tenant Information
The cluster has five tenants, with 1001 and 1003 being META tenants.
select * from __all_tenant;
Check Tenant‑Level Merge Information
LAST_SCN indicates the version of the last completed merge, while GLOBAL_BROADCAST_SCN indicates the current merge version being broadcast.
LAST_SCN == GLOBAL_BROADCAST_SCN: current merge round has finished.
LAST_SCN != GLOBAL_BROADCAST_SCN: current merge round is still in progress and may be stuck.
All tenants are currently stuck in merge.
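The comparison rule above can be expressed as a small shell helper. This is only a sketch: merge_state is a hypothetical function name, not an OceanBase tool, and it assumes the two SCN values have already been read from cdb_ob_major_compaction. The second SCN in the usage line is a made-up value for illustration.

```shell
# merge_state LAST_SCN GLOBAL_BROADCAST_SCN
# Hypothetical helper: prints "finished" when both SCNs match,
# otherwise "in_progress" (the round is still running or may be stuck).
merge_state() {
    # Compare as strings so that 19-digit SCNs never overflow or lose precision
    if [ "$1" = "$2" ]; then
        echo "finished"
    else
        echo "in_progress"
    fi
}

# Usage: the first argument is an illustrative older SCN (assumption)
merge_state 1718079280404973713 1718165680404973713
```

A result of in_progress alone does not prove the merge is stuck; it only says the round has not finished, so the later checks are still needed.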
select * from cdb_ob_major_compaction;
Check Tenant Role Information
Identify the leader of the first log stream for the stuck tenant.
Check Tablets Whose compaction_scn Is Less Than GLOBAL_BROADCAST_SCN
The RS decides that a merge has completed by verifying that the tablet versions in every zone have been raised to the current merge version; once they have, it updates __all_zone_merge_info and sets is_merging to false.
The virtual table __all_virtual_tablet_meta_table (under SYS/META tenant) can be queried without tenant switching.
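The same "compaction_scn below the broadcast SCN" test can be mimicked offline on exported rows. The sketch below assumes a hypothetical text export with one "tablet_id compaction_scn" pair per line; count_lagging and the sample tablet IDs and older SCN are illustrative only. It uses the shell's integer comparison rather than awk, because many awk implementations store numbers as doubles and cannot represent 19-digit SCNs exactly.

```shell
# count_lagging BROADCAST_SCN
# Hypothetical helper: reads "tablet_id compaction_scn" pairs from stdin and
# prints how many tablets have compaction_scn below BROADCAST_SCN.
count_lagging() {
    target=$1
    count=0
    while read -r tablet_id scn; do
        # Skip blank or incomplete lines
        [ -n "$scn" ] || continue
        # Shell integer comparison is 64-bit, enough for SCN values
        if [ "$scn" -lt "$target" ]; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}

# Usage with illustrative rows (tablet IDs and the older SCN are made up)
count_lagging 1718165680404973713 <<'EOF'
200001 1718165680404973713
200002 1718079280404973713
200003 1718079280404973713
EOF
```

A non-zero count for a tenant means at least one replica has not reported the current merge version yet.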
select count(*) from __all_virtual_tablet_meta_table where tenant_id = 1 and compaction_scn < 1718165680404973713;
Check Zone‑Level Merge Information
select * from cdb_ob_zone_major_compaction;
Check Merge Diagnosis Information
RS_UNCOMPACTED indicates tablets that have not yet reached the current merge version; use GV$OB_COMPACTION_PROGRESS to see if merges are still running.
select * from __all_virtual_compaction_diagnose_info where create_time >= '2024-06-12';
4 Log Inspection
1. Verify tablets not pushed to current merge version
grep --color=always "replica not merged" rootservice.log | tail -10
2. Identify unmerged count per zone
grep "YB420ABA40A1-00061A34C8AC5770-0-0" rootservice.log* | grep --color=always "unmerged_cnt" | grep -E "zone1|zone2|zone3"
zone3 shows 891 unmerged tablets.
3. Confirm tenant initiated merge with broadcast SCN
grep "try to schedule merge" observer.log.202406121* | grep "tenant_id" | grep "scn:{val:"
4. Verify memtable snapshot version before merge
grep --color=always "ready for flush" observer.log* | grep -w T1 | grep --color=always "snapshot_version:{val" | tail -1
The snapshot version is greater than the merge SCN, which indicates that the memtable has been frozen.
5. Check if dump/merge sstable was generated
grep "sstable merge finish" observer.log* | grep -v "ret=0" | grep --color=always "ret="
6. Verify tenant merge report
grep "REPORT: batch update tablets" observer.log | grep "ret=-"
7. Find error trace_id and remote address
grep "YB420ABA40A3-00061A354A6E3DD4-0-0" observer.log* | grep "original error message"
Remote address is 10.186.64.161.
8. Search for packet wait timeout
grep "YB420ABA40A3-00061A354A6E3DD4-0-0" observer.log* | grep "packet wait too much time"
9. Check clock offset with time source
clockdiff 10.186.64.160
zone3 lags the clock source by 65 seconds.
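The clockdiff check can be turned into a pass/fail test. The following is a minimal sketch, assuming the clockdiff output contains a field of the form delta=<N>ms/<M>ms (the format printed by the iputils version; other builds may differ). clock_offset_ok is a hypothetical helper, the 100 ms limit in the usage line is an assumed threshold rather than a value from this case, and the sign of the delta is ignored because only the size of the offset matters here.

```shell
# clock_offset_ok MAX_MS
# Hypothetical helper: reads clockdiff-style output from stdin.
# Returns 0 if |delta| <= MAX_MS, 1 if it is larger, 2 if no delta was found.
clock_offset_ok() {
    max_ms=$1
    # Extract the first delta value in milliseconds, keeping an optional sign
    delta=$(sed -n 's/.*delta=\(-\{0,1\}[0-9][0-9]*\)ms.*/\1/p' | head -n 1)
    # No delta field in the input: report a distinct failure code
    [ -n "$delta" ] || return 2
    # Drop a leading minus sign to get the absolute value
    abs=${delta#-}
    [ "$abs" -le "$max_ms" ]
}

# Usage (100 ms is an assumed limit, adjust to your own requirement):
# clockdiff 10.186.64.160 | clock_offset_ok 100 || echo "clock offset too large"
```

With a 65-second lag such as the one found on zone3, the helper would return 1 for any limit below 65000 ms.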
5 Solution
Temporarily move tenant leader to zone3 node (not recommended).
Adjust system time on zone3 node to correct value (recommended).
1. Restore correct time
systemctl stop ntpd
date && ntpdate 10.186.64.160 && date
systemctl start ntpd && systemctl status ntpd
clockdiff 10.186.64.160
2. Verify merge status
select * from cdb_ob_zone_major_compaction;
zone3 shows successful merge.
6 Conclusion
The merge was stuck because the __all_tablet_meta_table update timed out (OB_TRANS_TIMEOUT) due to a 65‑second clock drift on the zone3 OBServer, preventing compaction_scn reporting and leaving the zone in COMPACTING state.
7 Optimization Measures
Properly configure clock source and ensure clock service auto‑starts.
Monitor clock‑delay related alerts.
Tip
Use obdiag or OBStack for compaction‑stuck analysis.
Aikesheng Open Source Community