Troubleshooting Compaction Stuck Issue in OceanBase: Diagnosis and Resolution

This article details a step‑by‑step investigation of a compaction‑stuck problem in OceanBase, covering background, environment setup, view and log analysis, root‑cause identification related to clock drift, and the corrective actions taken to restore normal merging.

Aikesheng Open Source Community

Oct 15, 2024

1 Background

A customer restarted an OBServer follower node, and the clock service on that node failed to start afterward, leaving the node more than 60 seconds behind the time source. During the next cluster merge, zone3 stayed in the COMPACTING state and the merge could not complete. The issue can be reproduced on both the Community and Enterprise editions of OceanBase.

2 Environment Information

OceanBase: 4.2.1.4

Architecture: 1-1-1

zone1: 10.186.64.161 (RS Leader)

zone2: 10.186.64.162

zone3: 10.186.64.163

Clock source: 10.186.64.160

3 View Inspection

Check Cluster Tenant Information

The cluster has five tenants, with 1001 and 1003 being META tenants.

select * from __all_tenant;

Check Tenant‑Level Merge Information

LAST_SCN indicates the version of the last completed merge, while GLOBAL_BROADCAST_SCN indicates the current merge version being broadcast.

LAST_SCN == GLOBAL_BROADCAST_SCN: current merge round has finished.

LAST_SCN != GLOBAL_BROADCAST_SCN: current merge round is still in progress and may be stuck.

All tenants are currently stuck in merge.

select * from cdb_ob_major_compaction;
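The comparison rule above can be applied mechanically to exported rows. The following is a minimal sketch, assuming the columns tenant_id, last_scn and global_broadcast_scn have been dumped from cdb_ob_major_compaction as whitespace-separated text; the function name stuck_tenants and the sample rows are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "tenant_id last_scn global_broadcast_scn" rows
# on stdin and prints the tenants whose merge round has not finished.
stuck_tenants() {
    while read -r tenant_id last_scn global_broadcast_scn; do
        # Skip blank lines.
        if [ -z "$tenant_id" ]; then continue; fi
        # LAST_SCN != GLOBAL_BROADCAST_SCN means the round is still in progress.
        # A string comparison is enough because only equality matters here.
        if [ "$last_scn" != "$global_broadcast_scn" ]; then
            echo "$tenant_id"
        fi
    done
}

# Made-up sample rows: tenant 1 has finished, tenant 1002 has not.
printf '%s\n' \
    '1 1718165680404973713 1718165680404973713' \
    '1002 1718079277852417000 1718165680404973713' |
    stuck_tenants
```

A tenant listed here is only "in progress"; whether it is actually stuck still has to be judged from how long the round has been running.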

Check Tenant Role Information

Identify the leader of the first log stream for the stuck tenant.

Check Tablets with compaction_scn Less Than GLOBAL_BROADCAST_SCN

The RS decides that a merge has completed by checking that the tablet versions in each zone have been raised to the current merge version; once they have, it updates __all_zone_merge_info and sets is_merging to false.

The virtual table __all_virtual_tablet_meta_table (under SYS/META tenant) can be queried without tenant switching.

select count(*) from __all_virtual_tablet_meta_table where tenant_id = 1 and compaction_scn < 1718165680404973713;
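The count above shows how many tablets lag, but not where they are. As a rough sketch, the same comparison can be broken down per server on exported rows. It assumes the svr_ip and compaction_scn columns have been dumped from __all_virtual_tablet_meta_table as whitespace-separated text; the function name and the sample rows are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: reads "svr_ip compaction_scn" rows on stdin and prints
# "svr_ip count" for tablets whose compaction_scn is below the SCN in $1.
lagging_tablets_per_server() {
    broadcast_scn=$1
    while read -r svr_ip compaction_scn; do
        if [ -z "$svr_ip" ]; then continue; fi
        # Integer tests in bash and dash use 64-bit arithmetic, so 19-digit
        # SCNs compare exactly. awk would convert them to floating point and
        # could lose the low digits, so the comparison is done in the shell.
        if [ "$compaction_scn" -lt "$broadcast_scn" ]; then
            echo "$svr_ip"
        fi
    done | sort | uniq -c | awk '{ print $2, $1 }'
}

# Made-up sample rows; the second row is one below the broadcast SCN.
printf '%s\n' \
    '10.186.64.161 1718165680404973713' \
    '10.186.64.163 1718165680404973712' \
    '10.186.64.163 1718079277852417000' |
    lagging_tablets_per_server 1718165680404973713
```

If all lagging tablets sit on one server, as in this case, attention can go straight to that node's logs.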

Check Zone‑Level Merge Information

select * from cdb_ob_zone_major_compaction;

Check Merge Diagnosis Information

RS_UNCOMPACTED indicates tablets that have not yet reached the current merge version; use GV$OB_COMPACTION_PROGRESS to see if merges are still running.

select * from __all_virtual_compaction_diagnose_info where create_time >= '2024-06-12';
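To choose the time window for queries like the one above, it can help to translate the broadcast SCN into wall-clock time. The sketch below assumes that an OceanBase 4.x SCN value is a Unix timestamp in nanoseconds, which is consistent with the SCN and the date used in this article, and that GNU date is available; scn_to_utc is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: converts an SCN to a UTC timestamp.
# Assumption: the SCN is a Unix timestamp in nanoseconds.
# Requires GNU date for the "-d @seconds" form.
scn_to_utc() {
    scn=$1
    # Drop the nanosecond part; shell arithmetic is 64-bit in bash and dash.
    seconds=$((scn / 1000000000))
    date -u -d "@$seconds" '+%Y-%m-%d %H:%M:%S'
}

# The broadcast SCN used in the tablet query earlier in this section.
scn_to_utc 1718165680404973713
```

Under that assumption the broadcast SCN corresponds to 2024-06-12 04:14:40 UTC, which matches the date used in the diagnosis query.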

4 Log Inspection

1. Verify tablets not pushed to current merge version

grep --color=always "replica not merged" rootservice.log | tail -10

2. Identify unmerged count per zone

grep "YB420ABA40A1-00061A34C8AC5770-0-0" rootservice.log* | grep --color=always "unmerged_cnt" | grep -E "zone1|zone2|zone3"

zone3 shows 891 unmerged tablets.

3. Confirm tenant initiated merge with broadcast SCN

grep "try to schedule merge" observer.log.202406121* | grep "tenant_id" | grep "scn:{val:"

4. Verify memtable snapshot version before merge

grep --color=always "ready for flush" observer.log* | grep -w T1 | grep --color=always "snapshot_version:{val" | tail -1

The snapshot version is greater than the merge SCN, which indicates that the memtable has been frozen.

5. Check if dump/merge sstable was generated

grep "sstable merge finish" observer.log* | grep -v "ret=0" | grep --color=always "ret="

6. Verify tenant merge report

grep "REPORT: batch update tablets" observer.log | grep "ret=-"

7. Find error trace_id and remote address

grep "YB420ABA40A3-00061A354A6E3DD4-0-0" observer.log* | grep "original error message"

Remote address is 10.186.64.161.

8. Search for packet wait timeout

grep "YB420ABA40A3-00061A354A6E3DD4-0-0" observer.log* | grep "packet wait too much time"

9. Check clock offset with time source

clockdiff 10.186.64.160

zone3 lags the clock source by 65 seconds.
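A check like this can be turned into a simple pass/fail test. The following is a minimal sketch that takes a signed offset in milliseconds, for example the delta reported by a tool such as clockdiff, and compares its magnitude with a limit. The function name, the sign convention and the 2000 ms limit in the example are assumptions; use the tolerance documented for your OceanBase version.

```shell
#!/bin/sh
# Hypothetical helper: reports whether a clock offset exceeds a limit.
#   $1  signed offset in milliseconds (negative = local clock lags)
#   $2  allowed magnitude in milliseconds (caller-supplied assumption)
clock_offset_status() {
    offset_ms=$1
    limit_ms=$2
    # Use the absolute value so lagging and leading clocks are treated alike.
    if [ "$offset_ms" -lt 0 ]; then
        offset_ms=$((0 - offset_ms))
    fi
    if [ "$offset_ms" -gt "$limit_ms" ]; then
        echo "DRIFT: ${offset_ms} ms exceeds limit of ${limit_ms} ms"
    else
        echo "OK: ${offset_ms} ms within limit of ${limit_ms} ms"
    fi
}

# zone3 in this article lagged by 65 seconds; 2000 ms is an illustrative limit.
clock_offset_status -65000 2000
```

The same function could be called from a periodic job on every OBServer so that a failed clock service is noticed before the next merge.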

5 Solution

Temporarily move tenant leader to zone3 node (not recommended).

Adjust system time on zone3 node to correct value (recommended).

1. Restore correct time

systemctl stop ntpd
date && ntpdate 10.186.64.160 && date
systemctl start ntpd && systemctl status ntpd
clockdiff 10.186.64.160

2. Verify merge status

select * from cdb_ob_zone_major_compaction;

zone3 shows successful merge.

6 Conclusion

The merge was stuck because the __all_tablet_meta_table update timed out (OB_TRANS_TIMEOUT) due to a 65‑second clock drift on the zone3 OBServer, preventing compaction_scn reporting and leaving the zone in COMPACTING state.

7 Optimization Measures

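Configure the clock source correctly and make sure the clock service starts automatically at boot (for example, with systemctl enable ntpd on hosts that run ntpd under systemd).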

Monitor clock‑delay related alerts.

Tip

Use obdiag or OBStack for compaction‑stuck analysis.
