Analyzing OceanBase Error Logs to Locate Error Causes
This article explains the types and formats of OceanBase logs, how to identify the relevant log files, and how to interpret log fields such as trace_id, lt, and dc. It then gives step-by-step methods for using error codes and trace IDs to pinpoint the root cause of an error.
OceanBase generates a large number of logs at various levels during runtime. When errors occur, locating the root cause among these logs can be challenging. This guide describes how to find the desired error information in OceanBase error logs.
1. Log Files
OceanBase logs are divided into three categories:
Election module logs: stored in election.log.
RootService (total control service) module logs: stored in rootservice.log.
Startup and runtime logs: stored in observer.log, which holds the logs from all other modules.
Each category contains two types of log files:
.log and .log.YYYYmmDDHHMMSS contain logs of all levels.
.log.wf and .log.wf.YYYYmmDDHHMMSS contain only the WARN, USER_ERROR, and ERROR levels (requires enable_syslog_wf set to true).
Each log type writes to its corresponding .log (or .log.wf) file; when the file reaches 256 MB, it is renamed with a .YYYYmmDDHHMMSS suffix and a new file is created.
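The naming rules above can be sketched as a small classifier. This is only an illustration: the function name describe_log_file and the returned keys are hypothetical, and the pattern assumes module names consist of lowercase letters and underscores.

```python
import re
from datetime import datetime

# Matches names such as "rootservice.log", "election.log.wf",
# or a rotated file like "rootservice.log.20221008232523".
LOG_NAME = re.compile(
    r"^(?P<module>[a-z_]+)\.log(?P<wf>\.wf)?(?:\.(?P<stamp>\d{14}))?$"
)

def describe_log_file(name):
    """Return module, warn-only flag, and rotation time for a log file name.

    Returns None when the name does not follow the naming scheme.
    """
    m = LOG_NAME.match(name)
    if m is None:
        return None
    stamp = m.group("stamp")
    return {
        "module": m.group("module"),
        # True for .log.wf files (WARN, USER_ERROR, ERROR only).
        "warn_only": m.group("wf") is not None,
        # The suffix is the rotation timestamp; None for the active file.
        "rotated_at": datetime.strptime(stamp, "%Y%m%d%H%M%S") if stamp else None,
    }
```

A file whose rotated_at is None is the one currently being written, so it is usually the first place to look for a recent error.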
Listing the log directory with ls -l shows files such as:
-rw-r--r-- 1 admin admin 18688919 Oct 9 02:08 election.log
-rw-r--r-- 1 admin admin 4998884 Oct 9 02:07 election.log.wf
-rw-r--r-- 1 admin admin 75158675 Oct 9 02:08 rootservice.log
-rw-r--r-- 1 admin admin 268437081 Oct 8 23:25 rootservice.log.20221008232523
-rw-r--r-- 1 admin admin 61688030 Oct 9 02:08 rootservice.log.wf
... (other log files)
2. Log Formats
Based on log level, OceanBase uses two formats:
DEBUG, TRACE, INFO format: [time] log_level [module_name] file_name:line_no [thread_id][coroutine_id][Ytraceid0-traceid1] [lt=last_log_print_time] [dc=dropped_log_count] log_data
WARN, USER_ERROR, ERROR format (adds function_name field): [time] log_level [module_name] function_name (file_name:line_no) [thread_id][coroutine_id][Ytraceid0-traceid1] [lt=last_log_print_time] [dc=dropped_log_count] log_data
Key fields:
trace_id: a unique identifier for a SQL statement across the cluster, e.g., YB420ABA3C91-0005E98FC3148792.
lt (last_log_print_time): for asynchronous logging it records the time spent formatting the log; for synchronous logging it records the time spent writing the previous log to disk.
dc (dropped_log_count): number of logs that failed to be written since the previous successful log (only present in asynchronous logs).
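To make the two layouts concrete, here is a minimal parsing sketch. The regular expression is derived only from the formats described above and has not been checked against every OceanBase version; the function name parse_log_line and the dictionary keys are hypothetical. Because dc appears only in asynchronous logs, the pattern treats it as optional.

```python
import re

# One pattern for both layouts: the WARN/USER_ERROR/ERROR layout carries
# "function_name (file:line)", the DEBUG/TRACE/INFO layout only "file:line".
LOG_LINE = re.compile(
    r"^\[(?P<time>[^\]]+)\]\s+"
    r"(?P<level>[A-Z_]+)\s+"
    r"\[(?P<module>[^\]]+)\]\s+"
    r"(?:(?P<function>\S+)\s+\((?P<file_w>[^:()\s]+):(?P<line_w>\d+)\)"
    r"|(?P<file_i>[^:\s]+):(?P<line_i>\d+))\s+"
    r"\[(?P<thread_id>\d+)\]\[(?P<coroutine_id>\d+)\]\[(?P<trace_id>[^\]]*)\]\s+"
    r"\[lt=(?P<lt>\d+)\]\s*(?:\[dc=(?P<dc>\d+)\]\s*)?"
    r"(?P<data>.*)$"
)

def parse_log_line(line):
    """Split one log line into its fields; return None if it does not match."""
    m = LOG_LINE.match(line.strip())
    if m is None:
        return None
    g = m.groupdict()
    return {
        "time": g["time"],
        "level": g["level"],
        "module": g["module"],
        "function": g["function"],  # None for DEBUG/TRACE/INFO lines
        "file": g["file_w"] or g["file_i"],
        "line_no": int(g["line_w"] or g["line_i"]),
        "thread_id": int(g["thread_id"]),
        "coroutine_id": int(g["coroutine_id"]),
        "trace_id": g["trace_id"],
        "lt": int(g["lt"]),
        "dc": int(g["dc"]) if g["dc"] is not None else None,
        "data": g["data"],
    }
```

A non-zero dc in the parsed result is worth noticing during analysis, because it means some log entries were dropped before this line and the sequence you are reading may be incomplete.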
3. Analyzing Logs
When an SQL execution fails, OceanBase returns an error code and message, which may be vague. To obtain clearer information, follow two steps.
3.1 Find trace_id by error code
Search for the error code (with a leading minus sign) in the following files, in order:
observer.log.wf
If not found, rootservice.log.wf
If still not found, election.log.wf
Example: creating a resource pool returns error 4624. Use:
grep "ret=-4624" observer.log.wf
The grep output contains the trace_id YB420ABA3C91-0005E98FC5948785.
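The same lookup can be scripted when several files have to be checked. The sketch below is one possible approach, not an OceanBase tool: the function names and the log_dir parameter are hypothetical, and the trace_id pattern assumes the bracketed form shown in the examples (a leading Y followed by two alphanumeric parts). Unlike the plain grep, it also refuses to match longer codes that merely start with the same digits.

```python
import re
from pathlib import Path

# Files are checked in the order recommended in this section.
SEARCH_ORDER = ["observer.log.wf", "rootservice.log.wf", "election.log.wf"]

# Assumed shape of a bracketed trace_id, e.g. [YB420ABA3C91-0005E98FC5948785].
TRACE_ID = re.compile(r"\[(Y[0-9A-Za-z]+-[0-9A-Za-z]+)\]")

def trace_ids_for_error(lines, error_code):
    """Return the distinct trace_ids of lines containing ret=-<error_code>."""
    needle = re.compile(r"ret=-" + re.escape(str(error_code)) + r"(?!\d)")
    found = []
    for line in lines:
        if needle.search(line):
            m = TRACE_ID.search(line)
            if m and m.group(1) not in found:
                found.append(m.group(1))
    return found

def find_trace_ids(log_dir, error_code):
    """Check each .log.wf file in order; stop at the first one with a hit."""
    for name in SEARCH_ORDER:
        path = Path(log_dir) / name
        if not path.exists():
            continue
        with path.open(errors="replace") as fh:
            ids = trace_ids_for_error(fh, error_code)
        if ids:
            return name, ids
    return None, []
```

If more than one trace_id comes back, the error occurred for several statements; comparing the log timestamps with the time the failing SQL was issued helps pick the right one.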
3.2 Find error logs by trace_id
After obtaining the trace_id, grep for it in the .log.wf files, starting with observer.log.wf. If the needed information is not in the first file, continue with rootservice.log.wf and election.log.wf. This ensures you capture the full sequence of logs related to the failing SQL.
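When the relevant entries are spread over several files, it can help to merge them into one timeline. The following sketch assumes every line starts with the bracketed timestamp described in section 2, so that plain string comparison orders the entries chronologically; the function name lines_for_trace and its input shape are hypothetical.

```python
import re

# Leading "[YYYY-MM-DD HH:MM:SS.ffffff]" timestamp of a log line.
TIMESTAMP = re.compile(r"^\[([^\]]+)\]")

def lines_for_trace(lines_by_file, trace_id):
    """Collect every line carrying trace_id and order the hits by timestamp.

    lines_by_file maps a file name to an iterable of its lines, for example
    an open file handle. Returns (timestamp, file_name, line) tuples.
    """
    marker = "[" + trace_id + "]"
    hits = []
    for name, lines in lines_by_file.items():
        for line in lines:
            if marker not in line:
                continue
            m = TIMESTAMP.match(line)
            stamp = m.group(1) if m else ""
            hits.append((stamp, name, line.rstrip("\n")))
    # The timestamp format is fixed-width, so a string sort is chronological.
    hits.sort(key=lambda hit: hit[0])
    return hits
```

Passing open handles for observer.log.wf, rootservice.log.wf, and election.log.wf as the dictionary values yields a single ordered view of the failing statement across all three modules.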
4. Existing Problems
In practice, OceanBase error logs sometimes do not state a clear cause, so tracing the issue depends on operator experience and a guess-and-verify approach.
Example: a BOOTSTRAP command fails with ERROR 4012 (HY000): Timeout. The most relevant log entry is:
[2022-10-09 09:05:32.052665] WARN [BOOTSTRAP] execute_bootstrap (ob_bootstrap.cpp:759) [16840][446][YB420ABA3C91-0005EA9633379D17] [lt=21] [dc=0] failed to wait all rs online(ret=-4622)
The issue was that all OBServer instances were started with zone z1, while the BOOTSTRAP command specified zones z1, z2, and z3, so the cluster kept waiting for RootService nodes that did not exist.
5. Expectations for Future Versions
Better error cause identification: logs should record the precise reason for failures to aid operations.
Include call‑stack hierarchy in .log.wf entries so that the method that triggered the error can be quickly located.
Example of a desired call‑stack prefix:
[2022-10-09 06:39:25.317429] WARN 3.2.1.1.1 log_user_error_and_warn (ob_rpc_proxy.cpp:300) [3414][2255][YB420ABA3C91-0005E98FC5948785] [lt=5] [dc=0] machine resource 'z1' is not enough to hold a new unit
... (subsequent logs with decreasing hierarchy numbers)
Including such hierarchical identifiers would allow operators to pinpoint the exact method where the error originated, improving troubleshooting efficiency.
Aikesheng Open Source Community
The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.