
Say Goodbye to Repeated Pitfalls with Our Open‑Source AI Skill for Database Troubleshooting

The article introduces starrocks‑debug‑skills, an open‑source, three‑layer knowledge base (Skills, Cases, Tools) that captures real‑world StarRocks troubleshooting experience, shows how AI assistants can use it to diagnose issues such as import timeouts, version errors, and compaction slowdowns, and explains how to contribute new cases.

StarRocks

In production environments, troubleshooting knowledge is often scattered across Slack messages, post‑mortem reports, and personal notes, leading to repeated investigations, uneven response quality, knowledge loss, and bottlenecks when experts are unavailable.

Why a Structured Knowledge Base?

Static runbooks become outdated and lack context. Engineers need a path that guides them from symptom to hypothesis, then to diagnostic commands, and finally to the next step.

Three‑Layer AI‑Ready Knowledge Base

Skills: Systematic troubleshooting guides for each problem domain (e.g., query failures, import timeouts, Compaction tuning). Each skill includes a decision‑tree style flow.

Cases: Over 25 anonymized real‑world incidents, each documenting symptom, investigation steps, root cause, solution, and lessons learned.

Tools: Ready‑to‑copy diagnostic commands (log search, profile collection, stack capture, network checks, information_schema queries).

The entry file SKILL.md maps a given symptom to the appropriate skill and supplies the exact commands to run.
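As a rough illustration of the symptom-to-skill lookup, the sketch below searches a local copy of SKILL.md for a symptom string. The function name lookup_symptom is hypothetical, and the internal layout of SKILL.md is not described here, so this is a plain literal text search rather than the repository's own mechanism.

```shell
# Print SKILL.md lines that mention a symptom, with three lines of context.
# Args: <path_to_SKILL.md> <symptom text>
# Matching is literal (-F) and case-insensitive (-i); exit status is 1 if
# nothing matches. lookup_symptom is an illustrative helper, not a repo tool.
lookup_symptom() {
  grep -n -i -F -A 3 -- "$2" "$1"
}

# Usage:
# lookup_symptom ./starrocks-debug-skills/SKILL.md "Reached timeout"
```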

Scenario 1 – Import Timeout

The load fails with the error Reached timeout=30000ms. The skill locates the bottleneck in the write path and suggests checking thread‑pool saturation with:

# Check BE thread‑pool saturation
curl -s http://<be_ip>:<be_http_port>/metrics | grep "thread_pool"
# Key pools: async_delta_writer, memtable_flush, segment_replicate_sync
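To narrow the output to the three pools named above, a small filter can be chained after the same grep. This is a sketch that assumes the pool names appear literally in the metric lines; filter_key_pools is a hypothetical helper, and the exact metric names depend on the StarRocks version.

```shell
# Keep only metric lines that mention one of the three write-path pools.
# Reads /metrics text on stdin and drops "# HELP" / "# TYPE" comment lines.
# Assumption: the pool name appears literally in each relevant metric line.
filter_key_pools() {
  grep -v '^#' | grep -E 'async_delta_writer|memtable_flush|segment_replicate_sync'
}

# Usage (same placeholders as the command above):
# curl -s "http://<be_ip>:<be_http_port>/metrics" | grep "thread_pool" | filter_key_pools
```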

If segment_replicate_sync is the saturated pool, the skill recommends raising flush_thread_num_per_store from 2 to between 4 and 6. Its reasoning is that high‑concurrency Routine Load jobs can occupy every sync thread, so pending requests wait until the 30 s BRPC timeout fires, which matches the 30000 ms in the error.

Scenario 2 – "Version Does Not Exist" Error

The skill treats this as a typical FE deadlock signal (Case‑003) and directs the user to inspect the FE Report timestamp, then to capture the FE stack trace:

# Capture FE stack trace snapshots
jstack <fe_pid> > /tmp/fe_jstack_$(date +%s).log
grep -A 30 "ReportHandler" /tmp/fe_jstack_*.log

If the ReportHandler threads are stuck at LockManager.lock in TIMED_WAITING, the skill treats the deadlock as confirmed; the recommended action is to collect more dumps for later analysis and then restart the FE.
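Because a single dump only shows one instant, it can help to take a few snapshots and count the waiting threads in each. The sketch below is one way to do that; the function names, the default count of 3, and the 5 s interval are illustrative assumptions, not part of the skill.

```shell
# Take several FE stack snapshots a few seconds apart.
# Args: <fe_pid> [count] [interval_seconds]; defaults are illustrative only.
capture_fe_snapshots() {
  pid=$1; count=${2:-3}; interval=${3:-5}
  i=1
  while [ "$i" -le "$count" ]; do
    jstack "$pid" > "/tmp/fe_jstack_$(date +%s)_$i.log"
    if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    i=$((i + 1))
  done
}

# Count threads in one dump that are TIMED_WAITING and have a
# LockManager.lock frame. jstack separates threads with blank lines,
# so awk's paragraph mode (RS = "") treats each thread as one record.
count_lock_waiters() {
  awk 'BEGIN { RS = "" } /TIMED_WAITING/ && /LockManager\.lock/ { n++ } END { print n + 0 }' "$1"
}

# Usage:
# capture_fe_snapshots <fe_pid>
# for f in /tmp/fe_jstack_*.log; do echo "$f: $(count_lock_waiters "$f")"; done
```

A count that stays above zero across all snapshots is consistent with the stuck state described above; a count that drops to zero suggests ordinary lock contention instead.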

Scenario 3 – Compaction Backlog in a Shared‑Data (Storage‑Compute Separated) Cluster

In a cluster with 120 ingest BE nodes, 80 query BE nodes, and 10 compaction BE nodes, a P99 latency spike is traced to DataCache autoscaling. The skill runs:

# Detect DataCache autoscaling events
grep "autoscaling" be.WARNING

When disk usage exceeds 80 %, DataCache shrinks, so Compaction has to read from S3 instead of the local cache and slows down. For an architecture with dedicated Warehouses, the skill advises disabling DataCache autoscaling, setting num_partitioned_prefix to 100, and disabling global profiling.
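The 80 % threshold can be checked directly on a node with df. The following is a minimal sketch; parse_df_use_pct and exceeds_threshold are hypothetical helper names, and <datacache_dir> is a placeholder for whichever disk path holds the cache.

```shell
# Extract the usage percentage (the "Capacity" column) from `df -P` output
# on stdin. With -P each filesystem is reported on a single line.
parse_df_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed (exit 0) when usage strictly exceeds the threshold (default 80).
# Args: <usage_pct> [threshold_pct]
exceeds_threshold() {
  [ "$1" -gt "${2:-80}" ]
}

# Usage; <datacache_dir> is a placeholder:
# pct=$(df -P "<datacache_dir>" | parse_df_use_pct)
# exceeds_threshold "$pct" && echo "disk at ${pct}%: DataCache may shrink"
```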

Integrating with AI Assistants

Any AI tool that supports custom instructions, rules, or skills can import the repository. Example for Claude:

git clone https://github.com/StarRocks/starrocks-debug-skills.git

For Claude Code, place the skill directory containing SKILL.md under .claude/skills/, or reference SKILL.md in the system prompt. For Cursor, copy SKILL.md to .cursor/rules/; for Windsurf, add it to the workspace rules. The same file can also be pasted into any AI chat window to provide the full troubleshooting flow.
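For the Cursor case, the copy step can be scripted. This sketch assumes SKILL.md sits at the root of the cloned repository; install_skill_for_cursor is a hypothetical helper name.

```shell
# Copy SKILL.md from a cloned checkout into a project's .cursor/rules/ directory.
# Args: <repo_checkout_dir> <project_dir>
# Assumption: SKILL.md is at the top level of the checkout.
install_skill_for_cursor() {
  src="$1/SKILL.md"
  dest_dir="$2/.cursor/rules"
  if [ ! -f "$src" ]; then
    echo "SKILL.md not found in $1" >&2
    return 1
  fi
  mkdir -p "$dest_dir" && cp "$src" "$dest_dir/"
}

# Usage, run from the project root after the git clone above:
# install_skill_for_cursor ./starrocks-debug-skills .
```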

Contribution Guidelines

Contributors add new cases via CONTRIBUTING.md, following the template:

# Symptom
[What the user sees]

# Investigation
[Step‑by‑step commands and log patterns]

# Root Cause
[What actually broke and why]

# Resolution
[Short‑term mitigation + long‑term fix]

All data must be anonymized, cases should preferably be written in English, and every command must be verified. The goal is to capture actionable “investigation logic” rather than a simple record of what happened.
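Before submitting, a contributor could check that a case file contains the four template headings. The helper below is a hypothetical sketch, not a script shipped with the repository, and it checks only for the headings shown in the template above.

```shell
# Report which template headings are missing from a case file.
# Exit status is 0 only when all four headings are present.
# Args: <case_file>
check_case_template() {
  missing=0
  for heading in "# Symptom" "# Investigation" "# Root Cause" "# Resolution"; do
    if ! grep -q -F -- "$heading" "$1"; then
      echo "missing heading: $heading" >&2
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Usage; the path is a placeholder:
# check_case_template <path_to_new_case>.md && echo "template headings OK"
```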

Repository: https://github.com/StarRocks/starrocks-debug-skills

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

StarRocks

StarRocks is an open‑source project under the Linux Foundation, focused on building a high‑performance, scalable analytical database that enables enterprises to create an efficient, unified lake‑house paradigm. It is widely used across many industries worldwide, helping numerous companies enhance their data analytics capabilities.
