How VeDB Accelerates Fault Diagnosis with AI‑Powered RCA and Full‑SQL Insight
VeDB, ByteDance's cloud‑native distributed database, combines a four‑layer monitoring architecture, AI‑driven root‑cause analysis, full‑SQL insight, and per‑second metric collection to shorten mean time to repair (MTTR), improve reliability, and support massive, high‑concurrency workloads across diverse production scenarios.
Overview
VeDB is ByteDance's self‑developed cloud‑native distributed database that separates compute and storage, supports a primary‑replica architecture, elastic scaling, high availability, and full MySQL compatibility. It handles over 100,000 instances and petabytes of data internally, while offering DBaaS to external customers with availability guarantees of ≥99.95% in a single zone and ≥99.99% in multi‑zone deployments.
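To make the availability figures concrete, the short calculation below converts each percentage into a downtime budget. It is only a worked example: the yearly measurement period is an assumption (the article does not state the SLA window), and the function name is illustrative.

```python
# Assumption: availability is measured over a 365-day year.
# The article does not state the SLA measurement window.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def downtime_budget_minutes(availability_pct: float,
                            period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Maximum downtime permitted in the period at the given availability."""
    return period_minutes * (1 - availability_pct / 100)


single_zone = downtime_budget_minutes(99.95)  # single-zone guarantee
multi_zone = downtime_budget_minutes(99.99)   # multi-zone guarantee

print(f"single zone: {single_zone:.1f} min/year")  # about 262.8
print(f"multi zone:  {multi_zone:.1f} min/year")   # about 52.6
```

Under this assumption, moving from single‑zone to multi‑zone shrinks the allowed downtime from roughly 4.4 hours to under one hour per year.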
Diagnosis Process
The platform enforces a three‑step "MTTI‑MTTK‑MTTR" workflow to reduce mean time to repair. Problem discovery relies on alarm notifications, periodic inspections, and manual fault reports. Root‑cause localization uses four‑layer monitoring metrics, system logs, and real‑time state snapshots, combining manual analysis with an intelligent RCA engine. Fault mitigation depends on rapid detection and, for complex issues, fallback mechanisms to limit impact.
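The three phases can be measured per incident and averaged. The sketch below assumes that MTTI corresponds to discovery, MTTK to root‑cause localization, and the final phase to mitigation; that mapping, the `Incident` record, and its field names are illustrative assumptions, not VeDB's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime    # fault begins
    detected: datetime   # alarm, inspection, or manual report fires
    diagnosed: datetime  # root cause located
    mitigated: datetime  # impact removed


def phase_means(incidents):
    """Mean minutes spent in each phase across a list of incidents."""
    def avg(pairs):
        return mean((end - start).total_seconds() / 60 for start, end in pairs)

    return {
        "discovery": avg((i.started, i.detected) for i in incidents),
        "localization": avg((i.detected, i.diagnosed) for i in incidents),
        "mitigation": avg((i.diagnosed, i.mitigated) for i in incidents),
        "total": avg((i.started, i.mitigated) for i in incidents),
    }


t0 = datetime(2024, 1, 1, 12, 0)
m = lambda n: timedelta(minutes=n)
incidents = [
    Incident(t0, t0 + m(2), t0 + m(10), t0 + m(15)),
    Incident(t0, t0 + m(4), t0 + m(8), t0 + m(25)),
]
report = phase_means(incidents)
print(report)
```

Splitting the total this way shows which phase dominates repair time, which is the point of tracking the three steps separately.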
Problem Scenarios
Issues are classified by type (resource bottlenecks, latency spikes, system anomalies) and urgency (daily operations, customer‑critical protection, emergency). Resource bottlenecks cover CPU, memory, lock contention, network or I/O queues. Latency problems include overall slowdown or p99 spikes. System anomalies involve unexpected traffic distribution, failover behavior, or connection failures.
Diagnostic Techniques
Full‑SQL Insight records every SQL statement with configurable granularity (30 s by default, down to 1 s for critical paths) while keeping overhead under 5%. Logs are written to shared memory, pre‑fetched, encoded, and flushed asynchronously, which allows three‑year retention and terabyte‑scale storage per instance.
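One way to read "configurable granularity" is as the width of the time window over which statements are aggregated. The sketch below follows that reading, which is an assumption: it reduces each statement to a literal‑free template and accumulates counts and latency per (window, template) pair. The `fingerprint` and `aggregate` helpers are hypothetical and deliberately simplistic; they are not VeDB's collector.

```python
import re
from collections import defaultdict

# Simplistic literal matchers; a real SQL normalizer would use a parser.
_STRING = re.compile(r"'[^']*'|\"[^\"]*\"")
_NUMBER = re.compile(r"\b\d+(?:\.\d+)?\b")


def fingerprint(sql: str) -> str:
    """Replace literals with '?' and normalize case and whitespace."""
    sql = _STRING.sub("?", sql)
    sql = _NUMBER.sub("?", sql)
    return " ".join(sql.lower().split())


def aggregate(records, window_s=30):
    """Group (unix_ts, sql, latency_ms) records by time window and template."""
    buckets = defaultdict(lambda: {"count": 0, "total_ms": 0.0, "max_ms": 0.0})
    for ts, sql, latency_ms in records:
        window_start = int(ts // window_s) * window_s
        b = buckets[(window_start, fingerprint(sql))]
        b["count"] += 1
        b["total_ms"] += latency_ms
        b["max_ms"] = max(b["max_ms"], latency_ms)
    return dict(buckets)


records = [
    (100.0, "SELECT * FROM t1 WHERE id = 42", 2.0),
    (110.5, "select * from t1  where id = 7", 6.0),
    (125.0, "SELECT * FROM t1 WHERE id = 9", 4.0),
]
coarse = aggregate(records, window_s=30)  # default granularity
fine = aggregate(records, window_s=1)     # critical-path granularity
print(len(coarse), len(fine))
```

With the 30 s window the first two statements share a bucket; at 1 s each statement lands in its own bucket, which illustrates why finer granularity costs more storage.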
End‑to‑End SQL Latency Analysis tags each request with a global trace ID and collects per‑component timing (Proxy, DBEngine, logstore, pagestore). The trace makes it possible to pinpoint slow paths across distributed components, as illustrated with the example query below.
mysql> select * from (select avg(A.id), B.a from t1 A join t1 B on A.id = B.id+1 group by B.a) T where T.a > "a1";
Intelligent Alarm Diagnosis standardizes SOPs for each alarm scenario and automates them on the SpaceX diagnostic platform. The workflow gathers metrics, performs comparative calculations, and invokes a large‑model AI to infer root causes, producing both the diagnosis and the reasoning steps.
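The "comparative calculation" step of such an SOP can be pictured as a baseline‑versus‑current comparison whose result is handed to the model. The sketch below is a minimal illustration under assumptions: the metric names, the 1.5x deviation cutoff, and the prompt wording are all invented for the example, and the model call itself is omitted because the article does not describe its interface.

```python
def compare_metrics(baseline, current, min_ratio=1.5):
    """Return metrics whose current value is at least min_ratio times baseline,
    sorted by deviation (largest first)."""
    findings = []
    for name, base in baseline.items():
        now = current.get(name)
        if now is None or base <= 0:
            continue  # skip metrics with no reading or no usable baseline
        ratio = now / base
        if ratio >= min_ratio:
            findings.append((name, base, now, ratio))
    findings.sort(key=lambda f: f[3], reverse=True)
    return findings


def build_prompt(alarm, findings):
    """Assemble a text summary that could be passed to a model (hypothetical)."""
    lines = [f"Alarm: {alarm}", "Metrics deviating from baseline:"]
    lines += [f"- {n}: {b:g} -> {c:g} ({r:.1f}x)" for n, b, c, r in findings]
    lines.append("List the most likely root cause and your reasoning steps.")
    return "\n".join(lines)


# Hypothetical metric names and values.
baseline = {"cpu_util_pct": 40, "lock_wait_ms": 5, "qps": 1000}
current = {"cpu_util_pct": 48, "lock_wait_ms": 60, "qps": 2100}

findings = compare_metrics(baseline, current)
prompt = build_prompt("p99 latency spike", findings)
print(prompt)
```

Filtering and ranking before the model is invoked keeps the input small and puts the strongest signals first, which is one plausible reason to run deterministic comparisons ahead of the AI step.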
AI‑Enhanced Protection Monitoring adds periodic AI‑driven trend analysis for high‑risk customers (e.g., new cloud migrations or major promotions). Real‑time checks use tighter thresholds, while AI predicts future anomalies based on historical patterns.
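As a stand‑in for the AI trend analysis, which the article does not specify, the sketch below fits a least‑squares line to recent samples and estimates how many sampling intervals remain before a threshold is crossed. The linear model, the function names, and the threshold values are assumptions chosen to show the idea of pairing tighter thresholds with a forecast.

```python
def linear_fit(ys):
    """Least-squares slope and intercept for ys sampled at x = 0, 1, 2, ...
    Requires at least two samples."""
    n = len(ys)
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def steps_until_breach(ys, threshold):
    """Sampling intervals after the last observation until the fitted line
    reaches threshold; None if the trend is flat or falling."""
    slope, intercept = linear_fit(ys)
    if slope <= 0:
        return None
    x_cross = (threshold - intercept) / slope
    return max(0.0, x_cross - (len(ys) - 1))


disk_usage_pct = [50, 52, 54, 56, 58]  # hypothetical history, one sample per interval

normal = steps_until_breach(disk_usage_pct, threshold=80)     # standard threshold
protected = steps_until_breach(disk_usage_pct, threshold=70)  # tighter, for high-risk customers
print(normal, protected)
```

The tighter threshold moves the predicted breach earlier, which is the practical effect of applying stricter limits to customers under protection.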
Second‑Level Metric Monitoring introduces 1‑second and 5‑second sampling for critical metrics, layered on top of the existing minute‑level collection. Tiered activation ensures minimal performance impact while providing the granularity needed for extreme degradation scenarios.
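Tiered activation can be sketched as a scheduler that decides, for each one‑second tick, which metrics are due. The tier names, the metric‑to‑tier assignment, and the idea of enabling fine tiers only for selected instances are assumptions made for illustration; only the 1 s, 5 s, and minute intervals come from the text.

```python
# Sampling interval in seconds for each tier (intervals from the article,
# tier names are hypothetical).
TIERS = {"second": 1, "five_second": 5, "minute": 60}


def due_metrics(tick_s, metric_tiers, enabled_tiers):
    """Names of metrics to sample at this tick, given the enabled tiers."""
    return sorted(
        name
        for name, tier in metric_tiers.items()
        if tier in enabled_tiers and tick_s % TIERS[tier] == 0
    )


# Hypothetical assignment of metrics to tiers.
metric_tiers = {
    "qps": "second",
    "p99_latency_ms": "second",
    "buffer_pool_hit": "five_second",
    "disk_usage": "minute",
}

baseline_only = {"minute"}                        # default for most instances
escalated = {"second", "five_second", "minute"}   # instance under investigation

print(due_metrics(5, metric_tiers, baseline_only))   # nothing due
print(due_metrics(5, metric_tiers, escalated))       # 1 s and 5 s metrics
print(due_metrics(60, metric_tiers, baseline_only))  # minute-level metric
```

Keeping the fine tiers off by default means most instances pay only the minute‑level cost, while an escalated instance gets per‑second detail.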
Conclusion
The article outlines VeDB's comprehensive diagnostic ecosystem: a structured discovery‑localization‑mitigation pipeline, scenario‑based classification, full‑SQL and trace‑based latency tools, AI‑augmented alarm handling, and fine‑grained monitoring. Together these techniques achieve rapid, accurate fault isolation and support the reliability demands of massive, latency‑sensitive services.
