How HUATUO Revolutionizes Cloud‑Native Observability with Zero‑Impact BPF Tracing
HUATUO, Didi's open‑source cloud‑native observability project, combines low‑overhead BPF‑based kernel tracing, unified metric and event frameworks, automatic flame‑graph generation, and seamless integration with Prometheus, Grafana, and Elasticsearch to deliver panoramic, zero‑intrusion monitoring and continuous performance profiling for complex production environments.
Introduction
On August 2, Didi announced that its open‑source cloud‑native observability project HUATUO has joined the China Computer Federation (CCF) as a key incubation project, reflecting Didi’s commitment to open‑source collaboration and aiming to standardize observability infrastructure for cloud‑native operating systems.
Project Overview
HUATUO (named after the ancient physician Hua Tuo) is a Didi‑developed, open‑source project that tackles missing fault‑site information, difficulty reproducing issues, and high diagnostic costs in cloud‑native environments. It uses BPF‑based dynamic tracing (kprobe, tracepoint, ftrace) to provide low‑overhead, zero‑intrusion, multi‑dimensional kernel observation, including fine‑grained metrics, abnormal context capture, automatic system‑spike tracing, and multi‑language continuous profiling. It is already deployed at scale in Didi’s production systems.
Technical Architecture
The system consists of a low‑level data collector and an apiserver. The collector aggregates kernel events via BPF, maps them to container/pod metadata, and forwards unified observations to the apiserver, which exposes Prometheus‑compatible metrics and provides APIs for custom extensions.
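To make the collector‑to‑apiserver flow concrete, here is a minimal Go sketch of the enrichment step described above: a kernel event tagged with a cgroup ID is resolved to container metadata and rendered in the Prometheus text exposition format. All type and metric names here (`KernelEvent`, `huatuo_softirq_latency_us`, the metadata map) are illustrative assumptions, not HUATUO's actual API.

```go
package main

import "fmt"

// KernelEvent is a hypothetical, simplified view of what the BPF
// collector emits: a metric name, a value, and the cgroup it came from.
type KernelEvent struct {
	Metric   string
	Value    float64
	CgroupID uint64
}

// containerMeta stands in for the metadata the collector resolves from
// the container runtime, keyed by cgroup ID.
var containerMeta = map[uint64]map[string]string{
	42: {"pod": "web-7f9c", "container": "nginx"},
}

// toPrometheusLine enriches a kernel event with container labels and
// renders it in the Prometheus text exposition format, the shape the
// apiserver would ultimately expose on /metrics.
func toPrometheusLine(ev KernelEvent) string {
	meta := containerMeta[ev.CgroupID]
	return fmt.Sprintf(`%s{pod=%q,container=%q} %g`,
		ev.Metric, meta["pod"], meta["container"], ev.Value)
}

func main() {
	ev := KernelEvent{Metric: "huatuo_softirq_latency_us", Value: 130, CgroupID: 42}
	fmt.Println(toPrometheusLine(ev))
	// prints: huatuo_softirq_latency_us{pod="web-7f9c",container="nginx"} 130
}
```

The key design point is that correlation happens once, in the collector, so every downstream consumer (Prometheus, Grafana dashboards, Elasticsearch queries) sees kernel data already keyed by pod and container.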
Core Features
Unified metric framework: compatible with Prometheus, exposing a simple Go interface for developers to add new metrics.
Event framework: lightweight kernel‑event handling with a minimal custom interface for extending event types.
Task tracing management: manages the full lifecycle of tracing tasks issued by AutoTracing and the apiserver.
Unified storage interface: supports local filesystem, Elasticsearch, and Amazon S3, abstracting storage details from developers.
Re‑designed BPF management: uses BPF object files to decouple business logic from kernel implementation.
Flame‑graph generation: produces flame graphs for CPU, memory, network I/O, and other performance bottlenecks.
Kernel‑container correlation: links kernel data structures with container information via cgroup ID, CSS, and container ID.
Low‑Loss Kernel Panoramic Observation
Traditional procfs metrics are coarse‑grained and cannot answer “who, where, why”. HUATUO inserts fine‑grained probes in slow‑path and exception‑path code, captures critical context, and implements a rate‑limiting mechanism to keep overhead below 1% while providing dynamic degradation control for high‑frequency anomalies.
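The rate‑limiting idea above can be sketched as a token bucket: during an anomaly storm, only the first few events per window are captured in full, which is how overhead stays bounded regardless of event frequency. This is a minimal illustration of the mechanism, not HUATUO's actual limiter.

```go
package main

import "fmt"

// tokenBucket is a minimal sketch of the kind of rate limiter the text
// describes: it admits at most `burst` events per refill window, so a
// storm of identical anomalies cannot flood the event pipeline.
type tokenBucket struct {
	tokens, burst int
}

// Allow consumes a token if one is available.
func (b *tokenBucket) Allow() bool {
	if b.tokens == 0 {
		return false
	}
	b.tokens--
	return true
}

// Refill restores the bucket at the start of each window.
func (b *tokenBucket) Refill() { b.tokens = b.burst }

func main() {
	b := &tokenBucket{tokens: 3, burst: 3}
	admitted := 0
	for i := 0; i < 10; i++ { // ten anomalies arrive in one window
		if b.Allow() {
			admitted++
		}
	}
	fmt.Println(admitted) // only the first 3 are captured in full
}
```

Dynamic degradation then amounts to shrinking `burst` (or widening the window) when a probe fires too often, which trades detail for bounded cost.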
Exception‑Driven Diagnostics
HUATUO focuses on kernel process and interrupt contexts, tracking events such as network, scheduling, I/O, and memory anomalies. When critical events like page faults, scheduling delays, or lock contention occur, the system automatically captures registers, stack traces, and resource usage to generate a diagnostic graph.
Fully Automated Tracing (AutoTracing)
AutoTracing addresses sudden spikes across many dimensions of system and business metrics (e.g., cpusys, cpuidle, IO, load‑avg). It automatically captures call stacks and generates flame graphs, using heuristic algorithms to pinpoint the root causes of performance spikes.
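One simple form such a heuristic can take is comparing a sample against its trailing baseline and triggering trace capture only on a large deviation. The function below is a deliberately minimal illustration of that idea, not AutoTracing's actual detector.

```go
package main

import "fmt"

// isSpike flags a sample as a spike when it exceeds the trailing mean
// by a multiplicative factor. A detector like this would gate the
// expensive step: capturing stacks and rendering a flame graph.
func isSpike(history []float64, sample, factor float64) bool {
	if len(history) == 0 {
		return false
	}
	var sum float64
	for _, v := range history {
		sum += v
	}
	mean := sum / float64(len(history))
	return sample > mean*factor
}

func main() {
	cpusys := []float64{5, 6, 5, 7, 6}  // percent, steady baseline
	fmt.Println(isSpike(cpusys, 30, 3)) // 30% vs. mean 5.8 -> spike
	fmt.Println(isSpike(cpusys, 8, 3))  // within normal range
}
```

The gating matters because flame‑graph capture is comparatively expensive; running it only when a heuristic fires is what makes "fully automated" tracing viable in production.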
Continuous Performance Profiling
HUATUO provides a language‑agnostic, zero‑intrusion profiling stack that continuously records CPU, memory, I/O, lock contention, and other resources across processes, threads, containers, and kernel subsystems, enabling long‑term performance analysis and regression detection.
Open‑Source Ecosystem Integration
HUATUO seamlessly integrates with mainstream observability stacks such as Prometheus, Grafana, Pyroscope, and Elasticsearch, supporting both bare‑metal and cloud‑native deployments. It automatically discovers Kubernetes resources, tags, and annotations, correlating kernel events with higher‑level metrics to eliminate data silos.
Deployment Scenarios
HUATUO is used in Didi’s ride‑hailing core services for full‑link stress testing, fire‑drill simulations, holiday traffic surges, cross‑cluster performance profiling, pre‑release gray‑box testing, and daily fault isolation, significantly reducing manual troubleshooting effort.
Future Outlook
In collaboration with CCF, Didi will accelerate HUATUO’s iteration, focusing on kernel‑level performance analysis and distributed tracing, delivering a zero‑intrusive, programmable monitoring solution for various industries while fostering an open‑source research‑industry ecosystem.