
How HUATUO Revolutionizes Cloud‑Native Observability with Zero‑Impact BPF Tracing

HUATUO, Didi's open‑source cloud‑native observability project, combines low‑overhead BPF‑based kernel tracing, unified metric and event frameworks, and automatic flame‑graph generation. It integrates with Prometheus, Grafana, and Elasticsearch to provide panoramic, non‑intrusive monitoring and continuous performance profiling for complex production environments.

Didi Tech

Introduction

On August 2, Didi announced that HUATUO, its open‑source cloud‑native observability project, had joined the China Computer Federation (CCF) as a key incubation project. The move reflects Didi’s commitment to open‑source collaboration and aims to standardize observability infrastructure for cloud‑native operating systems.

Project Overview

HUATUO (named after the ancient physician Hua Tuo) is an open‑source project developed at Didi. It tackles three recurring problems in cloud‑native environments: missing fault‑site information, issues that are hard to reproduce, and high diagnostic cost. It combines BPF with kernel dynamic‑tracing mechanisms (kprobe, tracepoint, ftrace) to provide low‑overhead, zero‑intrusion, multi‑dimensional kernel observation, including fine‑grained metrics, capture of abnormal context, automatic tracing of system spikes, and multi‑language continuous profiling. It is already deployed at scale in Didi’s production systems.

Technical Architecture

The system consists of a low‑level data collector and an apiserver. The collector aggregates kernel events via BPF, maps them to container/pod metadata, and forwards unified observations to the apiserver, which exposes Prometheus‑compatible metrics and provides APIs for custom extensions.
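The collector's enrichment step can be pictured with a minimal Go sketch. It is not HUATUO's actual code: the `KernelEvent`, `ContainerMeta`, and `Enricher` types and the metric name are hypothetical, and the sketch assumes events arrive tagged with a cgroup ID. It only shows how a kernel‑side sample might be joined with pod metadata and rendered as one line of the Prometheus text exposition format.

```go
package main

import (
	"fmt"
	"strconv"
)

// KernelEvent is a hypothetical record produced by the BPF side of a collector.
type KernelEvent struct {
	CgroupID uint64  // cgroup the task belonged to when the event fired
	Metric   string  // metric name chosen by the collector
	Value    float64 // observed value
}

// ContainerMeta is hypothetical pod/container metadata known to the collector.
type ContainerMeta struct {
	Pod       string
	Container string
}

// Enricher maps cgroup IDs to container metadata.
type Enricher struct {
	byCgroup map[uint64]ContainerMeta
}

// NewEnricher returns an Enricher with an empty lookup table.
func NewEnricher() *Enricher {
	return &Enricher{byCgroup: make(map[uint64]ContainerMeta)}
}

// Register records which container owns a cgroup ID.
func (e *Enricher) Register(cgroupID uint64, meta ContainerMeta) {
	e.byCgroup[cgroupID] = meta
}

// Render joins an event with its metadata and returns one sample line in
// the Prometheus text exposition format. Events from unregistered cgroups
// are labeled "unknown" instead of being dropped.
func (e *Enricher) Render(ev KernelEvent) string {
	meta, ok := e.byCgroup[ev.CgroupID]
	if !ok {
		meta = ContainerMeta{Pod: "unknown", Container: "unknown"}
	}
	return fmt.Sprintf("%s{pod=%q,container=%q} %s",
		ev.Metric, meta.Pod, meta.Container,
		strconv.FormatFloat(ev.Value, 'g', -1, 64))
}
```

A caller would `Register` each container as it is discovered and then pass every incoming event through `Render` before serving the result on a metrics endpoint.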

Core Features

Unified metric framework: compatible with Prometheus, and exposes a simple Go interface that developers use to add new metrics.

Event framework: lightweight kernel‑event handling, with a minimal interface for adding custom event types.

Tracing task management: manages the full lifecycle of tracing tasks issued by AutoTracing and the apiserver.

Unified storage interface: supports the local filesystem, Elasticsearch, and Amazon S3, hiding storage details from developers.

Redesigned BPF management: uses BPF object files to decouple business logic from the kernel‑side implementation.

Flame‑graph generation: produces flame graphs for CPU, memory, network I/O, and other performance bottlenecks.

Kernel‑container correlation: links kernel data structures to container information via cgroup ID, CSS, and container ID.
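To illustrate what a unified storage interface buys the caller, here is a minimal Go sketch. The `Storage` interface, the two backends, and `SaveEvent` are assumptions made for illustration and are not taken from HUATUO's source. An Elasticsearch or S3 backend would implement the same one‑method interface.

```go
package main

import (
	"os"
	"path/filepath"
)

// Storage is a hypothetical backend-neutral interface: callers hand over a
// named document and never see where or how it is persisted.
type Storage interface {
	Save(name string, doc []byte) error
}

// LocalStorage writes each document as a file under Dir.
type LocalStorage struct{ Dir string }

// Save creates Dir if needed and writes the document. Only the base name
// is used, so a name cannot escape Dir through path separators.
func (s LocalStorage) Save(name string, doc []byte) error {
	if err := os.MkdirAll(s.Dir, 0o755); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(s.Dir, filepath.Base(name)), doc, 0o644)
}

// MemoryStorage keeps documents in a map, which is convenient in tests.
type MemoryStorage struct{ Docs map[string][]byte }

// Save stores a copy of the document under its name.
func (s *MemoryStorage) Save(name string, doc []byte) error {
	if s.Docs == nil {
		s.Docs = make(map[string][]byte)
	}
	s.Docs[name] = append([]byte(nil), doc...)
	return nil
}

// SaveEvent is caller code that depends only on the interface, so the
// backend can be swapped without touching it.
func SaveEvent(st Storage, id string, payload []byte) error {
	return st.Save(id+".json", payload)
}
```

The design choice shown here is that event producers call `SaveEvent` with whatever `Storage` was configured at startup, so adding a backend does not change producer code.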

Low‑Overhead Panoramic Kernel Observation

Traditional procfs metrics are coarse‑grained and cannot answer “who, where, why”. HUATUO inserts fine‑grained probes on kernel slow paths and exception paths, captures the critical context at each probe, and rate‑limits event reporting to keep overhead below 1%. For high‑frequency anomalies it also provides dynamic degradation control.
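The source does not describe how HUATUO's rate limiting is implemented. As one common way to bound event volume, the following Go sketch uses a token bucket driven by nanosecond timestamps (the unit kernel clocks typically report). All names and parameters are hypothetical.

```go
package main

// TokenBucket is a minimal rate limiter: tokens refill at a fixed rate up
// to a burst ceiling, and each emitted event consumes one token.
type TokenBucket struct {
	ratePerSec float64 // tokens added per second
	burst      float64 // maximum number of stored tokens
	tokens     float64 // tokens currently available
	lastNs     int64   // timestamp of the last refill, in nanoseconds
	Dropped    uint64  // events rejected so far
}

// NewTokenBucket returns a bucket that starts full at time nowNs.
func NewTokenBucket(ratePerSec, burst float64, nowNs int64) *TokenBucket {
	return &TokenBucket{ratePerSec: ratePerSec, burst: burst, tokens: burst, lastNs: nowNs}
}

// Allow reports whether one event may be emitted at time nowNs.
// Rejected events are counted in Dropped.
func (b *TokenBucket) Allow(nowNs int64) bool {
	if elapsed := nowNs - b.lastNs; elapsed > 0 {
		b.tokens += float64(elapsed) / 1e9 * b.ratePerSec
		if b.tokens > b.burst {
			b.tokens = b.burst
		}
		b.lastNs = nowNs
	}
	if b.tokens < 1 {
		b.Dropped++
		return false
	}
	b.tokens--
	return true
}
```

A degradation policy could watch `Dropped`: if it grows quickly, the collector might lower the rate or disable the noisy probe for a while. That policy is left out of the sketch.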

Exception‑Driven Diagnostics

HUATUO focuses on kernel process and interrupt contexts, tracking events such as network, scheduling, I/O, and memory anomalies. When critical events like page faults, scheduling delays, or lock contention occur, the system automatically captures registers, stack traces, and resource usage to generate a diagnostic graph.
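On the user‑space side, captured context usually arrives as a fixed‑layout binary record written by the BPF program. The Go sketch below decodes such a record with the standard `encoding/binary` package. The `SchedDelayEvent` layout is hypothetical and chosen only for illustration, and little‑endian byte order is assumed. The 16‑byte command field matches the kernel's `TASK_COMM_LEN`.

```go
package main

import (
	"bytes"
	"encoding/binary"
)

// SchedDelayEvent mirrors a hypothetical C struct emitted by a BPF program
// when a task waits too long on the run queue. Field order and sizes must
// match the kernel-side definition exactly.
type SchedDelayEvent struct {
	Pid     uint32
	CPU     uint32
	DelayNs uint64
	Comm    [16]byte // NUL-padded task name (TASK_COMM_LEN is 16)
}

// Decode parses one raw record. It returns an error if the buffer is
// shorter than the struct.
func Decode(raw []byte) (SchedDelayEvent, error) {
	var ev SchedDelayEvent
	err := binary.Read(bytes.NewReader(raw), binary.LittleEndian, &ev)
	return ev, err
}

// Command returns the task name without its NUL padding.
func (e SchedDelayEvent) Command() string {
	n := bytes.IndexByte(e.Comm[:], 0)
	if n < 0 {
		n = len(e.Comm)
	}
	return string(e.Comm[:n])
}
```

Stack traces and register snapshots would travel in additional fields or in a separate map. They are omitted here to keep the layout small.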

Fully Automated Tracing (AutoTracing)

AutoTracing addresses spikes in system and business metrics (e.g., cpusys, cpuidle, I/O, load average). When a spike occurs, it automatically captures call stacks and generates flame graphs, and applies heuristic algorithms to help pinpoint the root cause.
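The article does not specify AutoTracing's heuristics. As a stand‑in, the Go sketch below shows one simple trigger: flag a sample as a spike when it exceeds the recent mean by both `k` standard deviations and a minimum absolute margin. The `SpikeDetector` type and its parameters are hypothetical. A real system would start stack capture when `Observe` returns true.

```go
package main

import "math"

// SpikeDetector keeps a sliding window of recent samples and flags values
// that stand far above the window's baseline.
type SpikeDetector struct {
	window   []float64
	size     int     // number of samples that form the baseline
	k        float64 // required distance from the mean, in standard deviations
	minDelta float64 // required absolute distance from the mean
}

// NewSpikeDetector returns a detector with an empty window.
func NewSpikeDetector(size int, k, minDelta float64) *SpikeDetector {
	return &SpikeDetector{size: size, k: k, minDelta: minDelta}
}

// Observe reports whether v is a spike relative to the current window,
// then adds v to the window. Nothing is flagged until the window is full.
func (d *SpikeDetector) Observe(v float64) bool {
	spike := false
	if len(d.window) == d.size {
		mean, std := meanStd(d.window)
		spike = v > mean+math.Max(d.k*std, d.minDelta)
	}
	d.window = append(d.window, v)
	if len(d.window) > d.size {
		d.window = d.window[1:]
	}
	return spike
}

// meanStd returns the mean and population standard deviation of xs.
func meanStd(xs []float64) (float64, float64) {
	var sum float64
	for _, x := range xs {
		sum += x
	}
	mean := sum / float64(len(xs))
	var sq float64
	for _, x := range xs {
		sq += (x - mean) * (x - mean)
	}
	return mean, math.Sqrt(sq / float64(len(xs)))
}
```

The absolute margin keeps a perfectly flat baseline (standard deviation of zero) from flagging every tiny increase.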

Continuous Performance Profiling

HUATUO provides a language‑agnostic, zero‑intrusion profiling stack that continuously records CPU, memory, I/O, lock contention, and other resources across processes, threads, containers, and kernel subsystems, enabling long‑term performance analysis and regression detection.
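Continuous profilers typically reduce sampled call stacks to the folded‑stack text format that flame‑graph tools consume: frames joined by semicolons, followed by a sample count. The Go sketch below shows that aggregation step. The `Fold` function is illustrative and assumes stacks are already symbolized and ordered root first.

```go
package main

import (
	"sort"
	"strconv"
	"strings"
)

// Fold aggregates sampled stacks (root frame first) into folded-stack
// lines of the form "root;child;leaf count". Lines are sorted so the
// output is deterministic. Empty stacks are skipped.
func Fold(samples [][]string) []string {
	counts := make(map[string]int)
	for _, stack := range samples {
		if len(stack) == 0 {
			continue
		}
		counts[strings.Join(stack, ";")]++
	}
	lines := make([]string, 0, len(counts))
	for key, n := range counts {
		lines = append(lines, key+" "+strconv.Itoa(n))
	}
	sort.Strings(lines)
	return lines
}
```

Storing folded lines per time window, rather than raw samples, keeps long‑term storage compact and makes it cheap to diff two windows when looking for a regression.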

Open‑Source Ecosystem Integration

HUATUO seamlessly integrates with mainstream observability stacks such as Prometheus, Grafana, Pyroscope, and Elasticsearch, supporting both bare‑metal and cloud‑native deployments. It automatically discovers Kubernetes resources, tags, and annotations, correlating kernel events with higher‑level metrics to eliminate data silos.

Deployment Scenarios

HUATUO is used in Didi’s ride‑hailing core services for link‑stress testing, fire‑drill simulations, holiday traffic surges, performance profiling across clusters, pre‑release gray‑box testing, and daily fault isolation, significantly reducing manual troubleshooting effort.

Future Outlook

In collaboration with CCF, Didi will accelerate HUATUO’s iteration, focusing on kernel‑level performance analysis and distributed tracing. The goal is a zero‑intrusion, programmable monitoring solution for a range of industries, together with an open‑source ecosystem that connects research and industry.
