DeeTune: Baidu’s eBPF‑Based Cloud‑Native Network Framework for Service Topology, Traffic Recording, and Non‑Intrusive Monitoring
DeeTune is Baidu’s eBPF‑based cloud‑native network framework that automatically builds complete service topologies, records configurable inter‑service traffic, and provides non‑intrusive metric monitoring with minimal CPU and memory overhead, enabling efficient fault localization and performance analysis across heterogeneous PaaS and container environments.
With the continuous evolution of cloud computing, Baidu’s internal services have gradually migrated to cloud environments, with clear cost and efficiency benefits. However, gaps in observability have emerged, and traditional approaches that rely on code injection face challenges such as business intrusion, coordination overhead, performance impact, and stability risks. This article introduces DeeTune, a Baidu‑built network framework based on eBPF, which provides service topology construction, traffic recording, and non‑intrusive metric monitoring, thereby improving SRE and quality‑assurance efficiency.
Because Baidu’s micro‑service ecosystem is large and continuously growing, the dependency relationships among services are extremely complex. A global service topology is essential for visualizing call relationships, attaching monitoring information, and supporting fault‑localization, stability assurance, and infrastructure planning. Traditional SDK‑ or framework‑based methods require invasive code changes and struggle with multi‑technology‑stack environments.
eBPF (extended Berkeley Packet Filter) is a kernel‑level programmable engine that is independent of user‑space stacks. It offers a safe, stable API, high execution efficiency through JIT compilation, hot‑loading without kernel reboot, and data exchange via maps. These characteristics make eBPF an ideal solution for non‑intrusive observability, tracing, security, and high‑performance networking.
DeeTune consists of five subsystems. Agent: deployed as a host agent, it loads eBPF programs that monitor process creation, TCP connections, and socket I/O, and generates topology, metric, and trace data. Server: an independent OpenTelemetry Collector that receives, parses, and stores observability data. Storage: dedicated storage for topology, recorded traffic, and trace information. CProm: Baidu’s internal Prometheus/Grafana integration for large‑scale metric queries. API & Web UI: an OpenAPI and a visual interface through which users access and operate the platform.
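To make the Agent's topology output concrete, the sketch below folds TCP connection events into caller/callee edges. It is a minimal illustration, not DeeTune's implementation: the `ConnEvent` shape, the two lookup tables, and the service names are all hypothetical, and in a real agent the events would arrive from eBPF programs rather than from a list.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnEvent:
    """One outbound TCP connect seen on a host (hypothetical event shape)."""
    src_pid: int
    dst_ip: str
    dst_port: int


def build_topology(events, pid_to_service, endpoint_to_service):
    """Aggregate connection events into (caller, callee) -> call count."""
    edges = Counter()
    for ev in events:
        # Resolve the local process and the remote endpoint to service names.
        caller = pid_to_service.get(ev.src_pid, "unknown")
        callee = endpoint_to_service.get((ev.dst_ip, ev.dst_port), "external")
        edges[(caller, callee)] += 1
    return edges


# Illustrative data only; none of these names come from the article.
events = [
    ConnEvent(101, "10.0.0.2", 3306),
    ConnEvent(101, "10.0.0.2", 3306),
    ConnEvent(202, "10.0.0.3", 6379),
    ConnEvent(101, "8.8.8.8", 53),
]
pid_to_service = {101: "order-api", 202: "cart-api"}
endpoint_to_service = {
    ("10.0.0.2", 3306): "mysql-main",
    ("10.0.0.3", 6379): "redis-cache",
}
topology = build_topology(events, pid_to_service, endpoint_to_service)
```

Keeping the edge key as a plain tuple makes it straightforward to attach metrics to an edge later, or to filter the graph down to the two services a recording task cares about.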
Implementing DeeTune in Baidu’s heterogeneous environment required solving several difficulties: supporting multiple PaaS platforms, three container types (Matrix, Container, Docker), and both x86 and ARM64 CPU architectures; keeping the Agent’s resource consumption low (optimized to about 1.3 CPU cores and 1 GiB of memory in production); and handling high event rates (tens of thousands to hundreds of thousands of kernel events per second) with an average eBPF processing overhead of about 30 µs per event, which is negligible for most services.
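The event-rate and per-event figures above can be turned into a rough in-kernel cost estimate. The arithmetic below assumes the 30 µs average is CPU time spent per event, which the article does not state explicitly, and the 64-core host is a hypothetical example. eBPF programs run on whichever CPU raised the event, so this cost is spread across the host's cores rather than concentrated in the Agent process.

```python
def ebpf_cpu_cores(events_per_second: float, cost_us_per_event: float = 30.0) -> float:
    """Core-seconds per second spent in eBPF handlers (rate x per-event cost)."""
    return events_per_second * cost_us_per_event / 1_000_000


def host_share(events_per_second: float, host_cores: int,
               cost_us_per_event: float = 30.0) -> float:
    """Fraction of the whole host's CPU capacity used by eBPF handlers."""
    return ebpf_cpu_cores(events_per_second, cost_us_per_event) / host_cores


low = ebpf_cpu_cores(10_000)         # 10,000 events/s
high = ebpf_cpu_cores(100_000)       # 100,000 events/s
share = host_share(100_000, 64)      # same load on an assumed 64-core host
```

Under these assumptions the low end of the range costs about 0.3 of a core and the high end about 3 cores in total, or just under 5% of the assumed 64-core host.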
The framework delivers three core capabilities. Service Topology: accurate, complete topology data, exposed via OpenAPI, that supports fault localization, dependency analysis, and cross‑region call tracing. Traffic Recording: configurable recording tasks based on the topology, allowing selective capture of traffic between any two services with policies on duration, count, and interfaces. Metric Monitoring: collection of host and container resource metrics (CPU, memory) as well as deep network and process metrics (active/failed connections, retransmissions, protocol‑level statistics), enabling operators to diagnose resource‑related issues efficiently.
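A recording task with duration, count, and interface policies can be modeled roughly as follows. This is a sketch under assumptions: the `RecordingTask` class, its field names, and the idea of an interface as a URL path are hypothetical and do not reflect DeeTune's actual task schema.

```python
from dataclasses import dataclass


@dataclass
class RecordingTask:
    """Capture traffic on one topology edge, bounded by policy (hypothetical)."""
    src_service: str
    dst_service: str
    max_seconds: float          # duration policy
    max_count: int              # count policy
    interfaces: frozenset       # interface policy; empty set means "all"
    started_at: float = 0.0
    captured: int = 0

    def try_record(self, src: str, dst: str, interface: str, now: float) -> bool:
        """Return True and count the request if every policy allows capture."""
        if (src, dst) != (self.src_service, self.dst_service):
            return False        # not the edge this task targets
        if self.interfaces and interface not in self.interfaces:
            return False        # interface not selected
        if now - self.started_at > self.max_seconds:
            return False        # task has expired
        if self.captured >= self.max_count:
            return False        # quota already reached
        self.captured += 1
        return True


task = RecordingTask("order-api", "pay-api", max_seconds=60, max_count=2,
                     interfaces=frozenset({"/pay"}))
results = [
    task.try_record("order-api", "pay-api", "/pay", now=1.0),     # recorded
    task.try_record("order-api", "pay-api", "/refund", now=2.0),  # wrong interface
    task.try_record("cart-api", "pay-api", "/pay", now=3.0),      # wrong edge
    task.try_record("order-api", "pay-api", "/pay", now=4.0),     # recorded
    task.try_record("order-api", "pay-api", "/pay", now=5.0),     # over quota
]
```

Checking the edge first keeps the common case cheap: most traffic on a host does not belong to any active recording task and is rejected after a single tuple comparison.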
DeeTune draws on established eBPF practice across the industry, including Facebook’s Katran load balancer, Netflix’s extensive eBPF tracing, Google’s use of Cilium in GKE, Apple’s use of Falco for kernel security monitoring, AWS’s RPC observability, Alibaba’s Terway and iLogtail enhancements, as well as open‑source projects such as bpftrace, BCC, Cilium, DeepFlow, and Coroot.
Future work will extend protocol support beyond HTTP/1, Redis, and MySQL to gRPC, bRPC, HTTP/2, and other internal protocols, and will deepen integration with deployment platforms, monitoring systems, CI pipelines, and quality‑efficiency tools to further streamline issue detection and resolution.
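Protocol support of this kind usually begins with guessing the protocol from the first bytes of a socket payload. The sketch below shows one common heuristic for the three protocols named above; it is an assumption about how such detection can work, not DeeTune's actual parser, and real detectors add more checks to reduce false positives.

```python
HTTP_METHODS = (b"GET ", b"POST ", b"PUT ", b"DELETE ",
                b"HEAD ", b"OPTIONS ", b"PATCH ")
RESP_PREFIXES = b"+-:$*"   # Redis RESP type markers


def guess_protocol(payload: bytes) -> str:
    """Best-effort protocol guess from the leading bytes of one payload."""
    # HTTP/1.x: requests start with a method, responses with the version.
    if payload.startswith(HTTP_METHODS) or payload.startswith(b"HTTP/1."):
        return "http1"
    # Redis RESP: a type marker byte followed by CRLF-terminated fields.
    if len(payload) >= 3 and payload[0] in RESP_PREFIXES and b"\r\n" in payload:
        return "redis"
    # MySQL: 3-byte little-endian body length, then a 1-byte sequence id.
    if len(payload) >= 5 and int.from_bytes(payload[:3], "little") == len(payload) - 4:
        return "mysql"
    return "unknown"


http_req = b"GET /health HTTP/1.1\r\nHost: a\r\n\r\n"
redis_ping = b"*1\r\n$4\r\nPING\r\n"
mysql_body = b"\x03SELECT 1"                      # COM_QUERY command byte + SQL text
mysql_query = len(mysql_body).to_bytes(3, "little") + b"\x00" + mysql_body
guesses = [guess_protocol(p) for p in (http_req, redis_ping, mysql_query, b"")]
```

The checks run from most to least distinctive. The MySQL test matches only a payload that holds exactly one complete packet, so it will miss captures that are truncated or that batch several packets.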
Baidu Geek Talk