How eBPF Powers Next‑Gen Observability and Fault Diagnosis in Kubernetes
At KubeCon China 2023, experts Liu Kai and Dong Shandong presented a three‑part deep dive into Kubernetes observability challenges, demonstrating how eBPF enables comprehensive data collection across all stack layers, seamless integration, and intelligent root‑cause analysis through dimension attribution, anomaly bounding, and fault‑tree methods.
Observability Challenges in Kubernetes
Analyzing more than 1,000 Kubernetes support tickets, the presenters identified three major challenges: infrastructure issues (network problems alone account for more than 56% of incidents), the need to collect data from multiple layers of the stack, and the difficulty of turning the collected data into actionable troubleshooting.
Data Collection with eBPF
eBPF is a virtual machine that runs inside the Linux kernel and allows custom logic to be loaded without recompiling the kernel or restarting applications. By writing eBPF programs, compiling them to bytecode, and attaching them via the bpf() system call to various hook points (system calls, kernel functions, or user‑space code), developers can gather runtime information with zero intrusion, high performance, and strong safety guarantees.
Key characteristics of eBPF:
Non‑intrusive: dynamic attachment without process restart.
High performance: JIT‑compiled to native code.
Secure: runs in a sandbox verified by a strict verifier.
Architecture Awareness
Using eBPF, the team built an “architecture awareness” capability that automatically discovers the service topology, runtime status, and network flows of a Kubernetes cluster. By instrumenting kernel functions such as `netif_receive_skb` and `dev_queue_xmit`, they can count packets, measure sizes, and assess network quality (e.g., packet loss, retransmissions).
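The user‑space side of such probes typically aggregates per‑link counters into quality indicators. Here is a minimal, hedged sketch (not the presenters' code) of how packet events exported by kprobes on `netif_receive_skb` / `dev_queue_xmit` might be rolled up into a retransmission rate; all events below are synthetic:

```python
# Illustrative sketch: aggregate per-link packet counters that kernel probes
# might export, and derive a simple network-quality indicator.
from collections import defaultdict


class LinkStats:
    """Counters for one (source, destination) service link."""

    def __init__(self):
        self.tx_packets = 0
        self.retransmits = 0
        self.bytes = 0

    def retransmission_rate(self):
        return self.retransmits / self.tx_packets if self.tx_packets else 0.0


stats = defaultdict(LinkStats)


def on_packet(src, dst, size, retransmit=False):
    """Called once per packet event emitted by the kernel probes."""
    link = stats[(src, dst)]
    link.tx_packets += 1
    link.bytes += size
    if retransmit:
        link.retransmits += 1


# Feed a few synthetic events.
on_packet("gateway", "product", 512)
on_packet("gateway", "product", 512, retransmit=True)
print(stats[("gateway", "product")].retransmission_rate())  # 0.5
```

In a real deployment the events would arrive from a BPF ring buffer or per‑CPU map rather than direct function calls.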
Application Performance Monitoring
Traditional APM probes are tied to specific RPC libraries and languages. By attaching eBPF programs at the system‑call level (e.g., `read`, `write`, `sendto`, `recvfrom`), the solution captures request/response data independent of language or framework, parses protocols such as HTTP to extract the method, path, and status code, and computes latency.
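The protocol-parsing step can be illustrated with a small, hedged sketch: given raw request/response bytes as they would appear in `read`/`write` buffers, extract the HTTP method, path, and status code, and compute latency from syscall timestamps. The payloads and timestamps below are synthetic:

```python
# Sketch: parse raw HTTP bytes captured at the syscall boundary.
def parse_http_request(payload: bytes):
    """Return (method, path) from the first line of an HTTP request."""
    line = payload.split(b"\r\n", 1)[0].decode()
    method, path, _version = line.split(" ", 2)
    return method, path


def parse_http_status(payload: bytes) -> int:
    """Return the status code from the first line of an HTTP response."""
    line = payload.split(b"\r\n", 1)[0].decode()
    return int(line.split(" ", 2)[1])


req = b"GET /api/products HTTP/1.1\r\nHost: shop\r\n\r\n"
resp = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"

t_request, t_response = 1000.000, 1000.042  # syscall timestamps (seconds)
method, path = parse_http_request(req)
status = parse_http_status(resp)
latency_ms = (t_response - t_request) * 1000
print(method, path, status, round(latency_ms, 1))  # GET /api/products 200 42.0
```

Because the capture point is the syscall, the same parser works whether the application is written in Java, Go, or anything else.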
Fault Localization Practice
The presenters demonstrated a manual troubleshooting workflow in ARMS Kubernetes monitoring: start from an alert on a gateway service’s response time, follow the service topology to identify the downstream service (product service) with similar anomalies, and inspect network metrics (packet retransmissions, RTT) on the gateway‑to‑product link.
They then described how to automate this process: check golden metrics of the gateway, traverse downstream nodes, correlate network indicators, and finally enrich the analysis with log pattern recognition to produce a diagnostic report.
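The automated walk described above can be sketched as a graph traversal: start at the alerting service, follow downstream edges in the topology, and keep only neighbors whose golden metrics are also anomalous. The topology, metrics, and threshold below are illustrative assumptions, not values from the talk:

```python
# Sketch: follow the service topology from an alerting node, keeping only
# downstream services whose response time is also anomalous.
topology = {"gateway": ["product", "cart"], "product": ["db"], "cart": [], "db": []}
p99_ms = {"gateway": 900, "product": 870, "cart": 40, "db": 35}
THRESHOLD_MS = 500  # illustrative alert threshold


def trace_anomaly(service, visited=None):
    """Return the chain of anomalous services reachable from `service`."""
    visited = visited or set()
    if service in visited or p99_ms[service] <= THRESHOLD_MS:
        return []
    visited.add(service)
    chain = [service]
    for downstream in topology[service]:
        chain += trace_anomaly(downstream, visited)
    return chain


print(trace_anomaly("gateway"))  # ['gateway', 'product']
```

A production version would also join in the network indicators (retransmissions, RTT) for each traversed link before emitting the diagnostic report.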
Root‑Cause Analysis Steps
The automated workflow consists of three core steps: dimension attribution (drilling down into metrics by service, region, host, etc.), anomaly bounding (identifying abnormal values), and Fault Tree Analysis (FTA) to classify the failure type. By combining horizontal (service‑to‑service) and vertical (service‑to‑resource) attribution, the system can pinpoint the exact node, link, and root cause.
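The first two steps can be combined in a small sketch: drill a metric down by one dimension (here, host) and bound the anomaly by flagging values that deviate strongly from the group. The data and the 1.5‑sigma bound are illustrative choices, not from the talk:

```python
# Sketch: dimension attribution plus anomaly bounding.
import statistics

# Gateway p99 latency (ms) drilled down by the "host" dimension (made-up data).
latency_by_host = {"node-1": 21.0, "node-2": 23.0, "node-3": 22.0, "node-4": 95.0}


def anomalous_dimensions(values, n_sigma=1.5):
    """Return the dimension members whose value lies outside n_sigma stdevs."""
    mean = statistics.mean(values.values())
    stdev = statistics.pstdev(values.values())
    return [k for k, v in values.items() if abs(v - mean) > n_sigma * stdev]


print(anomalous_dimensions(latency_by_host))  # ['node-4']
```

Once the anomalous member is bounded (node‑4 here), a fault tree would then test hypotheses for that node (CPU saturation, packet loss, disk pressure, and so on) to classify the failure type.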
Conclusion
The talk concluded that eBPF enables a unified, low‑overhead observability stack for Kubernetes, covering data collection, correlation, and intelligent root‑cause analysis through dimension attribution, anomaly bounding, and FTA, and that these techniques are being integrated into the Insights product for multi‑scenario monitoring.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.