
How eBPF Powers Next‑Gen Observability and Fault Diagnosis in Kubernetes

At KubeCon China 2023, experts Liu Kai and Dong Shandong presented a three‑part deep dive into Kubernetes observability challenges, demonstrating how eBPF enables comprehensive data collection across all stack layers, seamless integration, and intelligent root‑cause analysis through dimension attribution, anomaly bounding, and fault‑tree methods.


Observability Challenges in Kubernetes

Analyzing more than 1,000 Kubernetes support tickets, the presenters identified three major challenges: infrastructure issues (network problems alone account for more than 56% of incidents), the need to collect data from many layers of the stack, and the difficulty of turning the collected data into actionable troubleshooting conclusions.

Data Collection with eBPF

eBPF is a virtual machine that runs inside the Linux kernel and allows custom logic to be loaded without recompiling the kernel or restarting applications. Developers write eBPF programs, compile them to bytecode, and attach them via the bpf() system call to hook points such as system calls, kernel functions, or user-space functions, gathering runtime information with zero intrusion, high performance, and strong safety guarantees.

Key characteristics of eBPF:

Non‑intrusive: dynamic attachment without process restart.

High performance: JIT‑compiled to native code.

Secure: runs in a sandbox verified by a strict verifier.
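The write-compile-attach lifecycle described above can be sketched with the bcc front end. This is an illustrative sketch, not the presenters' tooling: the program and probe target are made up, and it needs root privileges plus the bcc package to actually load, so it is shown for shape rather than as a runnable recipe.

```python
#!/usr/bin/env python3
# Sketch of the eBPF lifecycle with bcc: embed the C program as a string,
# let bcc compile it to bytecode, and attach it to a kernel hook point.
# Illustrative only; requires root and the bcc package.
from bcc import BPF

prog = r"""
int count_openat(void *ctx) {
    bpf_trace_printk("openat called\\n");
    return 0;
}
"""

b = BPF(text=prog)                          # compile C -> eBPF bytecode, load via bpf()
b.attach_kprobe(event=b.get_syscall_fnname("openat"),
                fn_name="count_openat")     # dynamic attach, no process restart
b.trace_print()                             # stream events emitted by the kernel side
```

Detaching happens automatically when the loader process exits, which is what makes the instrumentation non-intrusive to the monitored workloads.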

Architecture Awareness

Using eBPF, the team built an “architecture awareness” capability that automatically discovers the service topology, runtime status, and network flows of a Kubernetes cluster. By instrumenting kernel functions such as netif_receive_skb and dev_queue_xmit, they can count packets, measure packet sizes, and assess network quality (e.g., packet loss and retransmissions).
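As a concrete illustration, a receive-side probe might look like the following bcc-style eBPF sketch. This is not the presenters' code: the map name, struct, and handler are assumptions, and the program needs the bcc toolchain and root privileges to load into a kernel.

```c
// Hypothetical bcc-style eBPF sketch: count received packets and bytes
// per network device by hooking netif_receive_skb.
#include <uapi/linux/ptrace.h>
#include <linux/skbuff.h>
#include <linux/netdevice.h>

struct dev_stats {
    u64 packets;
    u64 bytes;
};

BPF_HASH(rx_stats, u32, struct dev_stats);  // keyed by interface index

int trace_netif_receive_skb(struct pt_regs *ctx, struct sk_buff *skb) {
    u32 ifindex = skb->dev->ifindex;
    struct dev_stats zero = {}, *stats;

    stats = rx_stats.lookup_or_try_init(&ifindex, &zero);
    if (stats) {
        stats->packets += 1;        // one more packet on this device
        stats->bytes += skb->len;   // accumulate observed packet size
    }
    return 0;
}
```

A symmetric kprobe on dev_queue_xmit would account for transmitted packets; comparing counters on the two ends of a link yields the loss and retransmission signals mentioned above.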

Application Performance Monitoring

Traditional APM probes are tied to specific RPC libraries and languages. By attaching eBPF programs at the system-call level (e.g., read, write, sendto, recvfrom), the solution captures request/response data independently of language or framework, parses protocols such as HTTP to extract the method, path, and status code, and computes request latency.
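The user-space parsing step can be sketched in ordinary Python. The buffer layout, field names, and timestamps below are illustrative assumptions, not the product's implementation: given the bytes captured from one read/write pair, extract the HTTP method, path, and status code, and derive latency from the syscall timestamps.

```python
# Illustrative sketch (not the actual ARMS implementation): parse HTTP
# request/response bytes captured at the syscall level and compute latency.

def parse_http_exchange(request: bytes, response: bytes,
                        req_ts: float, resp_ts: float) -> dict:
    """Extract golden-signal fields from one captured HTTP exchange."""
    # Request line looks like: "GET /api/products HTTP/1.1"
    method, path, _ = request.split(b"\r\n", 1)[0].split(b" ", 2)
    # Status line looks like: "HTTP/1.1 200 OK"
    status = int(response.split(b"\r\n", 1)[0].split(b" ", 2)[1])
    return {
        "method": method.decode(),
        "path": path.decode(),
        "status": status,
        "latency_ms": (resp_ts - req_ts) * 1000.0,  # response time minus request time
    }

req = b"GET /api/products HTTP/1.1\r\nHost: shop\r\n\r\n"
resp = b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok"
span = parse_http_exchange(req, resp, req_ts=10.000, resp_ts=10.042)
# span["method"] == "GET", span["status"] == 200, latency about 42 ms
```

Because the data is read at the syscall boundary, the same parser works whether the service is written in Java, Go, or anything else.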

Fault Localization Practice

The presenters demonstrated a manual troubleshooting workflow in ARMS Kubernetes monitoring: start from an alert on a gateway service’s response time, follow the service topology to identify the downstream service (product service) with similar anomalies, and inspect network metrics (packet retransmissions, RTT) on the gateway‑to‑product link.

They then described how to automate this process: check golden metrics of the gateway, traverse downstream nodes, correlate network indicators, and finally enrich the analysis with log pattern recognition to produce a diagnostic report.
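That automated walk can be sketched as a small graph traversal. The topology, metric values, and threshold below are invented for illustration: starting from the alerting service, visit downstream nodes, keep those whose golden metrics are also anomalous, and attach the network indicators of the connecting edge.

```python
# Illustrative sketch of the automated walk: start at the alerting service,
# follow the topology downstream, and correlate per-edge network metrics.
# The topology, metric values, and threshold are made-up examples.

topology = {"gateway": ["product", "cart"], "product": ["db"], "cart": [], "db": []}
rt_ms = {"gateway": 950, "product": 900, "cart": 30, "db": 25}   # response times
edge_net = {("gateway", "product"): {"retransmits": 120, "rtt_ms": 80},
            ("gateway", "cart"): {"retransmits": 0, "rtt_ms": 2}}

def localize(alerting: str, threshold_ms: float = 500.0) -> list:
    """Return (service, edge-network-metrics) pairs for anomalous downstreams."""
    findings, stack, seen = [], [alerting], {alerting}
    while stack:
        node = stack.pop()
        for child in topology.get(node, []):
            if child in seen:
                continue
            seen.add(child)
            if rt_ms.get(child, 0) > threshold_ms:       # golden-metric check
                findings.append((child, edge_net.get((node, child), {})))
                stack.append(child)                      # keep drilling downstream
    return findings

suspects = localize("gateway")
# → [("product", {"retransmits": 120, "rtt_ms": 80})]
```

The high retransmission count on the gateway-to-product edge is exactly the kind of correlated signal the presenters fed into the final diagnostic report, alongside log pattern recognition.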

Root‑Cause Analysis Steps

The automated workflow consists of three core steps: dimension attribution (drilling down into metrics by service, region, host, etc.), anomaly bounding (identifying abnormal values), and Fault Tree Analysis (FTA) to classify the failure type. By combining horizontal (service‑to‑service) and vertical (service‑to‑resource) attribution, the system can pinpoint the exact node, link, and root cause.
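A toy version of the first two steps, dimension attribution plus anomaly bounding, can be written in a few lines. The observations and the z-score cutoff are invented for illustration: group an error metric by each candidate dimension and report the dimension values whose contribution is a statistical outlier.

```python
# Toy sketch of dimension attribution + anomaly bounding (invented data):
# group an error count by each dimension and flag dominating values.
from statistics import mean, stdev

samples = [  # (service, host, errors) observations
    ("product", "node-1", 3), ("product", "node-2", 4),
    ("product", "node-3", 95), ("cart", "node-1", 2), ("cart", "node-3", 5),
]

def attribute(dimension_index: int, z_cutoff: float = 1.0) -> list:
    """Drill into the error metric along one dimension; return outlier values."""
    totals = {}
    for row in samples:
        key = row[dimension_index]
        totals[key] = totals.get(key, 0) + row[2]
    values = list(totals.values())
    mu, sigma = mean(values), stdev(values)            # anomaly bounding
    return [k for k, v in totals.items() if sigma and (v - mu) / sigma > z_cutoff]

# Host dimension: node-3 carries 100 of 109 errors and is flagged.
print(attribute(1))   # prints ['node-3']
```

With this cutoff the service dimension does not separate cleanly, so the host dimension wins the attribution; Fault Tree Analysis would then classify the bounded anomaly (e.g., a host-local resource problem versus a link-level retransmission storm) by walking a predefined tree of checks.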

Conclusion

The talk concluded that eBPF enables a unified, low‑overhead observability stack for Kubernetes, covering data collection, correlation, and intelligent root‑cause analysis through dimension attribution, anomaly bounding, and FTA, and that these techniques are being integrated into the Insights product for multi‑scenario monitoring.

Tags: cloud-native, Observability, Kubernetes, eBPF, Fault Diagnosis
Written by

Ops Development Stories

Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.
