Zero‑Code Full‑Stack Observability with OpenTelemetry eBPF: CloudMonitor 2.0’s In‑Kernel “Lens”
OpenTelemetry eBPF Instrumentation (OBI) attaches kernel‑level, zero‑code probes that automatically emit OpenTelemetry‑compatible traces and metrics for more than 15 protocols, including HTTP, gRPC, MySQL, Redis, and Kafka, as well as CUDA GPU operations. It also propagates trace context across languages, enriches existing logs with trace IDs, and integrates with CloudMonitor 2.0.
In heterogeneous cloud‑native environments, where services written in Go, Java, Python, Node.js, and other runtimes run in containers, on Kubernetes, and on serverless platforms, traditional APM typically requires a separate agent per language and often code changes. OBI (OpenTelemetry eBPF Instrumentation) addresses this by loading sandboxed eBPF programs into the Linux kernel, providing zero‑code, cross‑language observability that conforms to the OpenTelemetry standard.
OBI intercepts network traffic at the kernel level, parses protocol semantics, and emits trace and metric data for more than 15 protocols (HTTP, gRPC, MySQL, Redis, Kafka, MQTT, PostgreSQL, etc.) as well as GPU/CUDA operations. It also includes built‑in GenAI tracing for OpenAI, Anthropic, Google Gemini, and Alibaba Qwen, automatically extracting tool‑call information and vector‑retrieval events.
The core detection logic lives in ReadTCPRequestIntoSpan (pkg/ebpf/common/tcp_detect_transform.go). It performs a three‑stage waterfall match: (1) protocol constants already assigned in the kernel (e.g., MySQL=1, Postgres=2, Kafka=4); (2) deterministic generic matchers such as matchSQL, matchFastCGI, and matchMongo; and (3) a heuristic fallback (e.g., detectHeuristicProtocol). The first stage that recognizes the payload wins, so ordering is critical: HTTP/2 detection runs after MQTT so that MQTT traffic is not misclassified as HTTP/2.
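The waterfall ordering can be illustrated with a minimal Go sketch. Only the constants MySQL=1, Postgres=2 and Kafka=4 come from the description above; the matchers (looksLikeSQL, looksLikeMQTTConnect, looksLikeHTTP2Frame) and detect are hypothetical, deliberately crude stand‑ins, not OBI's matchSQL or detectHeuristicProtocol. The sample MQTT CONNECT packet shows why order matters: a loose HTTP/2 frame check reads its fourth byte (0x04) as a SETTINGS frame type.

```go
package main

import (
	"bytes"
	"fmt"
)

// Kernel-assigned protocol constants. MySQL=1, Postgres=2 and Kafka=4 follow the
// article; ProtoUnknown is an assumption meaning "nothing detected in the kernel".
const (
	ProtoUnknown  = 0
	ProtoMySQL    = 1
	ProtoPostgres = 2
	ProtoKafka    = 4
)

// Detected is the protocol label chosen by this hypothetical detector.
type Detected string

// matcher is one deterministic stage-2 check; its position in the slice is its priority.
type matcher struct {
	name  Detected
	match func(buf []byte) bool
}

// looksLikeSQL is a crude stand-in for a real SQL matcher.
func looksLikeSQL(buf []byte) bool {
	upper := bytes.ToUpper(buf)
	for _, kw := range []string{"SELECT ", "INSERT ", "UPDATE ", "DELETE "} {
		if bytes.HasPrefix(upper, []byte(kw)) {
			return true
		}
	}
	return false
}

// looksLikeMQTTConnect checks for an MQTT CONNECT packet with a one-byte
// remaining length: 0x10, <len>, then the protocol name "\x00\x04MQTT".
func looksLikeMQTTConnect(buf []byte) bool {
	return len(buf) >= 8 && buf[0] == 0x10 &&
		bytes.Equal(buf[2:8], []byte{0x00, 0x04, 'M', 'Q', 'T', 'T'})
}

// looksLikeHTTP2Frame is a loose heuristic: the client preface, or any 9+ byte
// buffer whose byte 3 is a known HTTP/2 frame type (0x0-0x9).
func looksLikeHTTP2Frame(buf []byte) bool {
	if bytes.HasPrefix(buf, []byte("PRI * HTTP/2.0")) {
		return true
	}
	return len(buf) >= 9 && buf[3] <= 0x09
}

var deterministic = []matcher{
	{"sql", looksLikeSQL},
	{"mqtt", looksLikeMQTTConnect},
}

// detect runs the three-stage waterfall; the first stage that answers wins.
func detect(kernelProto int, buf []byte) Detected {
	switch kernelProto { // stage 1: trust the kernel's classification
	case ProtoMySQL:
		return "mysql"
	case ProtoPostgres:
		return "postgres"
	case ProtoKafka:
		return "kafka"
	}
	for _, m := range deterministic { // stage 2: deterministic matchers, in order
		if m.match(buf) {
			return m.name
		}
	}
	if looksLikeHTTP2Frame(buf) { // stage 3: heuristic fallback
		return "http2"
	}
	return "unknown"
}

func main() {
	// MQTT CONNECT: byte 3 is 0x04, which the loose HTTP/2 check reads as SETTINGS.
	mqttConnect := []byte{0x10, 0x0c, 0x00, 0x04, 'M', 'Q', 'T', 'T', 0x04, 0x02, 0x00, 0x3c, 0x00, 0x00}
	fmt.Println(detect(ProtoUnknown, mqttConnect), looksLikeHTTP2Frame(mqttConnect))
}
```

Because the MQTT matcher sits in an earlier stage, it claims the packet before the HTTP/2 heuristic ever sees it; swapping the order would mislabel the connection.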
Go goroutines and Python asyncio tasks do not map one‑to‑one onto OS threads, so thread‑local state cannot tie an outbound call to the request that caused it. OBI therefore reconstructs execution context inside the kernel. In Go, it hooks runtime.newproc1 to build a goroutine parent‑child tree, stores the relationships in an LRU map, and walks up to six levels to link outbound calls to the originating inbound request. In Python asyncio, it tracks Task and Context objects via uprobes on task_step, _asyncio_Task___init__, and PyContext_CopyCurrent, covering both ordinary coroutines and work offloaded with asyncio.to_thread.
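A user‑space Go sketch of the goroutine‑ancestry idea, assuming goroutine IDs and a table of goroutines currently serving inbound requests are already known. In OBI this state lives in eBPF maps filled by a uprobe on runtime.newproc1; here lruParents (built on container/list) stands in for an LRU hash map, and tracker, OnNewproc and OriginTrace are hypothetical names.

```go
package main

import (
	"container/list"
	"fmt"
)

// maxAncestorDepth mirrors the six-level limit described in the article.
const maxAncestorDepth = 6

type edge struct{ child, parent uint64 }

// lruParents is a bounded child->parent map that evicts the least recently
// used entry, standing in for an eBPF LRU hash map.
type lruParents struct {
	capacity int
	order    *list.List               // front = most recently used
	items    map[uint64]*list.Element // child goroutine ID -> list element
}

func newLRUParents(capacity int) *lruParents {
	return &lruParents{capacity: capacity, order: list.New(), items: make(map[uint64]*list.Element)}
}

func (l *lruParents) Put(child, parent uint64) {
	if el, ok := l.items[child]; ok {
		el.Value = edge{child, parent}
		l.order.MoveToFront(el)
		return
	}
	if l.order.Len() >= l.capacity {
		oldest := l.order.Back()
		l.order.Remove(oldest)
		delete(l.items, oldest.Value.(edge).child)
	}
	l.items[child] = l.order.PushFront(edge{child, parent})
}

func (l *lruParents) Get(child uint64) (uint64, bool) {
	el, ok := l.items[child]
	if !ok {
		return 0, false
	}
	l.order.MoveToFront(el)
	return el.Value.(edge).parent, true
}

// tracker links goroutines back to the inbound request that (transitively) spawned them.
type tracker struct {
	parents *lruParents
	inbound map[uint64]string // goroutine ID -> trace ID of the request it is serving
}

// OnNewproc records what a uprobe on runtime.newproc1 would observe: parent spawns child.
func (t *tracker) OnNewproc(parent, child uint64) { t.parents.Put(child, parent) }

// OriginTrace checks the goroutine itself and then up to maxAncestorDepth ancestors.
func (t *tracker) OriginTrace(goid uint64) (string, bool) {
	cur := goid
	for depth := 0; depth <= maxAncestorDepth; depth++ {
		if traceID, ok := t.inbound[cur]; ok {
			return traceID, true
		}
		parent, ok := t.parents.Get(cur)
		if !ok {
			return "", false
		}
		cur = parent
	}
	return "", false
}

func main() {
	tr := &tracker{parents: newLRUParents(1024), inbound: map[uint64]string{1: "trace-a"}}
	tr.OnNewproc(1, 2) // handler goroutine 1 spawns worker 2
	tr.OnNewproc(2, 3) // worker 2 spawns goroutine 3, which makes the outbound call
	fmt.Println(tr.OriginTrace(3))
}
```

Bounding both the map size and the walk depth keeps the lookup cheap and predictable, which matters when the same logic has to pass the eBPF verifier.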
For non‑Go languages, cross‑process trace‑context propagation is handled uniformly in kernel space by the tpinjector (pkg/internal/ebpf/tpinjector). It injects either a custom TCP option (kind 25) or an HTTP Traceparent header, relying on the sock_ops header‑option mechanism (the WRITE_HDR_OPT callback and the bpf_store_hdr_opt helper) for the TCP option, and falls back to header injection when middleboxes strip the TCP option.
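The receiving side of that fallback can be sketched in Go: prefer a trace context carried in TCP option kind 25, otherwise parse the traceparent header. The header format (version‑traceid‑spanid‑flags) follows the W3C Trace Context specification; the 26‑byte option layout is an assumption made for illustration, not OBI's wire format, and all function names are hypothetical.

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
)

const (
	tcpOptKindTrace = 25 // TCP option kind cited in the article
	// traceOptLen is a HYPOTHETICAL layout used only here:
	// kind(1) | len(1) | trace ID(16) | span ID(8), within the 40-byte option space.
	traceOptLen = 26
)

type traceContext struct {
	TraceID [16]byte
	SpanID  [8]byte
	Sampled bool
}

// traceparentHeader renders a W3C Trace Context "traceparent" value (version 00).
func traceparentHeader(tc traceContext) string {
	flags := "00"
	if tc.Sampled {
		flags = "01"
	}
	return fmt.Sprintf("00-%s-%s-%s", hex.EncodeToString(tc.TraceID[:]), hex.EncodeToString(tc.SpanID[:]), flags)
}

// parseTraceparent parses a version-00 traceparent value.
func parseTraceparent(h string) (traceContext, error) {
	var tc traceContext
	if len(h) != 55 || h[:3] != "00-" || h[35] != '-' || h[52] != '-' {
		return tc, errors.New("malformed traceparent")
	}
	_, err := hex.Decode(tc.TraceID[:], []byte(h[3:35]))
	if err == nil {
		_, err = hex.Decode(tc.SpanID[:], []byte(h[36:52]))
	}
	var flags []byte
	if err == nil {
		flags, err = hex.DecodeString(h[53:55])
	}
	if err != nil {
		return tc, err
	}
	tc.Sampled = flags[0]&0x01 == 1
	return tc, nil
}

// encodeTraceOption packs the IDs into the hypothetical layout (flags omitted).
func encodeTraceOption(tc traceContext) []byte {
	opt := append([]byte{tcpOptKindTrace, traceOptLen}, tc.TraceID[:]...)
	return append(opt, tc.SpanID[:]...)
}

// findOption scans a raw TCP options area: EOL=0, NOP=1, otherwise kind/len/value.
func findOption(opts []byte, kind byte) ([]byte, bool) {
	for i := 0; i < len(opts); {
		switch opts[i] {
		case 0:
			return nil, false
		case 1:
			i++
			continue
		}
		if i+1 >= len(opts) || opts[i+1] < 2 || i+int(opts[i+1]) > len(opts) {
			return nil, false // truncated or malformed option
		}
		if opts[i] == kind {
			return opts[i : i+int(opts[i+1])], true
		}
		i += int(opts[i+1])
	}
	return nil, false
}

// extractContext prefers the TCP option and falls back to the header when a
// middlebox has stripped the option.
func extractContext(tcpOpts []byte, header string) (traceContext, string, error) {
	if opt, ok := findOption(tcpOpts, tcpOptKindTrace); ok && len(opt) == traceOptLen {
		var tc traceContext
		copy(tc.TraceID[:], opt[2:18])
		copy(tc.SpanID[:], opt[18:26])
		return tc, "tcp-option", nil
	}
	tc, err := parseTraceparent(header)
	return tc, "header", err
}

func main() {
	tc := traceContext{Sampled: true}
	tc.TraceID[15], tc.SpanID[7] = 0x01, 0x02
	hdr := traceparentHeader(tc)
	_, via1, _ := extractContext(append([]byte{1, 1}, encodeTraceOption(tc)...), hdr)
	_, via2, _ := extractContext([]byte{1, 1, 0}, hdr) // option stripped in transit
	fmt.Println(hdr, via1, via2)
}
```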
OBI’s user‑space pipeline is a DAG built with the internal swarm framework. The top‑level RunWithContextInfo starts three independent agents (application monitoring, network monitoring, and log enrichment), each in its own goroutine and all managed by an errgroup. Each agent is split into an Instancer (initialization) and a Runner (execution). Stages are connected by a lock‑free, deadlock‑detecting fan‑out queue (msg.Queue[T]) with built‑in bypass, timeout, and multi‑producer close semantics.
Data moves from the kernel to user space through a two‑goroutine ring‑buffer forwarder (ringBufForwarder[T]) that separates reading (readerLoop → ReadInto) from parsing (parserLoop → Parse). An object pool of 2 × BatchLength entries reduces GC pressure, and a batch is flushed either when it reaches its size limit or when a 1‑second ticker fires. A single shared ring buffer (SharedRingBuffer) serves all traced processes. Shutdown is graceful: it walks the DAG, and if any agent fails to initialize, all agents are cancelled.
Beyond tracing, OBI provides network‑level observability (TC‑based L3/L4 packet capture, GeoIP, reverse DNS, and CIDR tagging) and TCP statistics such as RTT and connection failures, exposed as Prometheus metrics (obi_stat_tcp_rtt_seconds, obi_stat_tcp_failed_connections). For log enrichment, it hooks tty_write and pipe_write to inject trace_id and span_id into JSON logs without modifying application code.
Deployment requires Linux 5.8+ (or 4.18+ on RHEL) on amd64/arm64, and can run as a standalone process, Docker container, or Kubernetes DaemonSet. Propagation mode is configurable via OTEL_EBPF_BPF_CONTEXT_PROPAGATION (headers, TCP option, or disabled). The solution integrates with Alibaba Cloud’s CloudMonitor 2.0, enabling one‑click observability for existing workloads.
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
Alibaba Cloud Observability
