eBPF-based Service Interface Topology Observation and Validation in Didi's Observability Platform
Didi's observability platform leverages non-intrusive eBPF probes to automatically capture and validate service-to-service call tuples, supplement missing SDK data, and achieve roughly 80% coverage of core call paths, while addressing verification challenges and planning future user-space VM hooks and deeper MTL integration.
In the previous article we discussed Didi's observability practice focusing on the correlation between different observation signals. This article continues the discussion by exploring how service‑to‑service relationships are linked, the current popularity of eBPF, and its application within Didi.
Background
Didi's observability platform not only builds the MTL (Metric-Trace-Log) capability but also handles business-side data and service interface call observations.
To illustrate interface call topology, a simple request-response diagram is shown (image omitted). The notation [caller=A, caller-func=/a, callee=B, callee-func=/b] (shortened to [A, /a, B, /b]) describes how service A's /a triggers calls to B's /b and C's /c. By aggregating enough call data and chaining entry points, important business call paths can be extracted.
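To make the notation concrete, here is a minimal sketch of how such four-tuples might be represented and chained from an entry point into a call path. The `CallEdge` type and `ChainFrom` function are hypothetical illustrations, not Didi's actual data model:

```go
package main

import "fmt"

// CallEdge is a hypothetical representation of the four-tuple
// [caller, caller-func, callee, callee-func] described above.
type CallEdge struct {
	Caller, CallerFunc, Callee, CalleeFunc string
}

// ChainFrom walks aggregated edges from an entry interface and
// returns the downstream interfaces reachable along the call path.
func ChainFrom(edges []CallEdge, svc, fn string) []string {
	var path []string
	for _, e := range edges {
		if e.Caller == svc && e.CallerFunc == fn {
			path = append(path, e.Callee+e.CalleeFunc)
			path = append(path, ChainFrom(edges, e.Callee, e.CalleeFunc)...)
		}
	}
	return path
}

func main() {
	// [A, /a, B, /b], [A, /a, C, /c], plus a deeper hop [B, /b, D, /d]
	edges := []CallEdge{
		{"A", "/a", "B", "/b"},
		{"A", "/a", "C", "/c"},
		{"B", "/b", "D", "/d"},
	}
	fmt.Println(ChainFrom(edges, "A", "/a")) // [B/b D/d C/c]
}
```

A production system would also deduplicate edges and guard against cycles, which this sketch omits.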
Constructing these call paths is crucial for service stability, disaster recovery, capacity planning, and peak‑time health checks. In practice, interface‑level call topologies are far more effective than service‑level or host‑level topologies for troubleshooting and capacity assessment.
Business Problem: Validation of Service Interface Topology
The generated data lacks a verification method. Since the data is reported by business code via SDKs, the caller-func field can be missing or incorrect.
Verification and generation costs are high. Core call paths are relatively easy to modify, but non-core or legacy paths are difficult to instrument, and manually enumerating thousands of links is impractical.
These two issues are common when using metric‑based interface topology generation.
Solution Overview
Didi adopts a non-intrusive eBPF (referred to as BPF) approach that combines metric collection with BPF-based verification and supplementation of missing data. The solution also explores deeper BPF usage such as MTL integration.
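The core of the verification idea can be sketched as a set difference between SDK-reported edges and BPF-observed edges. The `Edge` type and `Diff` function below are illustrative assumptions, not Didi's implementation:

```go
package main

import "fmt"

// Edge is a hypothetical four-tuple key used for cross-checking.
type Edge struct {
	Caller, CallerFunc, Callee, CalleeFunc string
}

// Diff cross-checks SDK-reported edges against BPF-observed edges:
// edges seen only by BPF are candidates to supplement the topology,
// edges reported only by the SDK are candidates for review.
func Diff(sdk, bpf map[Edge]bool) (missing, suspect []Edge) {
	for e := range bpf {
		if !sdk[e] {
			missing = append(missing, e)
		}
	}
	for e := range sdk {
		if !bpf[e] {
			suspect = append(suspect, e)
		}
	}
	return
}

func main() {
	sdk := map[Edge]bool{{"A", "/a", "B", "/b"}: true}
	bpf := map[Edge]bool{
		{"A", "/a", "B", "/b"}: true,
		{"A", "/a", "C", "/c"}: true, // captured by BPF, absent from SDK data
	}
	missing, suspect := Diff(sdk, bpf)
	fmt.Println(len(missing), len(suspect)) // 1 0
}
```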
BPF Introduction
BPF originated as the Berkeley Packet Filter; starting with kernel 3.15 it was extended into eBPF, which superseded the original classic BPF (cBPF). eBPF supports various hook types (kprobe, uprobe, tracepoint, etc.); programs are written in restricted C, compiled to BPF bytecode, and loaded into the kernel via system calls.
Key event types supported up to kernel 4.18 are illustrated (image omitted).
Typical BPF development uses kprobe and uprobe. Most kernel functions can be hooked with kprobe, while user-space functions are hooked with uprobe. The following example shows a bpftrace script that observes /bin/bash readline calls:
#!/usr/bin/bpftrace
BEGIN {
    printf("Observing bash...\nHit Ctrl-C to stop\n");
}
uretprobe:/bin/bash:readline {
    printf("cmd: %s\n", str(retval));
}

Running the script produces output such as:

$ sudo bpftrace ./bashreadline.bt
Attaching 2 probes...
Observing bash...
Hit Ctrl-C to stop
cmd: ls -l
cmd: pwd
cmd: crontab -e
cmd: clear

Using BPF to Solve Service Interface Topology Issues
A simple Golang service (Go 1.16) is used to demonstrate the approach. The service defines two HTTP handlers, /handle and /echo. The four-tuple for a request is [local, /handle, local, /echo]. The code is shown below:
func echo(c *gin.Context) {
	c.JSON(http.StatusOK, &Resp{Errno: 0, Errmsg: "ok"})
}

func handle(c *gin.Context) {
	client := http.Client{}
	req, _ := http.NewRequest(http.MethodGet, "http://0.0.0.0:9932/echo", nil)
	resp, err := client.Do(req)
	if err != nil {
		c.JSON(http.StatusOK, &Resp{Errno: 1, Errmsg: "failed to request"})
		return
	}
	// Close the body even if reading it fails below.
	defer resp.Body.Close()
	respB, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		c.JSON(http.StatusOK, &Resp{Errno: 2, Errmsg: "failed to read response"})
		return
	}
	fmt.Println("resp: ", string(respB))
	c.JSON(http.StatusOK, &Resp{Errno: 0, Errmsg: "request okay"})
}

A bpftrace script collects the caller and callee information without modifying the service code:
uprobe:./http_demo:net/http.serverHandler.ServeHTTP {
$req_addr = sarg3;
$url_addr = *(uint64*)($req_addr+16);
$path_addr = *(uint64*)($url_addr+56);
$path_len = *(uint64*)($url_addr+64);
@caller_path_addr[pid] = $path_addr;
@caller_path_len[pid] = $path_len;
@callee_set[pid] = 0;
}
uprobe:./http_demo:"net/http.(*Client).do" {
printf("caller: \n caller_path: %s\n", str(@caller_path_addr[pid], @caller_path_len[pid]));
$req_addr = sarg1;
$addr = *(uint64*)($req_addr);
$len = *(uint64*)($req_addr + 8);
printf("callee: \n method: %s\n", str($addr, $len));
$url_addr = *(uint64*)($req_addr + 16);
$addr = *(uint64*)($url_addr + 40);
$len = *(uint64*)($url_addr + 48);
printf(" host: %s\n", str($addr, $len));
$addr = *(uint64*)($url_addr + 56);
$len = *(uint64*)($url_addr + 64);
printf(" url: %s\n\n", str($addr, $len));
@callee_set[pid] = 1;
}
uprobe:./http_demo:"net/http.(*response).finishRequest" {
if (@callee_set[pid] == 0){
printf("caller: \n caller_path: %s\n", str(@caller_path_addr[pid], @caller_path_len[pid]));
printf("callee: none\n\n");
@callee_set[pid] = 1;
}
}

Running the collector yields output such as:
# start the collector
$ bpftrace ./http.bt
Attaching 3 probes... # waits here until a request is triggered
caller: # output after a request is triggered
 caller_path: /handle
callee:
 method: GET
 host: 0.0.0.0:9932
 url: /echo

caller:
 caller_path: /echo
callee: none

This demonstrates that BPF can capture the four-tuple of service calls without any code changes, highlighting BPF's power for observability.
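To turn such raw collector output into topology edges, a small parser is all that is needed. The following is a minimal sketch (the `ParseTuples` function is an illustrative assumption) that extracts (caller_path, callee_url) pairs from output shaped like the above, mapping "callee: none" to an empty callee:

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// ParseTuples scans collector output line by line and pairs each
// caller_path with the url of its downstream call, if any.
func ParseTuples(out string) [][2]string {
	var tuples [][2]string
	var caller string
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		switch {
		case strings.HasPrefix(line, "caller_path:"):
			caller = strings.TrimSpace(strings.TrimPrefix(line, "caller_path:"))
		case strings.HasPrefix(line, "url:"):
			callee := strings.TrimSpace(strings.TrimPrefix(line, "url:"))
			tuples = append(tuples, [2]string{caller, callee})
		case line == "callee: none":
			// request handled without any downstream call
			tuples = append(tuples, [2]string{caller, ""})
		}
	}
	return tuples
}

func main() {
	out := "caller:\n caller_path: /handle\ncallee:\n method: GET\n" +
		" host: 0.0.0.0:9932\n url: /echo\ncaller:\n caller_path: /echo\ncallee: none\n"
	fmt.Println(ParseTuples(out)) // [[/handle /echo] [/echo ]]
}
```

In production the pairs would additionally carry the host and service identity to form the full four-tuple.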
Practical Coverage and Effectiveness
In production, Didi's solution covers Golang and PHP services, reaching roughly 80% coverage of core call paths. Compared with pure metric-based data, BPF adds up to 20% new call edges on core paths.
Challenges with uprobe
Limited universality: uprobe hooks are tightly coupled to specific languages or frameworks and require symbol tables. If symbols are missing, uprobe cannot work.
Performance overhead: each uprobe incurs a user‑to‑kernel and kernel‑to‑user transition (~1 µs), which is higher than kprobe (~100 ns). High‑frequency hooks can noticeably affect target processes.
Despite these drawbacks, Didi prefers uprobe for its faster development cycle and lower implementation cost.
Future Directions
Large numbers of uprobe hooks (over 1500 per physical machine) raise stability concerns. Moving the hook execution to a user‑space VM could reduce latency. Additionally, integrating BPF with MTL (Metric‑Trace‑Log) fusion can automatically correlate metrics, logs, and traces when BPF maintains per‑request trace information.
The helper bpf_probe_write_user enables writing data directly into user-space memory from BPF, opening new possibilities for MTL fusion:

long bpf_probe_write_user(void *dst, const void *src, u32 len)
Finally, the article invites readers to share which eBPF problems they hope to solve and announces a giveaway for the most interesting comment.
Didi Tech
Official Didi technology account