Reflections on the 3rd eBPF Developer Conference: Harnessing eBPF for AI
The article recaps the 3rd eBPF Developer Conference in Xi'an, highlighting talks on BPF‑on‑MPTCP, system‑wide PGO, bperf, autonomous‑driving use cases, and AI‑driven observability, while sharing the author's insights on continuous profiling, SysOM, and future challenges of scaling eBPF with large models.
Morning Main Forum
Tang Geliang (Kylin Software) presented BPF on MPTCP, showing how eBPF can sense per‑path bandwidth and latency in Multipath TCP and dynamically allocate traffic to avoid congestion.
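A real BPF-on-MPTCP scheduler runs in the kernel; as a rough user-space illustration of the allocation policy described in the talk (path names, fields, and numbers here are invented), the core idea is to weight each subflow by its measured bandwidth, discounted by latency:

```python
# Illustrative sketch only, not the presented implementation: split traffic
# across MPTCP subflows in proportion to bandwidth/latency scores.

def allocate_traffic(paths, total_bytes):
    """Split total_bytes across paths; higher bandwidth and lower RTT earn a larger share."""
    scores = {name: p["bandwidth_mbps"] / p["rtt_ms"] for name, p in paths.items()}
    total_score = sum(scores.values())
    return {name: total_bytes * s / total_score for name, s in scores.items()}

# Hypothetical two-path setup: a fast Wi-Fi link and a slower LTE link.
paths = {
    "wifi": {"bandwidth_mbps": 100.0, "rtt_ms": 20.0},
    "lte":  {"bandwidth_mbps": 50.0,  "rtt_ms": 40.0},
}
shares = allocate_traffic(paths, 1_000_000)
```

Under these invented numbers the Wi-Fi path scores four times higher than LTE and receives 80% of the bytes; an in-kernel scheduler would apply the same kind of policy per packet or per burst.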
Ren Yuxin (openEuler) described a whole‑system PGO solution that uses eBPF to collect runtime performance data across the OS, enabling profile‑guided optimizations for the entire stack.
Liu Song (Meta) introduced bperf, an eBPF‑based enhancement to the Linux perf subsystem that reduces overhead and improves measurement accuracy.
Chen Tao (Didi) shared a case study of eBPF in autonomous driving, where eBPF agents monitor vehicle state, network connectivity, and sensor streams in real time to detect and mitigate anomalies.
Round‑Table Discussion
Experts from academia and industry, including professors from Xiyou and practitioners from Huawei, Alibaba Cloud, and Didi, examined eBPF’s characteristics and its evolution under large‑model AI. They identified two open research directions:
System for AI : using eBPF to observe GPU/CPU faults and performance metrics during model training and inference.
AI for System : applying large language models to correlate business‑level KPIs with low‑level Linux indicators.
Challenges highlighted included integrating AI workloads with eBPF instrumentation and the difficulty of debugging eBPF programs.
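The "AI for System" direction above can be grounded with a small sketch: before handing data to a language model, a system might rank low-level Linux indicators by how strongly they correlate with a business KPI, to surface candidate causes. The metric names and values below are invented examples:

```python
# Hypothetical sketch: rank Linux indicators by Pearson correlation with a
# business-level KPI (e.g. request latency). All data here is made up.

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

kpi_latency_ms = [12, 15, 30, 45, 44, 20]        # business-level KPI
metrics = {
    "cpu_runqueue_len": [1, 2, 6, 9, 9, 3],      # tracks the KPI closely
    "disk_io_wait_pct": [5, 5, 4, 6, 5, 5],      # mostly flat
}

# Indicators sorted by absolute correlation with the KPI, strongest first.
ranked = sorted(metrics, key=lambda m: abs(pearson(kpi_latency_ms, metrics[m])), reverse=True)
```

A large model would then reason over the top-ranked indicators rather than the raw firehose; the correlation step is only a pre-filter.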
Parallel Sessions
Four sub‑tracks covered eBPF trends, networking & security, observability, and performance engineering. Cheng Shuyi demonstrated Coolbpf and an AI‑driven flame‑graph that visualizes CPU and GPU call stacks together, enabling rapid bottleneck identification with Perfetto support.
SysOM Intelligent Operations Platform
SysOM provides continuous profiling for both CPU and GPU. It collects low‑overhead, high‑precision samples from user‑space and kernel stacks, stores them in a backend, and offers a UI for differential analysis across instances, models, and GPU cards. The platform extends the traditional observability pillars—logs, tracing, metrics—by adding continuous profiling that merges user‑space and kernel‑space insights.
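The differential analysis SysOM's UI performs on profiles reduces, at its core, to comparing folded stacks between two captures. A minimal sketch (the stack strings are invented examples, not SysOM's data format):

```python
# Illustrative diff over folded stacks (stack -> sample count): report which
# stacks grew or shrank between a baseline and a comparison capture.

def diff_profiles(baseline, comparison):
    """Return {stack: delta_samples} for every stack seen in either profile."""
    stacks = set(baseline) | set(comparison)
    return {s: comparison.get(s, 0) - baseline.get(s, 0) for s in stacks}

baseline   = {"main;handle_req;memcpy": 120, "main;gc": 40}
comparison = {"main;handle_req;memcpy": 300, "main;gc": 35}
delta = diff_profiles(baseline, comparison)
# The memcpy stack gained 180 samples between captures: a regression candidate.
```

The same diff applies across instances, models, or GPU cards once each dimension is captured as its own profile.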
Remaining Challenges and Proposed Solutions
Deploying profiling at scale faces three main issues:
Massive data volume.
High collection cost.
Non‑trivial overhead.
Proposed mitigations include centralizing symbol resolution, using large‑model‑driven adaptive sampling rates, and tuning network parameters to reduce per‑node data transfer.
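Stripped of the model in the loop, adaptive sampling is a control loop: back off the per-node sampling frequency when the emitted data rate exceeds a budget, and raise it when there is headroom. A hedged sketch with invented numbers (a real system would let a model choose the budget and targets):

```python
# Illustrative control loop for adaptive profiling sample rates.

def next_sample_hz(current_hz, observed_bytes_per_s, budget_bytes_per_s,
                   min_hz=1, max_hz=999):
    """Scale sampling frequency toward the data budget, clamped to limits."""
    if observed_bytes_per_s == 0:
        return max_hz  # no data observed; sample as fast as allowed
    scaled = current_hz * budget_bytes_per_s / observed_bytes_per_s
    return max(min_hz, min(max_hz, int(scaled)))

# Over budget: 99 Hz producing 2 MB/s against a 1 MB/s budget -> back off.
hz = next_sample_hz(99, 2_000_000, 1_000_000)
```

Centralized symbol resolution attacks the same problem from the other end: nodes ship raw addresses, and the backend resolves them once against shared symbol tables instead of every agent paying the cost.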
Additional work on Perfetto optimizations for GPU profiling was noted.
References
SysOM AI flame‑graph and profiling details: https://mp.weixin.qq.com/s?__biz=MzkyMjM4MTcwOQ==&mid=2247485451&idx=1&sn=3a76911f89a20368c5aade6d9357ed1b
SysOM observability system construction (Part 1): https://mp.weixin.qq.com/s?__biz=MzkyMjM4MTcwOQ==&mid=2247485414&idx=1&sn=eee2c25c903ce5041b81fe3692031893
Video tutorial code repository: https://github.com/haolipeng/libbpf-ebpf-beginer/tree/master/src
Associated blog post: https://github.com/haolipeng/study_cloud_security_public/blob/master/ebpf%E5%AD%A6%E4%B9%A0/ebpf%E5%BC%80%E5%8F%91%E6%89%8B%E6%8A%8A%E6%89%8B%E6%95%99%E5%AD%A6/%E7%AC%AC%E4%B8%80%E8%AF%BE%20helloworld%E7%A8%8B%E5%BA%8F.md
This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
