How Minsheng Bank Built eBPF‑Based Observability for Cloud‑Native Services
The article details Minsheng Bank's step‑by‑step journey from traditional network monitoring to a full‑stack, zero‑intrusion observability platform built with DeepFlow, vTap, distributed data collection, and eBPF, illustrating concrete case studies and future plans for expanding business‑level monitoring.
Background
Minsheng Bank needed to move from network‑centric troubleshooting to cloud‑native observability because business continuity demands required faster fault isolation.
Traditional flow analysis platform
About 7‑8 years ago the bank built a flow‑analysis platform that mirrored production traffic via switch port mirroring, filtered and labeled it, and fed it to monitoring systems, providing data services to the transaction, security, and big‑data platforms.
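To make the pipeline concrete, here is a minimal Go sketch of the kind of mirror‑port capture‑and‑filter loop such a platform runs. The interface name "mirror0", the BPF filter string, and the use of the gopacket library are illustrative assumptions, not details from the original platform.

```go
package main

import (
	"fmt"
	"log"

	"github.com/google/gopacket"
	"github.com/google/gopacket/pcap"
)

func main() {
	// Open the interface that receives the switch's mirrored (SPAN) traffic.
	// "mirror0" is a placeholder for the actual mirror-port NIC.
	handle, err := pcap.OpenLive("mirror0", 65535, true, pcap.BlockForever)
	if err != nil {
		log.Fatal(err)
	}
	defer handle.Close()

	// Keep only traffic the downstream monitoring systems care about;
	// the subnet in this filter is illustrative.
	if err := handle.SetBPFFilter("tcp and net 10.0.0.0/16"); err != nil {
		log.Fatal(err)
	}

	src := gopacket.NewPacketSource(handle, handle.LinkType())
	for pkt := range src.Packets() {
		// Label and forward each packet to the consumers (omitted here).
		fmt.Println(pkt.Metadata().Timestamp, pkt.Metadata().Length)
	}
}
```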
Observability evolution
Phase 1 – vTap‑based traffic distribution
Deployed DeepFlow collectors on compute nodes to capture east‑west traffic inside containers and VMs using vTap. Traffic was encapsulated in VXLAN tunnels and sent to a unified aggregation platform, covering the virtual‑network blind spot.
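The encapsulation step is simple enough to sketch. Below is a minimal Go example that prepends an RFC 7348 VXLAN header to a captured Ethernet frame and ships it over UDP; the aggregation address, VNI, and standard port 4789 are placeholders, and a real collector adds batching, MTU handling, and error paths this sketch omits.

```go
package main

import (
	"log"
	"net"
)

// vxlanEncap prepends an 8-byte VXLAN header (RFC 7348) to a captured
// Ethernet frame so it can be tunneled to the aggregation platform.
func vxlanEncap(vni uint32, frame []byte) []byte {
	hdr := make([]byte, 8)
	hdr[0] = 0x08 // flags: valid-VNI bit set
	hdr[4] = byte(vni >> 16)
	hdr[5] = byte(vni >> 8)
	hdr[6] = byte(vni)
	return append(hdr, frame...)
}

func main() {
	// Address of the unified aggregation platform; placeholder values.
	conn, err := net.Dial("udp", "10.1.2.3:4789")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	frame := []byte{ /* raw Ethernet frame captured from a pod NIC */ }
	if _, err := conn.Write(vxlanEncap(100, frame)); err != nil {
		log.Fatal(err)
	}
}
```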
Phase 2 – Distributed data collection
Raw container traffic saturated the aggregation layer, so processing moved into the collectors: metrics, logs, and trace data are computed locally, and only structured data is sent upstream. This reduced bandwidth pressure and enabled full‑path tracing across four TCP capture points (client pod NIC, client node NIC, server node NIC, server pod NIC).
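A rough Go sketch of the in‑collector idea, with invented type and field names: packets are folded into per‑flow counters locally, and only compact structured records leave the node.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// FlowKey identifies one client→server TCP flow at a capture point.
type FlowKey struct {
	Src, Dst string
}

// FlowStats is the structured record shipped upstream instead of raw packets.
type FlowStats struct {
	Packets uint64        `json:"packets"`
	Bytes   uint64        `json:"bytes"`
	RTTSum  time.Duration `json:"rtt_sum_ns"`
}

type Aggregator struct {
	flows map[FlowKey]*FlowStats
}

func NewAggregator() *Aggregator { return &Aggregator{flows: map[FlowKey]*FlowStats{}} }

// Observe folds one packet into the local per-flow counters.
func (a *Aggregator) Observe(k FlowKey, size int, rtt time.Duration) {
	s, ok := a.flows[k]
	if !ok {
		s = &FlowStats{}
		a.flows[k] = s
	}
	s.Packets++
	s.Bytes += uint64(size)
	s.RTTSum += rtt
}

// Flush emits compact structured records and resets state; only these
// records, not raw traffic, travel to the aggregation layer.
func (a *Aggregator) Flush() {
	for k, s := range a.flows {
		rec, _ := json.Marshal(s)
		fmt.Printf("%s -> %s %s\n", k.Src, k.Dst, rec)
	}
	a.flows = map[FlowKey]*FlowStats{}
}

func main() {
	agg := NewAggregator()
	agg.Observe(FlowKey{"pod-a", "pod-b"}, 1500, 2*time.Millisecond)
	agg.Observe(FlowKey{"pod-a", "pod-b"}, 600, 3*time.Millisecond)
	agg.Flush()
}
```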
Phase 3 – eBPF application observability
Leveraged eBPF to capture application‑level data (function calls, metrics, logs) with zero intrusion into the applications, covering services such as Nginx, DNS and Redis. Provided call‑chain tracing, flame‑graph analysis and CPU profiling, extending observability from the network layer to the system and application layers.
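As an illustration of zero‑intrusion capture, here is a hedged Go sketch using the cilium/ebpf library: it loads a precompiled eBPF object, attaches a kprobe to tcp_sendmsg, and streams events from a ring buffer. The object file name ("probe.o") and the program and map names are placeholders; this is not DeepFlow's actual loader.

```go
package main

import (
	"log"

	"github.com/cilium/ebpf"
	"github.com/cilium/ebpf/link"
	"github.com/cilium/ebpf/ringbuf"
)

func main() {
	// Load a precompiled eBPF object; "probe.o" and its program/map
	// names are placeholders, not DeepFlow's actual artifacts.
	coll, err := ebpf.LoadCollection("probe.o")
	if err != nil {
		log.Fatal(err)
	}
	defer coll.Close()

	// Attach to tcp_sendmsg so every send on a TCP socket is observed
	// without touching the application: the zero-intrusion part.
	kp, err := link.Kprobe("tcp_sendmsg", coll.Programs["trace_send"], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer kp.Close()

	// Stream events (function calls, latencies, payload slices) that
	// the kernel side pushes into a ring buffer map named "events".
	rd, err := ringbuf.NewReader(coll.Maps["events"])
	if err != nil {
		log.Fatal(err)
	}
	defer rd.Close()

	for {
		rec, err := rd.Read()
		if err != nil {
			log.Fatal(err)
		}
		log.Printf("event: %d bytes", len(rec.RawSample))
	}
}
```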
Phase 4 – Data‑plane unification and exploration
Ingested Prometheus, SkyWalking and Tingyun metrics to build a unified data foundation. Explored WebAssembly‑based deep‑packet and system‑call decoding to expose business‑level fields (transaction IDs, response codes).
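Business‑field decoding can be pictured as a small pure function over the raw payload, which is the shape of logic that compiles cleanly to WebAssembly (e.g. with TinyGo) and runs as a plugin. The sketch below parses a transaction ID and response code out of an HTTP response; the header name "X-Transaction-Id" is hypothetical, and no claim is made about DeepFlow's actual plugin ABI.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

// decodeBusinessFields pulls business-level fields out of a raw HTTP
// response payload. The "X-Transaction-Id" header is a hypothetical
// stand-in for the bank's real wire format.
func decodeBusinessFields(payload []byte) (txnID string, code int, err error) {
	resp, err := http.ReadResponse(bufio.NewReader(strings.NewReader(string(payload))), nil)
	if err != nil {
		return "", 0, err
	}
	defer resp.Body.Close()
	return resp.Header.Get("X-Transaction-Id"), resp.StatusCode, nil
}

func main() {
	raw := []byte("HTTP/1.1 200 OK\r\nX-Transaction-Id: T20240101-0001\r\nContent-Length: 0\r\n\r\n")
	txn, code, err := decodeBusinessFields(raw)
	if err != nil {
		panic(err)
	}
	fmt.Printf("transaction=%s response_code=%d\n", txn, code)
}
```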
Case studies
A 3.04 s web‑service latency was traced to a backend oms‑app pod via eBPF call‑chain tracing.
A ~500 ms retail‑service latency was pinned to a slow acuiagwapp backend service.
An HTTP 502 error was traced to a failed DNS resolution on the server side.
High‑frequency DNS queries from tpp‑pay‑* pods revealed an inefficient kube‑dns configuration, prompting DNS optimization.
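The last case reduces to a simple aggregation over structured DNS logs. Here is a minimal Go sketch of how high‑frequency DNS clients can be flagged; the record fields and the threshold are illustrative, not the platform's actual schema.

```go
package main

import (
	"fmt"
	"sort"
)

// DNSLog is one structured DNS request record as emitted by the collectors.
type DNSLog struct {
	ClientPod string
	Query     string
}

// topTalkers counts queries per pod and flags any pod above the threshold;
// the threshold is per collection window and purely illustrative.
func topTalkers(logs []DNSLog, threshold int) []string {
	counts := map[string]int{}
	for _, l := range logs {
		counts[l.ClientPod]++
	}
	var noisy []string
	for pod, n := range counts {
		if n > threshold {
			noisy = append(noisy, fmt.Sprintf("%s (%d queries)", pod, n))
		}
	}
	sort.Strings(noisy)
	return noisy
}

func main() {
	logs := []DNSLog{
		{"tpp-pay-0", "svc.cluster.local."},
		{"tpp-pay-0", "svc.cluster.local."},
		{"oms-app-1", "db.example.com."},
	}
	for _, p := range topTalkers(logs, 1) {
		fmt.Println("high-frequency DNS client:", p)
	}
}
```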
Benefits
Full‑stack, proactive observability reduces fault‑location time, provides data services for network, system and application teams, and supports faster root‑cause analysis.
Future work
Plan to achieve 100% kernel‑version coverage for eBPF by 2024, integrate deeper APM data from SkyWalking and Tingyun, and extend business‑transaction monitoring using WebAssembly‑based packet and syscall decoding.
DeepFlow
DeepFlow is an open‑source observability product that uses eBPF for zero‑code metric, trace and log collection and smart tagging for universal correlation. Repository: https://github.com/deepflowio/deepflow.