
How to Achieve Service‑Level NAS Traffic Tracing with eBPF and Kubernetes

This article explains how to design and implement a service‑level NAS traffic tracing solution using Linux eBPF, NFS kernel hooks, and Kubernetes metadata to correlate container processes with NAS devices, generate real‑time metrics, and visualize them in Prometheus dashboards.

DeWu Technology

Background

NAS is a distributed file system that provides shared, elastic, high‑performance storage for thousands of compute nodes and is widely used in algorithm training and application deployment. However, NAS vendors offer only host‑level tracing, which makes it difficult to pinpoint which service caused an anomaly.

Flow Tracing Research and Validation

NAS Working Principle

NAS is built on NFS (primarily v3). Clients mount remote directories via NFS, and file operations are sent over RPC to the NAS server.

NFS File System Read/Write Process

In the Linux kernel, NFS file operations are defined in struct file_operations nfs_file_operations. A read request enters nfs_file_read, which eventually calls nfs_initiate_read to send an RPC request; the completion path can be observed through the rpc_task_end and nfs_readpage_done tracepoints. Because the kernel issues every request from the host IP, the NAS server can only identify the host, not the specific container or process.
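Before attaching anything, it is worth verifying that these tracepoints exist on the running kernel. A minimal sketch in Go, assuming tracefs is mounted at /sys/kernel/tracing (older kernels expose it under /sys/kernel/debug/tracing):

package main

import (
    "fmt"
    "os"
)

// Checks that the NFS/SUNRPC tracepoints used later in this article are
// exposed by the running kernel before the eBPF programs try to attach.
func main() {
    tracepoints := []string{
        "nfs/nfs_initiate_read",
        "nfs/nfs_readpage_done",
        "sunrpc/rpc_task_begin",
        "sunrpc/rpc_task_end",
    }
    for _, tp := range tracepoints {
        _, err := os.Stat("/sys/kernel/tracing/events/" + tp)
        fmt.Printf("%-26s available=%v\n", tp, err == nil)
    }
}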

Mapping Process to Container Context

Each process has a PID, and its struct task_struct holds a cgroup pointer. For containerized processes, the cgroup name embeds the container ID (e.g., docker-2b3b0ba12e92...983.scope); the ID can be parsed out and matched against container runtime metadata to obtain the Pod namespace, Pod name, and container name, as sketched below.
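A minimal sketch of that parsing step, assuming systemd-style cgroup names of the form docker-<id>.scope (the helper name and the 12-character short-ID convention are illustrative):

package main

import (
    "fmt"
    "strings"
)

// containerIDFromCgroup extracts the short container ID from a cgroup name
// such as "docker-2b3b0ba12e92....scope".
func containerIDFromCgroup(name string) string {
    name = strings.TrimSuffix(name, ".scope")
    if i := strings.LastIndex(name, "-"); i >= 0 {
        name = name[i+1:]
    }
    if len(name) > 12 {
        name = name[:12] // short ID as used by docker/containerd tooling
    }
    return name
}

func main() {
    fmt.Println(containerIDFromCgroup("docker-2b3b0ba12e92aaaabbbbccccdddd.scope"))
    // Output: 2b3b0ba12e92
}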

Parsing Mount Information

Mount records (most conveniently /proc/self/mountinfo) provide a device ID (major:minor, e.g., 0:660) and the remote NAS address. By linking the device ID carried in the kernel tracepoints (the dev field) with the mount record, a kernel event can be associated with a NAS address.
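For reference, an NFS entry in /proc/self/mountinfo looks roughly like this (IDs and addresses are illustrative); the third field carries major:minor, and the mount source after the "-" separator carries the NAS address:

    3954 2861 0:660 / /mnt/nas rw,relatime shared:1537 - nfs 192.168.1.10:/share rw,vers=3,addr=192.168.1.10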

Architecture Design and Implementation

Overall Architecture

The solution consists of three parts: a kernel‑space eBPF program that attaches to NFS tracepoints, a user‑space collector that reads eBPF maps, and a metrics exporter that pushes data to Prometheus.
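The data flow, roughly:

    NFS/SUNRPC tracepoints --(eBPF programs)--> BPF maps --(collector)--> metric cache --(exporter)--> Prometheus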

Kernel eBPF Program Flow

SEC("tracepoint/nfs/nfs_initiate_read")
int tp_nfs_init_read(struct trace_event_raw_nfs_initiate_read *ctx) {
    dev_t dev_id = BPF_CORE_READ(ctx, dev); // NAS device ID (e.g., 660)
    u64 file_id = BPF_CORE_READ(ctx, fileid);
    u32 count = BPF_CORE_READ(ctx, count);
    struct task_struct *task = (struct task_struct *)bpf_get_current_task();
    const char *cname = BPF_CORE_READ(task, cgroups, subsys[0], cgroup, kn, name);
    if (cname) {
        bpf_core_read_str(&info.container, MAX_PATH_LEN, cname);
    }
    bpf_map_update_elem(&link_begin, &tid, &info, BPF_ANY);
}

SEC("tracepoint/nfs/nfs_readpage_done")
int tp_nfs_read_done(struct trace_event_raw_nfs_readpage_done *ctx) {
    // ... omitted ...
}

SEC("tracepoint/sunrpc/rpc_task_begin")
int tp_rpc_task_begin(struct trace_event_raw_rpc_task_running *ctx) {
    // ... omitted ...
}

SEC("tracepoint/sunrpc/rpc_task_end")
int tp_rpc_task_done(struct trace_event_raw_rpc_task_running *ctx) {
    // ... omitted ...
}

User Space Program Architecture

The collector continuously reads events from eBPF maps, extracts dev_id and container_id, looks up the NAS address from a mount‑info cache and the Pod metadata from a container‑info cache, then composes a metric key and value.

func (m *BPFEventMgr) ProcessIOMetric() {
    var nextKey IOMetricKey  // key/value types mirroring the eBPF map layout (assumed names)
    var event IOMetricValue
    iter := m.ioMetricMap.Iterate()
    for iter.Next(&nextKey, &event) {
        // Device ID -> NAS address via the mount-info cache.
        mountInfo, ok := m.mountMgr.Find(int(nextKey.DevId))
        if !ok {
            continue
        }
        // Short container ID -> Pod metadata via the container-info cache.
        podInfo, ok := m.criMgr.Find(getContainerID(nextKey.Container))
        if !ok {
            continue
        }
        metricKey, metricValue := formatMetricData(nextKey, mountInfo, podInfo)
        metricCache.Store(metricKey, metricValue)
    }
    // Export to Prometheus
}
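The article does not name its eBPF loader; as one common choice, a hypothetical load-and-attach step using the cilium/ebpf library might look like this (the object file name nfs_trace.o is an assumption, and the program names mirror the kernel snippet above):

package main

import (
    "log"

    "github.com/cilium/ebpf"
    "github.com/cilium/ebpf/link"
)

func main() {
    // Load the compiled eBPF object (file name is an assumption).
    coll, err := ebpf.LoadCollection("nfs_trace.o")
    if err != nil {
        log.Fatal(err)
    }
    defer coll.Close()

    // Attach each program to its tracepoint; keep the links alive for the
    // lifetime of the collector.
    hooks := []struct{ prog, group, name string }{
        {"tp_nfs_init_read", "nfs", "nfs_initiate_read"},
        {"tp_nfs_read_done", "nfs", "nfs_readpage_done"},
        {"tp_rpc_task_begin", "sunrpc", "rpc_task_begin"},
        {"tp_rpc_task_done", "sunrpc", "rpc_task_end"},
    }
    var links []link.Link
    for _, h := range hooks {
        l, err := link.Tracepoint(h.group, h.name, coll.Programs[h.prog], nil)
        if err != nil {
            log.Fatal(err)
        }
        links = append(links, l)
    }
    select {} // block; real code would run the collector loop here
}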

Metadata Caches

The mount cache parses /proc/self/mountinfo to map dev_id to the NAS address; the container cache reads Docker/containerd metadata to map short container IDs to the Pod namespace, Pod name, and container name.

type MountInfo struct {
    DevID         int
    RemoteDir     string
    LocalMountDir string
    NASAddr       string
}

type PodInfo struct {
    Namespace     string
    PodName       string
    ContainerName string
    UID           string
    ContainerID   string
}
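A minimal sketch of building the mount cache, assuming the MountInfo struct above. It scans /proc/self/mountinfo, keeps only NFS entries, and keys them by the minor device number, which for NFS's anonymous device (major 0) matches the dev value in the tracepoint example above; error handling is trimmed:

package main

import (
    "bufio"
    "fmt"
    "os"
    "strconv"
    "strings"
)

func loadNFSMounts() (map[int]MountInfo, error) {
    f, err := os.Open("/proc/self/mountinfo")
    if err != nil {
        return nil, err
    }
    defer f.Close()

    mounts := map[int]MountInfo{}
    sc := bufio.NewScanner(f)
    for sc.Scan() {
        fields := strings.Fields(sc.Text())
        // Find the "-" separator; fstype and mount source follow it.
        sep := -1
        for i, fld := range fields {
            if fld == "-" {
                sep = i
                break
            }
        }
        if sep < 0 || sep+2 >= len(fields) || !strings.HasPrefix(fields[sep+1], "nfs") {
            continue
        }
        // fields[2] is "major:minor"; for NFS the major is 0, so the minor
        // is what distinguishes mounts.
        minor, err := strconv.Atoi(strings.SplitN(fields[2], ":", 2)[1])
        if err != nil {
            continue
        }
        src := strings.SplitN(fields[sep+2], ":", 2) // e.g. "192.168.1.10:/share"
        if len(src) != 2 {
            continue
        }
        mounts[minor] = MountInfo{
            DevID:         minor,
            RemoteDir:     src[1],
            LocalMountDir: fields[4],
            NASAddr:       src[0],
        }
    }
    return mounts, sc.Err()
}

func main() {
    m, _ := loadNFSMounts()
    fmt.Printf("%+v\n", m)
}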

Custom Metrics Exporter

A Prometheus collector implements Describe and Collect, exposing a gauge metric nfs_io_metric with labels nfs_server, ns, pod, container, op, type.

type Collector interface {
    Describe(chan<- *Desc)
    Collect(chan<- Metric)
}

func (m *MetricMgr) Describe(ch chan<- *prometheus.Desc) {
    ch <- m.nfsIOMetric
}

func (m *MetricMgr) Collect(ch chan<- prometheus.Metric) {
    m.activeMutex.Lock()
    defer m.activeMutex.Unlock()
    for _, v := range m.ioMetricCounters {
        ch <- prometheus.MustNewConstMetric(m.nfsIOMetric, prometheus.GaugeValue, v.Count, v.Labels...)
    }
}
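Wiring it up is then standard Prometheus client usage; a hypothetical main (the NewMetricMgr constructor and the port are assumptions):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    m := NewMetricMgr() // assumed constructor for the collector above
    prometheus.MustRegister(m)

    // Expose /metrics for Prometheus to scrape.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9100", nil))
}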

Summary

The implemented solution provides real‑time, service‑level NAS traffic tracing, exposing IOPS, throughput, and latency metrics per task and per Pod. Dashboards allow filtering by environment and NAS address, reducing incident diagnosis time from hours to minutes and supporting bandwidth‑level optimization.

Tags: Observability, Kubernetes, Metrics, eBPF, NFS, NAS