Deep Dive: How DeepFlow Collects Business Metrics for Large‑Model Services
This article explains how China Mobile built a hybrid-cloud production environment for its customer-service LLM, using DeepFlow's eBPF instrumentation and WebAssembly plugins to achieve zero-intrusion observability: the platform automatically captures full-stack topology, application and network metrics, and key LLM business indicators such as TTFT, TPOT, and token throughput.
Observability requirements for LLM services
Large language model (LLM) deployments hide their training, fine-tuning, and inference details, so response latency, accuracy, and data security behave like black boxes. Business teams forbid installing traditional APM probes out of concern for traffic impact, and different models are managed by different teams, which complicates building a unified topology and coordinating metrics. Any solution therefore has to be zero-intrusive.
Zero‑intrusive observability with DeepFlow eBPF
DeepFlow, integrated with the Pan‑Base PaaS platform, uses eBPF to provide out‑of‑the‑box, full‑stack application observability without code changes. It merges eBPF data with existing metrics to deliver call‑graph tracing, application and network metrics, and detailed request/response information.
Out‑of‑the‑box features
Full‑stack topology – agents deployed on the customer‑service LLM support system and the underlying base model automatically render a panoramic call graph across regions, clouds and container clusters.
Application and network metrics – CPU, memory, disk I/O, system load, throughput, TCP retransmission ratio, connection latency and active connection counts are generated for every service without code injection.
Request/response logs – each request and its corresponding response, together with underlying network flows, are recorded for deep troubleshooting.
Full‑stack tracing – end‑to‑end call chains from the customer‑service LLM to the base model are visualized, covering processes, containers and ingress gateways.
Business‑level metric schema
Infrastructure metrics: CPU usage, memory usage, disk I/O, system load.
Network metrics: throughput, TCP retransmission ratio, TCP connection-failure ratio, connection latency, active connections.
Application metrics: request rate, server-side error ratio, response latency.
LLM-specific metrics:
TTFT (Time To First Token) – time from request send to first token response.
TPOT (Time Per Output Token) – (finish‑response time – first‑token time) / total token count.
Token production rate – total token bytes divided by the interval between first and finish responses.
Request rate – total LLM requests per second.
Request latency – average time per LLM request.
Service concurrency – number of simultaneous long‑lived LLM connections.
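The metric definitions above can be expressed directly as arithmetic over per-request timestamps. The following is a minimal sketch of that arithmetic; the `streamTiming` struct and function names are illustrative, not part of the DeepFlow SDK, and timestamps are assumed to be in microseconds:

```go
package main

import "fmt"

// Hypothetical per-request timing sample; field names are illustrative.
type streamTiming struct {
	reqTimeUs    uint64 // request sent (microseconds)
	firstTokenUs uint64 // first token chunk observed
	finishUs     uint64 // final "0" chunk observed
	totalTokens  uint64 // output token count
}

// ttft: time from request send to first token response.
func ttft(s streamTiming) uint64 { return s.firstTokenUs - s.reqTimeUs }

// tpot: (finish time - first-token time) / total token count,
// guarding against a response with no counted tokens.
func tpot(s streamTiming) uint64 {
	if s.totalTokens == 0 {
		return 0
	}
	return (s.finishUs - s.firstTokenUs) / s.totalTokens
}

// tokenRate: tokens per second over the generation interval.
func tokenRate(s streamTiming) float64 {
	d := s.finishUs - s.firstTokenUs
	if d == 0 {
		return 0
	}
	return float64(s.totalTokens) / (float64(d) / 1e6)
}

func main() {
	s := streamTiming{reqTimeUs: 1_000_000, firstTokenUs: 1_250_000, finishUs: 3_250_000, totalTokens: 100}
	fmt.Println(ttft(s), tpot(s), tokenRate(s)) // → 250000 20000 50
}
```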
Parsing LLM traffic with a DeepFlow Wasm plugin
The Wasm plugin runs in a sandboxed environment, parses HTTP chunked‑transfer streams, and extracts timestamps needed to compute TTFT, TPOT and token production rate.
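For context on what the plugin must parse: a chunked HTTP body alternates hexadecimal chunk-size lines with data, and a final "0" line marks the end of the stream. Below is a minimal stand-alone sketch of that walk, assuming a complete body in memory; `scanChunks` and its return values are illustrative helpers, not DeepFlow SDK calls:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// scanChunks walks an HTTP chunked-transfer body: it notes the first
// data chunk (whose arrival time yields TTFT in the real plugin) and
// stops at the terminating "0" chunk (which yields the finish time).
func scanChunks(body string) (firstChunk string, nChunks int, done bool) {
	r := bufio.NewReader(strings.NewReader(body))
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return firstChunk, nChunks, done
		}
		size, err := strconv.ParseUint(strings.TrimSpace(line), 16, 64)
		if err != nil {
			continue // not a chunk-size line
		}
		if size == 0 { // final "0" chunk: stream finished
			done = true
			return firstChunk, nChunks, done
		}
		data := make([]byte, size)
		if _, err := io.ReadFull(r, data); err != nil {
			return firstChunk, nChunks, done
		}
		if nChunks == 0 {
			firstChunk = string(data)
		}
		nChunks++
		r.ReadString('\n') // consume the CRLF trailing the chunk data
	}
}

func main() {
	body := "5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
	first, n, done := scanChunks(body)
	fmt.Println(first, n, done)
}
```

The production plugin sees one captured payload at a time rather than a complete body, which is why it keeps per-flow state instead of looping like this sketch.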
Define StreamInfo struct and a map to store per‑stream data.
Detect streaming requests by checking whether the HTTP path contains /generate_stream.
Register the payload‑parse hook at HOOK_POINT_PAYLOAD_PARSE.
During payload parsing, distinguish request and response directions:
For requests, record the request timestamp.
For responses, read chunked data, identify the first token chunk, record its timestamp and token count, and on the final “0” chunk compute TTFT and TPOT.
Emit the calculated key-value pairs (ttft and tpot) as L7 protocol information.
Initialize the plugin in main(), set the parser, and log a load message.
Core plugin code
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"net/http"
	"regexp"
	"strings"

	// DeepFlow Wasm plugin SDK (import path per the DeepFlow plugin docs)
	"github.com/deepflowio/deepflow-wasm-go-sdk/sdk"
)

// Compiled once rather than on every response payload.
var statusLine = regexp.MustCompile(`^HTTP/[1-2]\.[01] \d{3} .*`)

type llmParser struct {
	httpStream map[uint64]*StreamInfo
}

type StreamInfo struct {
	reqTime              uint64 // request time
	respFirstChunkedTime uint64 // first chunk response time
	totalToken           uint64 // total output token count
	flag                 int    // whether the first chunk time has been recorded
}

func checker(payload []byte) (protoNum uint8, protoStr string) {
	req, err := http.ReadRequest(bufio.NewReader(bytes.NewReader(payload)))
	if err != nil {
		return 0, ""
	}
	if strings.Contains(req.URL.Path, "/generate_stream") {
		sdk.Warn(fmt.Sprintf("check: %s", req.URL.Path))
		return 1, "http_stream"
	}
	return 0, ""
}

func (p *llmParser) HookIn() []sdk.HookBitmap {
	return []sdk.HookBitmap{sdk.HOOK_POINT_PAYLOAD_PARSE}
}

func (p *llmParser) OnCheckPayload(baseCtx *sdk.ParseCtx) (uint8, string) {
	if baseCtx.EbpfType != sdk.EbpfTypeNone {
		return 0, ""
	}
	payload, err := baseCtx.GetPayload()
	if err != nil {
		return 0, ""
	}
	if baseCtx.Direction == sdk.DirectionRequest {
		return checker(payload)
	}
	return 0, ""
}

func (p *llmParser) OnParsePayload(baseCtx *sdk.ParseCtx) sdk.Action {
	if baseCtx.L7 != 1 {
		return sdk.ActionNext()
	}
	payload, err := baseCtx.GetPayload()
	if err != nil {
		return sdk.ActionAbortWithErr(err)
	}
	flowId := baseCtx.FlowID
	if p.httpStream[flowId] == nil {
		p.httpStream[flowId] = &StreamInfo{}
	}
	switch baseCtx.Direction {
	case sdk.DirectionRequest:
		req, err := http.ReadRequest(bufio.NewReader(bytes.NewReader(payload)))
		if err != nil {
			return sdk.ActionNext()
		}
		p.httpStream[flowId].reqTime = baseCtx.Time
		info := &sdk.L7ProtocolInfo{
			Req:  &sdk.Request{Resource: req.URL.Path},
			Resp: &sdk.Response{},
		}
		return sdk.ParseActionAbortWithL7Info([]*sdk.L7ProtocolInfo{info})
	case sdk.DirectionResponse:
		r := bufio.NewReader(bytes.NewReader(payload))
		bs, _, err := r.ReadLine()
		if err == io.EOF {
			return sdk.ActionNext()
		}
		// Skip the HTTP status line; only chunk lines are of interest.
		if statusLine.MatchString(string(bs)) {
			return sdk.ActionNext()
		}
		if string(bs) == "0" {
			s := p.httpStream[flowId]
			// Guard against a stream that ended before any token chunk
			// was recorded, which would otherwise divide by zero.
			if s.flag == 0 || s.totalToken == 0 {
				delete(p.httpStream, flowId)
				return sdk.ActionNext()
			}
			attr := []sdk.KeyVal{
				{Key: "ttft", Val: fmt.Sprintf("%d", s.respFirstChunkedTime-s.reqTime)},
				{Key: "tpot", Val: fmt.Sprintf("%d", (baseCtx.Time-s.respFirstChunkedTime)/s.totalToken)},
			}
			info := &sdk.L7ProtocolInfo{Req: &sdk.Request{}, Resp: &sdk.Response{}, Kv: attr}
			delete(p.httpStream, flowId)
			return sdk.ParseActionAbortWithL7Info([]*sdk.L7ProtocolInfo{info})
		}
		// First token chunk: record its timestamp and size.
		if p.httpStream[flowId].flag == 0 {
			p.httpStream[flowId].flag = 1
			p.httpStream[flowId].respFirstChunkedTime = baseCtx.Time
			p.httpStream[flowId].totalToken = uint64(len(bs))
			return sdk.ActionNext()
		}
		// Subsequent chunks: accumulate the output size.
		p.httpStream[flowId].totalToken += uint64(len(bs))
		return sdk.ActionNext()
	default:
		return sdk.ActionNext()
	}
}

func main() {
	sdk.Warn("llm wasm plugin loaded")
	llm := &llmParser{httpStream: map[uint64]*StreamInfo{}}
	sdk.SetParser(llm)
}

Compilation and deployment
tinygo build -o llm.wasm -target wasi -gc=precise -panic=trap -scheduler=none -no-debug ./llm/llm.go
deepflow-ctl plugin create --type wasm --image llm.wasm --name llm
deepflow-ctl plugin list

The Wasm binary is uploaded to the DeepFlow server and distributed to agents without restarting the target service.
Grafana visualization
Collected metrics are displayed in Grafana dashboards integrated with the Pan‑Base UI and the panoramic topology view. TTFT, TPOT and token production rate enable rapid identification of performance bottlenecks and support billing monitoring for external models.
Roadmap
Prompt input/output tracing and token consumption monitoring.
Function‑level GPU/HBM performance profiling for training and fine‑tuning.
Millisecond‑level RDMA communication analysis.
Real‑time heterogeneous GPU performance indicators.
Reference
DeepFlow Wasm plugin documentation: https://deepflow.io/docs/zh/integration/process/wasm-plugin/