Deep Dive: How DeepFlow Collects Business Metrics for Large‑Model Services
This article explains how China Mobile built a hybrid-cloud production environment for its customer-service LLM, using DeepFlow's eBPF instrumentation and WebAssembly plugins to achieve zero-intrusion observability: the platform automatically captures full-stack topology, application and network metrics, and key LLM business indicators such as TTFT, TPOT, and token throughput.
Observability requirements for LLM services
Large language model (LLM) deployments hide their training, fine-tuning, and inference details, so response latency, accuracy, and data security behave like black boxes. Business teams forbid installing traditional APM probes out of concern for traffic impact, and different models are managed by different teams, which complicates building a unified topology and coordinating metrics. Any solution therefore has to be zero-intrusive.
Zero‑intrusive observability with DeepFlow eBPF
DeepFlow, integrated with the Pan‑Base PaaS platform, uses eBPF to provide out‑of‑the‑box, full‑stack application observability without code changes. It merges eBPF data with existing metrics to deliver call‑graph tracing, application and network metrics, and detailed request/response information.
Out‑of‑the‑box features
Full‑stack topology – agents deployed on the customer‑service LLM support system and the underlying base model automatically render a panoramic call graph across regions, clouds and container clusters.
Application and network metrics – CPU, memory, disk I/O, system load, throughput, TCP retransmission ratio, connection latency and active connection counts are generated for every service without code injection.
Request/response logs – each request and its corresponding response, together with underlying network flows, are recorded for deep troubleshooting.
Full‑stack tracing – end‑to‑end call chains from the customer‑service LLM to the base model are visualized, covering processes, containers and ingress gateways.
Business‑level metric schema
Infrastructure metrics: CPU usage, memory usage, disk I/O, system load.
Network metrics: throughput, TCP retransmission ratio, TCP connection-failure ratio, connection latency, active connections.
Application metrics: request rate, server-side error ratio, response latency.
LLM-specific metrics:
TTFT (Time To First Token) – time from request send to first token response.
TPOT (Time Per Output Token) – (finish‑response time – first‑token time) / total token count.
Token production rate – total token bytes divided by the interval between first and finish responses.
Request rate – total LLM requests per second.
Request latency – average time per LLM request.
Service concurrency – number of simultaneous long‑lived LLM connections.
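The metric definitions above can be expressed directly as arithmetic over per-request timestamps. The following is a minimal sketch of that arithmetic; the `streamTiming` struct and function names are illustrative, not part of the DeepFlow SDK, and timestamps are assumed to be in microseconds:

```go
package main

import "fmt"

// Hypothetical per-request timing sample; field names are illustrative.
type streamTiming struct {
	reqTimeUs    uint64 // request sent (microseconds)
	firstTokenUs uint64 // first token chunk observed
	finishUs     uint64 // final "0" chunk observed
	totalTokens  uint64 // output token count
}

// ttft: time from request send to first token response.
func ttft(s streamTiming) uint64 { return s.firstTokenUs - s.reqTimeUs }

// tpot: (finish time - first-token time) / total token count,
// guarding against a response with no counted tokens.
func tpot(s streamTiming) uint64 {
	if s.totalTokens == 0 {
		return 0
	}
	return (s.finishUs - s.firstTokenUs) / s.totalTokens
}

// tokenRate: tokens per second over the generation interval.
func tokenRate(s streamTiming) float64 {
	d := s.finishUs - s.firstTokenUs
	if d == 0 {
		return 0
	}
	return float64(s.totalTokens) / (float64(d) / 1e6)
}

func main() {
	s := streamTiming{reqTimeUs: 1_000_000, firstTokenUs: 1_250_000, finishUs: 3_250_000, totalTokens: 100}
	fmt.Println(ttft(s), tpot(s), tokenRate(s)) // → 250000 20000 50
}
```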
Parsing LLM traffic with a DeepFlow Wasm plugin
The Wasm plugin runs in a sandboxed environment, parses HTTP chunked‑transfer streams, and extracts timestamps needed to compute TTFT, TPOT and token production rate.
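For context on what the plugin must parse: a chunked HTTP body alternates hexadecimal chunk-size lines with data, and a final "0" line marks the end of the stream. Below is a minimal stand-alone sketch of that walk, assuming a complete body in memory; `scanChunks` and its return values are illustrative helpers, not DeepFlow SDK calls:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// scanChunks walks an HTTP chunked-transfer body: it notes the first
// data chunk (whose arrival time yields TTFT in the real plugin) and
// stops at the terminating "0" chunk (which yields the finish time).
func scanChunks(body string) (firstChunk string, nChunks int, done bool) {
	r := bufio.NewReader(strings.NewReader(body))
	for {
		line, err := r.ReadString('\n')
		if err != nil {
			return firstChunk, nChunks, done
		}
		size, err := strconv.ParseUint(strings.TrimSpace(line), 16, 64)
		if err != nil {
			continue // not a chunk-size line
		}
		if size == 0 { // final "0" chunk: stream finished
			done = true
			return firstChunk, nChunks, done
		}
		data := make([]byte, size)
		if _, err := io.ReadFull(r, data); err != nil {
			return firstChunk, nChunks, done
		}
		if nChunks == 0 {
			firstChunk = string(data)
		}
		nChunks++
		r.ReadString('\n') // consume the CRLF trailing the chunk data
	}
}

func main() {
	body := "5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n"
	first, n, done := scanChunks(body)
	fmt.Println(first, n, done)
}
```

The production plugin sees one captured payload at a time rather than a complete body, which is why it keeps per-flow state instead of looping like this sketch.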
Define StreamInfo struct and a map to store per‑stream data.
Detect streaming requests by checking whether the HTTP path contains /generate_stream.
Register the payload‑parse hook at HOOK_POINT_PAYLOAD_PARSE.
During payload parsing, distinguish request and response directions:
For requests, record the request timestamp.
For responses, read chunked data, identify the first token chunk, record its timestamp and token count, and on the final “0” chunk compute TTFT and TPOT.
Emit the calculated key-value pairs (ttft and tpot) as L7 protocol information.
Initialize the plugin in main(), set the parser, and log a load message.
Core plugin code
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"net/http"
	"regexp"
	"strings"

	// DeepFlow Wasm plugin SDK (import path per the DeepFlow plugin docs)
	"github.com/deepflowio/deepflow-wasm-go-sdk/sdk"
)

// Compiled once rather than on every response payload.
var statusLine = regexp.MustCompile(`^HTTP/[1-2]\.[01] \d{3} .*`)

type llmParser struct {
	httpStream map[uint64]*StreamInfo
}

type StreamInfo struct {
	reqTime              uint64 // request time
	respFirstChunkedTime uint64 // first chunk response time
	totalToken           uint64 // total output token count
	flag                 int    // whether the first chunk time has been recorded
}

func checker(payload []byte) (protoNum uint8, protoStr string) {
	req, err := http.ReadRequest(bufio.NewReader(bytes.NewReader(payload)))
	if err != nil {
		return 0, ""
	}
	if strings.Contains(req.URL.Path, "/generate_stream") {
		sdk.Warn(fmt.Sprintf("check: %s", req.URL.Path))
		return 1, "http_stream"
	}
	return 0, ""
}

func (p *llmParser) HookIn() []sdk.HookBitmap {
	return []sdk.HookBitmap{sdk.HOOK_POINT_PAYLOAD_PARSE}
}

func (p *llmParser) OnCheckPayload(baseCtx *sdk.ParseCtx) (uint8, string) {
	if baseCtx.EbpfType != sdk.EbpfTypeNone {
		return 0, ""
	}
	payload, err := baseCtx.GetPayload()
	if err != nil {
		return 0, ""
	}
	if baseCtx.Direction == sdk.DirectionRequest {
		return checker(payload)
	}
	return 0, ""
}

func (p *llmParser) OnParsePayload(baseCtx *sdk.ParseCtx) sdk.Action {
	if baseCtx.L7 != 1 {
		return sdk.ActionNext()
	}
	payload, err := baseCtx.GetPayload()
	if err != nil {
		return sdk.ActionAbortWithErr(err)
	}
	flowId := baseCtx.FlowID
	if p.httpStream[flowId] == nil {
		p.httpStream[flowId] = &StreamInfo{}
	}
	switch baseCtx.Direction {
	case sdk.DirectionRequest:
		req, err := http.ReadRequest(bufio.NewReader(bytes.NewReader(payload)))
		if err != nil {
			return sdk.ActionNext()
		}
		p.httpStream[flowId].reqTime = baseCtx.Time
		info := &sdk.L7ProtocolInfo{
			Req:  &sdk.Request{Resource: req.URL.Path},
			Resp: &sdk.Response{},
		}
		return sdk.ParseActionAbortWithL7Info([]*sdk.L7ProtocolInfo{info})
	case sdk.DirectionResponse:
		r := bufio.NewReader(bytes.NewReader(payload))
		bs, _, err := r.ReadLine()
		if err == io.EOF {
			return sdk.ActionNext()
		}
		// Skip the HTTP status line; only chunk lines are of interest.
		if statusLine.MatchString(string(bs)) {
			return sdk.ActionNext()
		}
		if string(bs) == "0" {
			s := p.httpStream[flowId]
			// Guard against a stream that ended before any token chunk
			// was recorded, which would otherwise divide by zero.
			if s.flag == 0 || s.totalToken == 0 {
				delete(p.httpStream, flowId)
				return sdk.ActionNext()
			}
			attr := []sdk.KeyVal{
				{Key: "ttft", Val: fmt.Sprintf("%d", s.respFirstChunkedTime-s.reqTime)},
				{Key: "tpot", Val: fmt.Sprintf("%d", (baseCtx.Time-s.respFirstChunkedTime)/s.totalToken)},
			}
			info := &sdk.L7ProtocolInfo{Req: &sdk.Request{}, Resp: &sdk.Response{}, Kv: attr}
			delete(p.httpStream, flowId)
			return sdk.ParseActionAbortWithL7Info([]*sdk.L7ProtocolInfo{info})
		}
		// First token chunk: record its timestamp and size.
		if p.httpStream[flowId].flag == 0 {
			p.httpStream[flowId].flag = 1
			p.httpStream[flowId].respFirstChunkedTime = baseCtx.Time
			p.httpStream[flowId].totalToken = uint64(len(bs))
			return sdk.ActionNext()
		}
		// Subsequent chunks: accumulate the output size.
		p.httpStream[flowId].totalToken += uint64(len(bs))
		return sdk.ActionNext()
	default:
		return sdk.ActionNext()
	}
}

func main() {
	sdk.Warn("llm wasm plugin loaded")
	llm := &llmParser{httpStream: map[uint64]*StreamInfo{}}
	sdk.SetParser(llm)
}

Compilation and deployment
tinygo build -o llm.wasm -target wasi -gc=precise -panic=trap -scheduler=none -no-debug ./llm/llm.go
deepflow-ctl plugin create --type wasm --image llm.wasm --name llm
deepflow-ctl plugin list

The Wasm binary is uploaded to the DeepFlow server and distributed to agents without restarting the target service.
Grafana visualization
Collected metrics are displayed in Grafana dashboards integrated with the Pan‑Base UI and the panoramic topology view. TTFT, TPOT and token production rate enable rapid identification of performance bottlenecks and support billing monitoring for external models.
Roadmap
Prompt input/output tracing and token consumption monitoring.
Function‑level GPU/HBM performance profiling for training and fine‑tuning.
Millisecond‑level RDMA communication analysis.
Real‑time heterogeneous GPU performance indicators.
Reference
DeepFlow Wasm plugin documentation: https://deepflow.io/docs/zh/integration/process/wasm-plugin/