Turning APIServer Logs into Time‑Series Metrics for Fast Root‑Cause Detection

This article explains how to enrich Kubernetes APIServer observability by converting access logs into time‑series metrics, applying SPL‑based aggregation, anomaly detection, and root‑cause drill‑down, and supplementing with OpenTelemetry tracing to quickly pinpoint failures during large‑scale outages.

Alibaba Cloud Native

Feb 25, 2025

With the rise of large language models, many enterprises aim to embed these models into their products via APIs, which act as bridges to make model capabilities widely and conveniently usable while ensuring safety and maintainability.

On December 11, 2024, OpenAI suffered a global outage that affected ChatGPT, the API, Sora, Playground, and Labs for more than four hours. The root cause was a new deployment that sent a surge of requests to the Kubernetes APIServer. The overloaded APIServer broke DNS resolution, which in turn disrupted data‑plane services. The incident shows why comprehensive log, trace, and metric coverage matters: it makes it possible to alert quickly, locate the root cause, and reduce downtime.

Metric Collection

Prometheus provides a rich built‑in metric system that can monitor most components. Alibaba Cloud’s Prometheus Observability Service offers out‑of‑the‑box metrics and dashboards.

Enhanced Collection Link (Out‑bound)

When a cluster experiences issues, in‑bound data links fail together with the cluster, but an out‑bound link remains functional, allowing logs and events to be collected even during internal failures.

Access Logs

APIServer access logs record request source, status, latency, etc. Example log line:

I1219 15:30:45.123456 12345 audit.go:123] "Audit" verb="create" uri="/api/v1/namespaces/default/pods" user="system:serviceaccount:kube-system:default" srcIP="192.168.1.100:56789" userAgent="ilogtail/v0.0.0" response=201
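To work with these dimensions programmatically, the key=value fields first have to be pulled out of the raw line. The following is a minimal Python sketch, assuming every field follows the `key="value"` or `key=value` shape of the example above; the function name `parse_access_log` is hypothetical, and real log formats may need a stricter parser.

```python
import re

# Matches key="quoted value" pairs as well as bare values such as response=201.
FIELD_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')

def parse_access_log(line: str) -> dict:
    """Extract key=value fields from one APIServer access log line."""
    fields = {}
    for key, quoted, bare in FIELD_RE.findall(line):
        # Exactly one of the two capture groups is non-empty for each match.
        fields[key] = quoted if quoted else bare
    return fields

line = ('I1219 15:30:45.123456 12345 audit.go:123] "Audit" verb="create" '
        'uri="/api/v1/namespaces/default/pods" '
        'user="system:serviceaccount:kube-system:default" '
        'srcIP="192.168.1.100:56789" userAgent="ilogtail/v0.0.0" response=201')

record = parse_access_log(line)
print(record["verb"], record["uri"], record["response"])
# create /api/v1/namespaces/default/pods 201
```

All values come back as strings, so the response code would need an explicit `int()` conversion before numeric comparisons.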

Typical dimensions include:

Client (userAgent): over 50 different sources such as ilogtail/v0.0.0, metrics‑server/v0.0.0, cert‑manager/v1.9.1.

K8s Resource (uri): more than 100 resources like services, leases, ingresses.

Verb: GET, LIST, WATCH, etc.

Combining these dimensions yields a huge space (≈50 × 100 × 10 = 5 × 10⁴ possible combinations), making manual root‑cause hunting like finding a needle in a haystack.

From Logs to Time‑Series Metrics

By counting requests per minute for each userAgent + uri + verb combination, a request‑rate (QPS) time series can be built per combination. SPL (SLS Processing Language) is used to generate the series and apply AIOps algorithms.
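The per-minute aggregation itself is simple bucketing. As an illustration outside SLS, here is a hedged Python sketch; the record layout (a `time` field in epoch seconds plus the three dimensions), the helper names, and the sample records are assumptions made for this example only.

```python
from collections import Counter

def per_minute_counts(records):
    """Count requests per (minute, userAgent, uri, verb) bucket."""
    counts = Counter()
    for rec in records:
        minute = rec["time"] - rec["time"] % 60  # truncate epoch seconds to the minute
        counts[(minute, rec["userAgent"], rec["uri"], rec["verb"])] += 1
    return counts

def to_series(counts, key, start, end):
    """Build a dense per-minute series for one combination, filling gaps with 0."""
    return [counts.get((t, *key), 0) for t in range(start, end, 60)]

# Hypothetical sample records, not taken from the article's data set.
records = [
    {"time": 1734398345, "userAgent": "ilogtail/v0.0.0", "uri": "/api/v1/pods", "verb": "list"},
    {"time": 1734398350, "userAgent": "ilogtail/v0.0.0", "uri": "/api/v1/pods", "verb": "list"},
    {"time": 1734398470, "userAgent": "ilogtail/v0.0.0", "uri": "/api/v1/pods", "verb": "list"},
]

counts = per_minute_counts(records)
key = ("ilogtail/v0.0.0", "/api/v1/pods", "list")
series = to_series(counts, key, 1734398340, 1734398520)
print(series)  # [2, 0, 1]
```

Dividing each count by 60 gives the average QPS for that minute. Filling missing minutes with zero matters because anomaly detection expects evenly spaced points.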

Global QPS Aggregation and Anomaly Detection

First aggregate all dimensions into a global QPS metric:

* | extend ts= second_to_nano(date_trunc(60, __time__))
| stats request_count=count(1) by ts
| make-series request_count_arr = request_count on ts

The resulting series shows a clear spike between 09:20 and 09:35 on 12‑17. Anomaly detection is performed with the series_decompose_anomalies operator:

... | extend ret = series_decompose_anomalies(ts, request_count_arr)
| extend anomalies_score_series = ret.anomalies_score_series
| project ts, request_count_arr, anomalies_score_series
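The internals of series_decompose_anomalies are not described in this article. To give a feel for what scoring a series means, the sketch below uses a different and much simpler technique, a robust z-score based on the median and the median absolute deviation (MAD). It is a stand-in under that assumption, not the SLS algorithm, and the sample series and threshold are hypothetical.

```python
from statistics import median

def robust_scores(series):
    """Score each point by its distance from the median, in MAD units."""
    med = median(series)
    # Fall back to 1.0 when the MAD is zero to avoid division by zero.
    mad = median(abs(x - med) for x in series) or 1.0
    return [abs(x - med) / mad for x in series]

def anomalous_indices(series, threshold=5.0):
    """Return the indices whose robust score exceeds the threshold."""
    return [i for i, s in enumerate(robust_scores(series)) if s > threshold]

# Hypothetical per-minute request counts with a three-minute burst.
series = [50, 52, 49, 51, 53, 340, 350, 345, 50, 52]
print(anomalous_indices(series))  # [5, 6, 7]
```

Unlike a decomposition-based detector, this sketch ignores trend and seasonality, so it would misfire on traffic with strong daily cycles.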

High anomaly scores are observed in the interval 1734398340 – 1734399120 (score 0.85). Other detected intervals include:

1734345480 – 1734345480 (score 0.01)

1734398220 – 1734398220 (score 0.07)

1734399840 – 1734399840 (score 0.03)

1734482880 – 1734482880 (score 0.13)

1734526740 – 1734526740 (score 0.04)
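The interval boundaries are easier to relate to the chart once converted to wall-clock time, and the drill-down step needs them in nanoseconds. A small Python sketch follows, assuming the values are Unix epoch seconds and the chart is rendered in UTC+8; both are assumptions, and the helper name `describe` is hypothetical.

```python
from datetime import datetime, timedelta, timezone

CST = timezone(timedelta(hours=8))  # assumption: the chart uses UTC+8

def describe(epoch_seconds):
    """Return a month-day hour:minute label and the nanosecond timestamp."""
    local = datetime.fromtimestamp(epoch_seconds, tz=CST)
    return local.strftime("%m-%d %H:%M"), epoch_seconds * 1_000_000_000

start_label, start_ns = describe(1734398340)
end_label, end_ns = describe(1734399120)
print(start_label, end_label)  # 12-17 09:19 12-17 09:32
print(start_ns, end_ns)        # 1734398340000000000 1734399120000000000
```

Under these assumptions the highest-scoring interval lines up with the spike visible between roughly 09:20 and 09:35 on 12-17.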

Root‑Cause Drill‑Down

To pinpoint the offending dimension combination, the series_drilldown operator is applied on userAgent, verb, and resource:

* | extend resource = json_extract_scalar(objectRef, '$.resource')
| extend ts= second_to_nano(date_trunc(60, __time__))
| stats request_count=count(1) by ts, userAgent, verb, resource
| make-series request_count_arr = request_count on ts by userAgent, verb, resource
| stats userAgent_arr = array_agg(userAgent), verb_arr = array_agg(verb), resource_arr = array_agg(resource), ts_arr = array_agg(ts), metrics_arr = array_agg(request_count_arr)
| extend ret = series_drilldown(userAgent_arr, verb_arr, resource_arr, ts_arr, metrics_arr, 1734398340000000000, 1734399120000000000)
| project ret

The algorithm returns the combination verb=GET and resource=leases as the likely cause.

{
  "attrs": [{ "verb": "GET", "resource": "leases" }],
  "statistics": {
    "relative_ratio": 0.84611478200,
    "relative_unexpected_difference": 0.73918632088300926,
    "difference": -293.23809523809524,
    "predict": 53.33333333333331,
    "real": 346.57142857142856,
    "support": 117
  }
}
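The figures in this output can be cross-checked with simple arithmetic. The Python sketch below shows that `difference` equals `predict - real` for the reported values, and that `relative_ratio` is close to `(real - predict) / real`. Whether the operator actually defines these fields this way is an assumption inferred from the numbers, not something the article states.

```python
import math

# Values copied from the drill-down output.
stats = {
    "relative_ratio": 0.846114782,
    "difference": -293.23809523809524,
    "predict": 53.33333333333331,
    "real": 346.57142857142856,
}

# Observed relation: the reported difference is predict minus real.
difference = stats["predict"] - stats["real"]
# Assumed relation: share of the real value not explained by the prediction.
ratio = (stats["real"] - stats["predict"]) / stats["real"]

print(math.isclose(difference, stats["difference"], abs_tol=1e-9))
print(math.isclose(ratio, stats["relative_ratio"], abs_tol=1e-4))
```

Read this way, the matching requests ran at about 347 per minute during the anomaly window against an expected value of about 53, which is roughly a 6.5-fold increase.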

Aggregating only the requests that match this combination confirms the sharp increase. Because Lease objects are stored in etcd and back node heartbeats and leader election, a surge in lease GET requests points to possible problems with etcd performance, network latency, or an abnormal number of nodes or pods.
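The idea behind the drill-down, comparing each dimension combination inside the anomaly window with its behavior outside it, can be approximated in a few lines of Python. This is a deliberately naive sketch and not the series_drilldown algorithm: it uses the mean outside the window as the prediction, and the function name, client names, and sample data are all hypothetical.

```python
def rank_by_window_increase(series_by_key, timestamps, start, end):
    """Rank combinations by mean count inside [start, end] minus mean outside."""
    ranked = []
    for key, values in series_by_key.items():
        inside = [v for t, v in zip(timestamps, values) if start <= t <= end]
        outside = [v for t, v in zip(timestamps, values) if not (start <= t <= end)]
        if not inside or not outside:
            continue  # no baseline or no window data for this combination
        real = sum(inside) / len(inside)
        predict = sum(outside) / len(outside)  # naive baseline, an assumption
        ranked.append((real - predict, key, predict, real))
    ranked.sort(reverse=True)  # largest increase first
    return ranked

# Hypothetical per-minute series keyed by (userAgent, verb, resource).
timestamps = [0, 60, 120, 180, 240, 300]
series_by_key = {
    ("client-a", "GET", "leases"): [50, 55, 340, 350, 52, 54],
    ("client-b", "LIST", "pods"): [10, 11, 12, 10, 11, 10],
}

ranked = rank_by_window_increase(series_by_key, timestamps, 120, 180)
increase, key, predict, real = ranked[0]
print(key, predict, real)  # ('client-a', 'GET', 'leases') 52.75 345.0
```

A production drill-down also has to search over partial combinations, such as verb and resource without userAgent, which is what allows the result above to omit the client dimension.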

Trace‑Based Diagnosis

Beyond logs and metrics, OpenTelemetry‑based tracing can diagnose critical operations. APIServer can trace incoming HTTP requests, webhook calls, etcd operations, and re‑entrant requests.

Steps to enable tracing in Alibaba Cloud Container Service (ACK):

Enable APIServer tracing in the ACK component management console.

After enabling, view APIServer trace data in the call‑chain analysis UI.

Observe the full processing flow of a Deployment API request, including authentication, etcd query, and response serialization.

Combining logs, metrics, and traces provides three‑dimensional observability coverage that accelerates fault analysis, reduces MTTR, and improves system stability.
