
Diagnosing Kubernetes APIServer Outages with Logs, Metrics, and Traces

This article explains how to build a comprehensive observability stack for the Kubernetes APIServer using Prometheus metrics, access-log analysis, SPL-driven time-series generation, anomaly detection, root-cause drill-down, and OpenTelemetry tracing to quickly locate and resolve service disruptions.


Observability for APIServer Using Logs, Metrics, and Traces

With the rise of large-model APIs, many enterprises embed powerful models into their services, making API reliability crucial. A global OpenAI outage on December 11 highlighted how excessive requests can overload the Kubernetes APIServer, break DNS resolution, and cause widespread service disruption.

Metric‑Based Monitoring

Prometheus provides a rich built‑in metric system that can cover most component monitoring needs. Alibaba Cloud’s Prometheus Observability Service offers out‑of‑the‑box dashboards and default metric collections.


Enhanced Log Collection (Outbound Data Link)

When the APIServer and its monitoring agents run in the same cluster, a cluster failure can also bring down the monitoring system. Alibaba Cloud introduces an "outbound data link" that remains functional even if the internal cluster network fails, ensuring logs and events are still collected.

Access Log Details

APIServer access logs record request source, status, latency, client (userAgent), resource (uri), and verb. Example log entry:

I1219 15:30:45.123456 12345 audit.go:123] "Audit" verb="create" uri="/api/v1/namespaces/default/pods" user="system:serviceaccount:kube-system:default" srcIP="192.168.1.100:56789" userAgent="ilogtail/v0.0.0" response=201
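
With these fields extracted, even simple SPL aggregations become useful for triage. The sketch below is hedged: it assumes the field names above are indexed, and that the where instruction and a numeric cast are available in your SPL environment. It surfaces clients and URIs that receive server-side errors:

* | where cast(response as bigint) >= 500
| stats error_count=count(1) by userAgent, uri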

Typical dimensions include dozens of userAgents, hundreds of resources, and multiple verbs, creating a combinatorial explosion (e.g., 50 × 100 × 10 = 5 × 10⁴ possible combinations).
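
Before building time series, it helps to see this cardinality directly. A minimal sketch, reusing only the extraction shown later in this article (objectRef and the other field names are assumed to match your log schema), counts requests per dimension combination:

* | extend resource=json_extract_scalar(objectRef, '$.resource')
| stats request_count=count(1) by userAgent, verb, resource

Each distinct (userAgent, verb, resource) row is one candidate series, which is why inspecting dimensions by hand does not scale.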

From Logs to Time‑Series with SPL

Using SPL (SLS Processing Language), we can aggregate the access logs per minute into a QPS time series and run anomaly detection on it. The query below builds the global (all-request) series first; the root-cause drill-down later breaks it down by userAgent, verb, and resource.

* | extend ts=second_to_nano(date_trunc(60, __time__))
| stats request_count=count(1) by ts
| make-series request_count_arr=request_count on ts

The make-series step packs the per-minute counts into an array that the downstream anomaly operators consume. The resulting global QPS chart shows a clear spike on December 17 between 09:20 and 09:35.


Anomaly Detection

Feeding the per-minute series into the series_decompose_anomalies operator scores each time interval:

... | extend ret=series_decompose_anomalies(ts, request_count_arr)
| extend anomalies_score_series=ret.anomalies_score_series
| project ts, request_count_arr, anomalies_score_series

The algorithm flags the interval [1734398340, 1734399120] with a high anomaly score (0.85).

Start (s)       End (s)         Score
1734345480      1734345480      0.01
1734398220      1734398220      0.07
1734398340      1734399120      0.85
1734399840      1734399840      0.03
1734482880      1734482880      0.13
1734526740      1734526740      0.04

Root‑Cause Drill‑Down

Using the series_drilldown operator over the dimensions (userAgent, verb, resource), restricted to the anomalous window identified above (passed as nanosecond timestamps in the query below), isolates the offending combination.

* | extend resource=json_extract_scalar(objectRef, '$.resource')
| extend ts=second_to_nano(date_trunc(60, __time__))
| stats request_count=count(1) by ts, userAgent, verb, resource
| make-series request_count_arr=request_count on ts by userAgent, verb, resource
| stats userAgent_arr=array_agg(userAgent), verb_arr=array_agg(verb), resource_arr=array_agg(resource), ts_arr=array_agg(ts), metrics_arr=array_agg(request_count_arr)
| extend ret=series_drilldown(userAgent_arr, verb_arr, resource_arr, ts_arr, metrics_arr, 1734398340000000000, 1734399120000000000)
| project ret

The result points to verb=GET and resource=leases as the anomalous dimension.

{
  "attrs": [{"verb": "GET", "resource": "leases"}],
  "statistics": {
    "relative_ratio": 0.84611478200,
    "relative_unexpected_difference": 0.73918632088300926,
    "difference": -293.23809523809524,
    "predict": 53.33333333333331,
    "real": 346.57142857142856,
    "support": 117
  }
}

Aggregating only the requests that match this combination confirms the sharp increase in GET requests for leases, pointing to possible etcd performance degradation, network latency, or an abnormal number of nodes and pods.
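
A sketch of such a verification query, assuming the where instruction is available and the same field names as the drill-down query above, restricts the per-minute series to the offending combination:

* | extend resource=json_extract_scalar(objectRef, '$.resource')
| where verb='GET' and resource='leases'
| extend ts=second_to_nano(date_trunc(60, __time__))
| stats request_count=count(1) by ts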


Trace‑Based Diagnosis

Beyond logs and metrics, OpenTelemetry tracing provides end‑to‑end visibility of APIServer operations, including HTTP requests, webhook calls, etcd interactions, and re‑entrant requests.

Alibaba Cloud Container Service for Kubernetes (ACK) offers an out-of-the-box APIServer tracing solution that automatically reports control-plane traces to the OpenTelemetry backend.

The typical workflow is:

1. Enable tracing for the APIServer in the ACK component management console.
2. View the reported trace data in the call-chain analysis console.
3. Inspect the full processing flow of a Deployment API request, covering authentication, etcd queries, and object serialization.


Conclusion

By combining log, metric, and trace data for critical components like the APIServer, teams gain three-dimensional observability coverage that accelerates fault analysis, reduces downtime, and improves overall system stability. Explore more observability cases in the Alibaba Cloud SLS documentation.
