
Diagnosing Kubernetes APIServer Outages with Logs, Metrics, and Traces

This article explains how to build a comprehensive observability stack for the Kubernetes APIServer using Prometheus metrics, access-log analysis, SPL-driven time-series generation, anomaly detection, root-cause drill-down, and OpenTelemetry tracing to quickly locate and resolve service disruptions.


Observability for APIServer Using Logs, Metrics, and Traces

With the rise of large-model APIs, many enterprises embed powerful models into their services, making API reliability crucial. A global OpenAI outage on December 11 highlighted how excessive requests can overload the Kubernetes APIServer, break DNS resolution, and cause widespread service disruption.

Metric‑Based Monitoring

Prometheus provides a rich built‑in metric system that can cover most component monitoring needs. Alibaba Cloud’s Prometheus Observability Service offers out‑of‑the‑box dashboards and default metric collections.


Enhanced Log Collection (Outbound Data Link)

When the APIServer and its monitoring agents run in the same cluster, a cluster failure can also bring down the monitoring system. Alibaba Cloud introduces an "outbound data link" that remains functional even if the internal cluster network fails, ensuring logs and events are still collected.

Access Log Details

APIServer access logs record request source, status, latency, client (userAgent), resource (uri), and verb. Example log entry:

I1219 15:30:45.123456 12345 audit.go:123] "Audit" verb="create" uri="/api/v1/namespaces/default/pods" user="system:serviceaccount:kube-system:default" srcIP="192.168.1.100:56789" userAgent="ilogtail/v0.0.0" response=201
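
With these fields extracted, even simple SPL aggregations become useful for triage. The sketch below is hedged: it assumes the field names above are indexed, and that the where instruction and a numeric cast are available in your SPL environment. It surfaces clients and URIs that receive server-side errors:

* | where cast(response as bigint) >= 500
| stats error_count=count(1) by userAgent, uri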

Typical dimensions include dozens of userAgents, hundreds of resources, and multiple verbs, creating a combinatorial explosion (e.g., 50 × 100 × 10 = 5 × 10⁴ possible combinations).
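
Before building time series, it helps to see this cardinality directly. A minimal sketch, reusing only the extraction shown later in this article (objectRef and the other field names are assumed to match your log schema), counts requests per dimension combination:

* | extend resource=json_extract_scalar(objectRef, '$.resource')
| stats request_count=count(1) by userAgent, verb, resource

Each distinct (userAgent, verb, resource) row is one candidate series, which is why inspecting dimensions by hand does not scale.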

From Logs to Time‑Series with SPL

Using SPL (SLS Processing Language), we can aggregate the access logs per minute into a QPS time series and run anomaly detection on it. The query below builds the global (all-request) series first; the root-cause drill-down later breaks it down by userAgent, verb, and resource.

* | extend ts=second_to_nano(date_trunc(60, __time__))
| stats request_count=count(1) by ts
| make-series request_count_arr=request_count on ts

The make-series step packs the per-minute counts into an array that the downstream anomaly operators consume. The resulting global QPS chart shows a clear spike on December 17 between 09:20 and 09:35.


Anomaly Detection

Feeding the per-minute series into the series_decompose_anomalies operator scores each time interval:

... | extend ret=series_decompose_anomalies(ts, request_count_arr)
| extend anomalies_score_series=ret.anomalies_score_series
| project ts, request_count_arr, anomalies_score_series

The algorithm flags the interval [1734398340, 1734399120] with a high anomaly score (0.85).

Start (s)       End (s)         Score
1734345480      1734345480      0.01
1734398220      1734398220      0.07
1734398340      1734399120      0.85
1734399840      1734399840      0.03
1734482880      1734482880      0.13
1734526740      1734526740      0.04

Root‑Cause Drill‑Down

Using the series_drilldown operator over the dimensions (userAgent, verb, resource), restricted to the anomalous window identified above (passed as nanosecond timestamps in the query below), isolates the offending combination.

* | extend resource=json_extract_scalar(objectRef, '$.resource')
| extend ts=second_to_nano(date_trunc(60, __time__))
| stats request_count=count(1) by ts, userAgent, verb, resource
| make-series request_count_arr=request_count on ts by userAgent, verb, resource
| stats userAgent_arr=array_agg(userAgent), verb_arr=array_agg(verb), resource_arr=array_agg(resource), ts_arr=array_agg(ts), metrics_arr=array_agg(request_count_arr)
| extend ret=series_drilldown(userAgent_arr, verb_arr, resource_arr, ts_arr, metrics_arr, 1734398340000000000, 1734399120000000000)
| project ret

The result points to verb=GET and resource=leases as the anomalous dimension.

{
  "attrs": [{"verb": "GET", "resource": "leases"}],
  "statistics": {
    "relative_ratio": 0.84611478200,
    "relative_unexpected_difference": 0.73918632088300926,
    "difference": -293.23809523809524,
    "predict": 53.33333333333331,
    "real": 346.57142857142856,
    "support": 117
  }
}

Aggregating only the requests that match this combination confirms the sharp increase in GET requests for leases, pointing to possible etcd performance degradation, network latency, or an abnormal number of nodes and pods.
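
A sketch of such a verification query, assuming the where instruction is available and the same field names as the drill-down query above, restricts the per-minute series to the offending combination:

* | extend resource=json_extract_scalar(objectRef, '$.resource')
| where verb='GET' and resource='leases'
| extend ts=second_to_nano(date_trunc(60, __time__))
| stats request_count=count(1) by ts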


Trace‑Based Diagnosis

Beyond logs and metrics, OpenTelemetry tracing provides end‑to‑end visibility of APIServer operations, including HTTP requests, webhook calls, etcd interactions, and re‑entrant requests.

Alibaba Cloud Container Service for Kubernetes (ACK) offers an out-of-the-box APIServer tracing solution that automatically reports control-plane traces to the OpenTelemetry backend.

The typical workflow is:

1. Enable tracing for the APIServer in the ACK component management console.
2. View the reported trace data in the call-chain analysis console.
3. Inspect the full processing flow of a Deployment API request, covering authentication, etcd queries, and object serialization.


Conclusion

By combining log, metric, and trace data for critical components like the APIServer, teams gain three-dimensional observability coverage that accelerates fault analysis, reduces downtime, and improves overall system stability. Explore more observability cases in the Alibaba Cloud SLS documentation.
