How Kubernetes Monitoring Evolved: From Heapster to Metrics‑Server and Prometheus
This article explains the fundamentals of monitoring and logging in large‑scale Kubernetes clusters, classifies monitoring types, traces the evolution from Heapster to the lightweight metrics‑server, outlines the three Kubernetes monitoring APIs, reviews Prometheus as the de‑facto standard, and describes Alibaba Cloud’s enhanced monitoring and logging solutions.
Why Monitoring and Logging Matter
Monitoring and logging are essential infrastructure for large distributed systems; monitoring shows runtime health while logs help diagnose problems. In Kubernetes they are part of the ecosystem rather than core components, so most capabilities rely on cloud‑provider integrations that implement the standard interfaces.
Types of Monitoring
Resource Monitoring – CPU, memory, network usage, typically exposed as numeric or percentage metrics (e.g., Zabbix, Telegraf).
Performance Monitoring (APM) – Hook‑based collection of JVM, PHP, or other runtime metrics such as GC cycles, memory generations, or connection counts for performance tuning.
Security Monitoring – Policies for privilege escalation, vulnerability scanning, etc.
Event Monitoring – Captures normal and warning events from state‑machine transitions; warning events are forwarded to alert channels (DingTalk, SMS, email) after offline aggregation.
Kubernetes Monitoring Evolution
Early Kubernetes versions (before 1.10) used Heapster for metric collection. On each node the kubelet bundled cAdvisor and exposed three APIs: a summary API, the kubelet API, and a Prometheus‑compatible API. Heapster periodically pulled data from the nodes, aggregated it in memory, and exposed a service for consumers such as the dashboard or the HPA controller.
Heapster was deprecated because:
It could not easily expose custom metrics (e.g., online user count) beyond basic resource data.
Its many sinks (InfluxDB, SLS, DingTalk, etc.) were poorly maintained, leading to bugs that remained unfixed.
These issues motivated the creation of the lightweight metrics‑server, which retains the core cAdvisor data source but provides a simplified layer that registers its metrics API with the Kubernetes API server through the aggregation layer.
Kubernetes Monitoring API Standards
Resource Metrics – Served under the metrics.k8s.io API group and implemented by metrics‑server. Provides node‑, pod‑, namespace‑, and class‑level resource metrics.
Custom Metrics – Served under the custom.metrics.k8s.io API group, typically by an adapter backed by Prometheus. Allows applications to expose arbitrary metrics (e.g., online users, MySQL slow queries) via the Prometheus client library.
External Metrics – Implemented by external.metrics.k8s.io. Enables consumption of cloud‑provider metrics such as message‑queue depth or load‑balancer request counts; Alibaba Cloud provides a Cloud‑Metrics‑Adapter for this API.
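As a rough illustration of how the three APIs are addressed, the sketch below builds the request paths under each API group and converts a CPU quantity into millicores. The helper names are hypothetical, and the `v1beta1` version segment is an assumption that may differ on a given cluster.

```python
from urllib.parse import quote

# API group prefixes as served through the API server's aggregation layer.
# The version segment (v1beta1) is an assumption; check your cluster.
RESOURCE_API = "/apis/metrics.k8s.io/v1beta1"
CUSTOM_API = "/apis/custom.metrics.k8s.io/v1beta1"
EXTERNAL_API = "/apis/external.metrics.k8s.io/v1beta1"


def node_metrics_path() -> str:
    """Path listing resource metrics for all nodes."""
    return f"{RESOURCE_API}/nodes"


def pod_metrics_path(namespace: str) -> str:
    """Path listing resource metrics for the pods in one namespace."""
    return f"{RESOURCE_API}/namespaces/{quote(namespace, safe='')}/pods"


def custom_pod_metric_path(namespace: str, metric: str) -> str:
    """Path for a custom metric across all pods ("*") in a namespace."""
    ns = quote(namespace, safe="")
    return f"{CUSTOM_API}/namespaces/{ns}/pods/*/{quote(metric, safe='')}"


def external_metric_path(namespace: str, metric: str) -> str:
    """Path for an external (for example cloud-provider) metric."""
    ns = quote(namespace, safe="")
    return f"{EXTERNAL_API}/namespaces/{ns}/{quote(metric, safe='')}"


def cpu_to_millicores(quantity: str) -> float:
    """Convert a CPU quantity such as "250m", "156340000n" or "2"."""
    factors = {"n": 1e-6, "u": 1e-3, "m": 1.0}
    suffix = quantity[-1]
    if suffix in factors:
        return int(quantity[:-1]) * factors[suffix]
    return float(quantity) * 1000.0
```

Any of these paths can be passed to `kubectl get --raw <path>` to inspect the raw JSON, provided the corresponding API is registered in the cluster.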
Prometheus – The De‑Facto Open‑Source Monitoring Standard
Prometheus is a CNCF graduated project and the preferred monitoring backend for many cloud‑native projects (Spark, TensorFlow, Flink, etc.). It offers three collection modes:
Pushgateway – Short‑lived jobs push metrics to a gateway, which Prometheus then scrapes.
Pull – Prometheus directly scrapes targets at regular intervals.
Prometheus‑on‑Prometheus – One Prometheus instance scrapes another for federation.
Prometheus supports service discovery (including native Kubernetes discovery via annotations), integrates with Alertmanager for email/SMS alerts, and can be visualized via Grafana or the built‑in web UI. Its key strengths are a simple client library, multiple ingestion methods, tight Kubernetes compatibility, a rich plugin ecosystem, and the powerful Prometheus Operator for lifecycle management.
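Annotation-based discovery can be pictured with the sketch below, which decides whether a pod should be scraped and at which URL. It assumes the widely used `prometheus.io/scrape`, `prometheus.io/port`, and `prometheus.io/path` annotations; these are a convention applied through relabeling rules in the Prometheus configuration rather than something Prometheus enforces, and the `scrape_target` helper is hypothetical.

```python
from typing import Dict, Optional


def scrape_target(pod_ip: str, annotations: Dict[str, str]) -> Optional[str]:
    """Return the scrape URL for a pod, or None if it has not opted in.

    Assumes the conventional prometheus.io/* annotations; a real setup
    expresses the same logic as relabel rules in the Prometheus config.
    """
    if annotations.get("prometheus.io/scrape") != "true":
        return None  # pod did not opt in to scraping
    port = annotations.get("prometheus.io/port")
    if port is None or not port.isdigit():
        return None  # this sketch requires an explicit numeric port
    path = annotations.get("prometheus.io/path", "/metrics")
    if not path.startswith("/"):
        path = "/" + path
    return f"http://{pod_ip}:{port}{path}"


if __name__ == "__main__":
    print(scrape_target("10.0.0.12", {
        "prometheus.io/scrape": "true",
        "prometheus.io/port": "8080",
    }))
```

Requiring an explicit port keeps the sketch small; actual configurations often fall back to the container ports that Kubernetes discovery reports.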
Logging Scenarios in Kubernetes
Four major logging scenarios are covered:
Host Kernel Logs – Diagnose network stack issues, driver failures, filesystem problems, kernel panics, OOM events.
Runtime Logs – Docker engine logs help troubleshoot pod hangs and container failures.
Core Component Logs – etcd, API server, scheduler, controller‑manager, and other control‑plane components provide insight into cluster health.
Application Logs – Business‑level logs reveal HTTP 500 errors, panics, and other application‑specific failures.
Log Collection Approaches
Host‑File Collection – Containers write logs to a host volume; a host‑side agent tails the files and forwards them.
Sidecar Streaming – A sidecar container streams logs to stdout, which is then captured by a local log‑rotator and an external agent.
Direct stdout – Logs are emitted to stdout and either collected by an agent or sent directly to a service such as SLS via its API.
The community‑recommended solution is Fluentd, which runs an agent on each node, forwards logs to a Fluentd server, and then ships them to back‑ends like Elasticsearch (visualized with Kibana) or InfluxDB (visualized with Grafana).
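The parsing a node-level agent performs on container stdout logs can be sketched as follows. The example assumes Docker's `json-file` log driver and the `<pod>_<namespace>_<container>-<container id>.log` file naming commonly found under `/var/log/containers`; other runtimes use a different line format, and the function names are hypothetical.

```python
import json
import re
from typing import Dict, Optional

# Assumed file name layout: <pod>_<namespace>_<container>-<64 hex id>.log
_LOG_NAME = re.compile(
    r"^(?P<pod>[^_]+)_(?P<namespace>[^_]+)_"
    r"(?P<container>.+)-(?P<container_id>[0-9a-f]{64})\.log$"
)


def parse_log_filename(filename: str) -> Optional[Dict[str, str]]:
    """Extract pod metadata from a container log file name."""
    match = _LOG_NAME.match(filename)
    return match.groupdict() if match else None


def parse_json_file_line(line: str) -> Dict[str, str]:
    """Parse one line written by Docker's json-file log driver."""
    record = json.loads(line)
    return {
        "message": record.get("log", "").rstrip("\n"),
        "stream": record.get("stream", ""),
        "time": record.get("time", ""),
    }


def enrich(filename: str, line: str) -> Dict[str, str]:
    """Combine file-name metadata with the parsed log line."""
    entry = parse_json_file_line(line)
    entry.update(parse_log_filename(filename) or {})
    return entry
```

An agent would apply `enrich` to each new line it tails and forward the resulting record to its back-end.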
kube‑eventer – Offline Event Exporter
kube‑eventer (open‑source on GitHub) watches Kubernetes events (pod, node, component, CRD) via the API server and forwards them to sinks such as SLS, DingTalk, Kafka, or InfluxDB for offline audit, monitoring, and alerting. It can surface warning events (e.g., pod back‑off) through DingTalk notifications.
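The filtering step described above can be illustrated with a short sketch that keeps only warning events and formats them for an alert channel. This is not kube‑eventer's own code: the `warning_alerts` function is hypothetical, and the input is assumed to be event objects already decoded into dictionaries with the standard `type`, `reason`, `message`, and `involvedObject` fields.

```python
from typing import Dict, Iterable, List


def warning_alerts(events: Iterable[Dict]) -> List[str]:
    """Format an alert line for every event whose type is "Warning"."""
    alerts = []
    for event in events:
        if event.get("type") != "Warning":
            continue  # Normal events are not forwarded as alerts
        obj = event.get("involvedObject", {})
        kind = obj.get("kind", "Unknown")
        namespace = obj.get("namespace", "-")
        name = obj.get("name", "unknown")
        reason = event.get("reason", "Unknown")
        message = event.get("message", "")
        alerts.append(f"[{reason}] {kind} {namespace}/{name}: {message}")
    return alerts


if __name__ == "__main__":
    sample = [
        {"type": "Normal", "reason": "Scheduled",
         "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-0"},
         "message": "Successfully assigned default/web-0"},
        {"type": "Warning", "reason": "BackOff",
         "involvedObject": {"kind": "Pod", "namespace": "default", "name": "web-0"},
         "message": "Back-off restarting failed container"},
    ]
    for alert in warning_alerts(sample):
        print(alert)
```

A sink would then deliver each formatted line to its destination, such as a chat webhook or a message queue.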
Alibaba Cloud Container Service Monitoring Stack
The platform integrates four products:
SLS (Log Service) – Central log repository; collects audit logs, ingress logs, and application logs, with optional export to OSS or MaxCompute for archiving.
ARMS – Application performance monitoring for Java and PHP, providing diagnostics and tuning capabilities.
AHAS – Architecture‑aware monitoring that visualizes service topology, network bandwidth, traffic, and abnormal events.
Cloud Monitor – Basic resource‑metrics monitoring (node, pod) with alerting.
Alibaba Cloud adds enhancements:
Retains Heapster‑compatible sinks in the metrics‑server, allowing data export to SLS or InfluxDB.
Provides full Heapster compatibility across Kubernetes 1.7 – 1.14, preventing breakage when upgrading dashboards.
Extends npd (node‑problem‑detector) with additional checks (kernel‑hang, SNAT, file‑descriptor usage) and integrates with eventer for Kafka/DingTalk alerts.
Offers managed Prometheus with Helm charts, integrates with HiTSDB/InfluxDB for storage, and supplies specialized exporters for Spark, TensorFlow, Argo, etc.
Supports GPU monitoring (single‑card and shared GPU metrics).
Log Enhancements on Alibaba Cloud
Log collection now covers pod logs, core component logs, Docker engine logs, kernel logs, and middleware logs, all forwarded to SLS. From SLS, logs can be streamed to OSS, MaxCompute, OpenSearch, E‑Map, or Flink for real‑time analysis. Visualization is possible via Grafana or DataV.
Key Takeaways
Monitoring in Kubernetes includes resource, performance, security, and event dimensions.
Kubernetes monitoring has evolved from Heapster to the streamlined metrics‑server, with three standardized APIs (resource, custom, external).
Prometheus remains the dominant open‑source monitoring solution, offering flexible collection modes and rich ecosystem tools.
Logging spans host kernel, runtime, core component, and application layers; Fluentd is the recommended collector.
Alibaba Cloud extends the open‑source stack with compatible sinks, enhanced npd checks, managed Prometheus, and integrated log‑service pipelines for end‑to‑end observability.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.
