Full‑Stack Monitoring of Kubernetes with Prometheus and Grafana (Part 4)
This guide walks through setting up Prometheus and Grafana to monitor a Kubernetes cluster and all business pods, covering the deployment of kube‑state‑metrics, the required RBAC objects, service definitions, and detailed Prometheus scrape configurations for both kube‑state‑metrics and cAdvisor.
The first step is to deploy kube‑state‑metrics, which exposes the state of cluster objects (nodes, pods, Deployments, and so on) as Prometheus metrics.
First, a ServiceAccount, ClusterRole, ClusterRoleBinding, Deployment, and Service are defined. The YAML manifests are:
apiVersion: v1
kind: ServiceAccount
metadata:
  name: kube-state-metrics
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kube-state-metrics
rules:
  - apiGroups: [""]
    resources:
      - nodes
      - pods
      - services
      - endpoints
      - replicationcontrollers
      - limitranges
      - persistentvolumeclaims
      - persistentvolumes
      - namespaces
      - resourcequotas
    verbs: ["list", "watch"]
  - apiGroups: ["apps"]
    resources:
      - deployments
      - daemonsets
      - statefulsets
      - replicasets
    verbs: ["list", "watch"]
  - apiGroups: ["batch"]
    resources:
      - jobs
      - cronjobs
    verbs: ["list", "watch"]
  - apiGroups: ["autoscaling"]
    resources:
      - horizontalpodautoscalers
    verbs: ["list", "watch"]
  - apiGroups: ["networking.k8s.io"]
    resources:
      - ingresses
      - networkpolicies
    verbs: ["list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources:
      - storageclasses
      - volumeattachments
    verbs: ["list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-state-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: kube-state-metrics
subjects:
  - kind: ServiceAccount
    name: kube-state-metrics
    namespace: kube-system
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  replicas: 1
  selector:
    matchLabels:
      app: kube-state-metrics
  template:
    metadata:
      labels:
        app: kube-state-metrics
    spec:
      serviceAccountName: kube-state-metrics
      containers:
        - name: kube-state-metrics
          # Using a domestic DaoCloud mirror for faster pulls
          image: k8s.m.daocloud.io/kube-state-metrics/kube-state-metrics:v2.13.0
          ports:
            - name: http-metrics
              containerPort: 8080
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
---
apiVersion: v1
kind: Service
metadata:
  name: kube-state-metrics
  namespace: kube-system
  labels:
    app: kube-state-metrics
spec:
  type: ClusterIP
  ports:
    - name: http-metrics
      port: 8080
      targetPort: 8080
      protocol: TCP
  selector:
    app: kube-state-metrics

After deploying these resources, the article adds Prometheus scrape jobs. The first job collects metrics from the kube-state-metrics service:
- job_name: 'kubernetes-kube-state-metrics'
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
    - role: service
      namespaces:
        names: ["kube-system"]
  relabel_configs:
    - source_labels: [__meta_kubernetes_service_name]
      action: keep
      regex: ^kube-state-metrics$
    - source_labels: [__meta_kubernetes_service_port_number]
      action: keep
      regex: ^8080$

The second job scrapes cAdvisor metrics from each node, using HTTPS and the Kubernetes API proxy:
- job_name: 'kubernetes-cadvisor'
  scrape_interval: 15s
  scrape_timeout: 10s
  # This path is overwritten by the __metrics_path__ relabel rule below.
  metrics_path: /metrics/cadvisor
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
    # Note: while insecure_skip_verify is true, the server certificate is
    # not verified, so ca_file has no practical effect.
    insecure_skip_verify: true
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node
  relabel_configs:
    # Send every scrape to the API server instead of the node itself.
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    # Build the per-node proxy path from the discovered node name.
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor
    # Copy Kubernetes node labels onto the scraped series.
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)

Once the configuration is added, the Prometheus server (referred to as “p8s” in the original text) must be reloaded or restarted to apply the new scrape jobs.
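One thing worth checking after the reload: the cAdvisor job goes through the API server proxy, so the ServiceAccount that Prometheus itself runs under needs permission on the `nodes/proxy` subresource, in addition to being able to list and watch the nodes and services it discovers. The article does not show the Prometheus RBAC objects, so the following is only a minimal sketch. The ServiceAccount name `prometheus` and the namespace `monitoring` are assumptions and should be replaced with whatever your Prometheus deployment actually uses.

```yaml
# Sketch only: the names "prometheus" and "monitoring" are assumptions,
# not taken from the article. Match them to your Prometheus deployment.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: prometheus
rules:
  - apiGroups: [""]
    resources:
      - nodes          # discovered by "role: node"
      - nodes/proxy    # needed for /api/v1/nodes/<node>/proxy/metrics/cadvisor
      - services       # discovered by "role: service"
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: prometheus
subjects:
  - kind: ServiceAccount
    name: prometheus
    namespace: monitoring
```

If the cAdvisor targets show up as down with a 403 Forbidden error on the Prometheus targets page, a missing `nodes/proxy` grant is a common cause.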
The article also includes three screenshots illustrating the configuration steps and the resulting monitoring dashboards; these are embedded as images.
The template IDs “14249, 15661” are noted without further explanation. Given the context, they most plausibly refer to Grafana dashboard templates to import, though the source does not say so.
Linux Cloud-Native Ops Stack
