
How to Stabilize Your Kubernetes Clusters: CI/CD, Monitoring, Logging, and Docs

This article analyzes why our Kubernetes clusters were constantly unstable—citing an erratic release process, missing monitoring, logging, documentation, and unclear request routing—and presents a comprehensive solution that includes a Kubernetes‑centric CI/CD pipeline, federated monitoring, centralized logging, a documentation hub, and integrated traffic management.


Preface

Our clusters were constantly on the brink of failure. After three months of investigation we identified five main causes: an unstable release process, the lack of a monitoring and alerting platform, a missing logging system, a severe shortage of operational documentation, and unclear request routing.

Overall, the primary issue is the absence of a reliable monitoring and alerting platform; secondary issues are unclear server roles and an unstable release process.

Solution

Unstable Release Process

Refactor the release workflow by migrating the business fully onto Kubernetes and building a CI/CD pipeline centered on Kubernetes.

Release workflow overview:

Brief analysis:

1. Developers push code to the develop branch, which is kept up to date, then merge it into the branch for the target environment.
2. The merge triggers a WeChat notification and starts a GitLab Runner pod inside the Kubernetes cluster.
3. The runner executes the CI/CD steps: running test cases, building the image, and updating the pods.
4. A first deployment may additionally need to create the Namespace, imagePullSecret, PV, Deployment, Service, Ingress, and so on.
5. Images are pushed to the Alibaba Cloud registry and pulled over the VPC, avoiding public-bandwidth limits.
6. When the pipeline finishes, the runner pod is destroyed and GitLab reports the result.
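The pipeline steps above could be sketched as a minimal `.gitlab-ci.yml`. This is an illustrative assumption, not our actual configuration: the stage names, project names, namespace, and registry path (`registry-vpc.cn-hangzhou.aliyuncs.com/myns/myapp`) are placeholders.

```yaml
# Hypothetical .gitlab-ci.yml sketch; adapt stages, names, and registry path.
stages:
  - test
  - build
  - deploy

run-tests:
  stage: test
  script:
    - make test            # run the project's test cases

build-image:
  stage: build
  script:
    # push to the Alibaba Cloud registry; in-cluster pulls use the VPC endpoint
    - docker build -t registry-vpc.cn-hangzhou.aliyuncs.com/myns/myapp:$CI_COMMIT_SHORT_SHA .
    - docker push registry-vpc.cn-hangzhou.aliyuncs.com/myns/myapp:$CI_COMMIT_SHORT_SHA

deploy:
  stage: deploy
  script:
    # update the running Deployment's image; a first deployment would
    # instead apply the full manifest set (Namespace, Deployment, Service, ...)
    - kubectl set image deployment/myapp
        myapp=registry-vpc.cn-hangzhou.aliyuncs.com/myns/myapp:$CI_COMMIT_SHORT_SHA
        -n myns
```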

Note: the resource manifests deliberately exclude ConfigMap and Secret for security reasons; we use Rancher as our multi-cluster management platform, and those sensitive objects are managed through its dashboard instead.
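For reference, a first deployment's manifest set might look like the minimal sketch below. All names and the registry path are placeholders, and ConfigMap/Secret are intentionally absent, as noted above.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: myns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: myns
spec:
  replicas: 2
  selector:
    matchLabels: { app: myapp }
  template:
    metadata:
      labels: { app: myapp }
    spec:
      imagePullSecrets:
        - name: aliyun-registry   # imagePullSecret created once per namespace
      containers:
        - name: myapp
          image: registry-vpc.cn-hangzhou.aliyuncs.com/myns/myapp:latest
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: myapp
  namespace: myns
spec:
  selector: { app: myapp }
  ports:
    - port: 80
      targetPort: 8080
```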

Lack of Monitoring and Alerting Platform

Build a reliable federated monitoring platform that simultaneously monitors multiple clusters and provides pre‑failure alerts.

Because we run several Kubernetes clusters, deploying a separate monitoring stack per cluster would be cumbersome. Instead we adopt a federated approach with a unified visual interface, implementing three monitoring levels: OS, application, and business. For traffic monitoring we target the Kong gateway, using Grafana dashboard template 7424.
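A federated setup typically has one central Prometheus scrape the `/federate` endpoint of each cluster's Prometheus. A sketch under that assumption (the job names, match expressions, and cluster addresses are placeholders):

```yaml
# Central Prometheus: pull selected series from each cluster's Prometheus
scrape_configs:
  - job_name: 'federate'
    honor_labels: true            # keep the original clusters' labels intact
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="kubernetes-nodes"}'   # OS-level metrics
        - '{job="kubernetes-pods"}'    # application-level metrics
        - '{__name__=~"kong_.*"}'      # Kong traffic metrics
    static_configs:
      - targets:
          - 'prom-cluster-a.example.internal:9090'
          - 'prom-cluster-b.example.internal:9090'
```

Alerting rules can then be defined once on the central instance, which is what makes pre-failure alerts across all clusters manageable from a single place.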

Missing Logging System

As the business fully migrates to Kubernetes, a robust, filterable logging system is needed to simplify fault analysis.

Brief analysis: after moving to Kubernetes, log management becomes harder because a restarted pod starts a fresh log stream and the previous container's logs are no longer readily visible. The options are shipping logs to remote storage or mounting log directories from the host; we choose Elasticsearch as the backbone of a centralized log collection system.
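One common way to feed Elasticsearch is a Filebeat DaemonSet that tails container logs on every node; this is a sketch of one possible collector config, not the article's actual setup, and the Elasticsearch host and index name are placeholders.

```yaml
# filebeat.yml fragment for a node-level DaemonSet:
# tail container logs and ship them to Elasticsearch
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log
    processors:
      - add_kubernetes_metadata:   # enrich events with pod/namespace labels
          host: ${NODE_NAME}       # so logs survive restarts and are filterable
          matchers:
            - logs_path:
                logs_path: "/var/log/containers/"

output.elasticsearch:
  hosts: ["http://elasticsearch.logging.svc:9200"]
  index: "k8s-logs-%{+yyyy.MM.dd}"
```

Because logs are shipped off the node as they are written, a pod restart no longer loses the previous container's output, and Kibana queries can filter by namespace, pod, or label.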

Severe Lack of Operational Documentation

Establish a documentation hub centered on Yuque, recording operational procedures, incidents, and scripts for easy reference.

Due to security concerns, access to the documentation is restricted; nevertheless, documenting every operational step thoroughly is essential.

Unclear Request Routing

Redesign cluster‑level traffic routing to provide integrated authentication, authorization, proxy, connection, protection, control, and observability, limiting fault blast radius.

Brief analysis: traffic first passes authentication at the Kong gateway, then enters a namespace that isolates each project; inside the mesh, microservices communicate via Istio, which handles service-to-service authentication and authorization; services then interact with databases, storage, or conversion services as needed before returning a response.
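As one hedged illustration of this isolation, Istio can require mutual TLS within a project namespace and restrict which identities may call into it. The namespace (`project-a`) and the Kong service-account principal below are placeholder assumptions.

```yaml
# Require mTLS for every workload in the project namespace
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: project-a
spec:
  mtls:
    mode: STRICT
---
# Allow only the gateway's service account to reach this namespace,
# limiting the blast radius of a compromised or misbehaving service
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-from-kong
  namespace: project-a
spec:
  action: ALLOW
  rules:
    - from:
        - source:
            principals: ["cluster.local/ns/kong/sa/kong-gateway"]
```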

Conclusion

By constructing a Kubernetes‑centric CI/CD pipeline, a Prometheus‑based federated monitoring platform, an Elasticsearch‑based logging system, a Yuque‑based documentation center, and a Kong‑plus‑Istio integrated traffic management layer, we can achieve high availability and reliability for our clusters.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
