How to Stabilize Your Kubernetes Clusters: CI/CD, Monitoring, Logging, and Docs
This article analyzes why our Kubernetes clusters were constantly unstable—citing an erratic release process, missing monitoring, logging, documentation, and unclear request routing—and presents a comprehensive solution that includes a Kubernetes‑centric CI/CD pipeline, federated monitoring, centralized logging, a documentation hub, and integrated traffic management.
Preface
Our clusters were constantly on the brink of failure; after three months we identified the main causes: unstable release process, lack of a monitoring platform, missing logging system, severe shortage of operational documentation, and unclear request routing.
Overall, the primary issue is the absence of a reliable monitoring and alerting platform; secondary issues are unclear server roles and an unstable release process.
Solution
Unstable Release Process
Refactor the release workflow by fully Kubernetes‑izing the business and building a CI/CD pipeline centered on Kubernetes.
Release workflow overview:
Brief analysis: developers push code to the develop branch, which is always kept up to date, and then merge it into the branch for the target environment. The merge triggers a WeChat notification and starts a GitLab Runner pod in the Kubernetes cluster, which runs the CI/CD steps: test cases, image build, and pod update. On the first deployment, the pipeline may also need to create the Namespace, imagePullSecret, PV, Deployment, Service, Ingress, and so on. Images are pushed to the Alibaba Cloud registry and pulled over the VPC, avoiding public-bandwidth limits. When the pipeline finishes, the Runner pod is destroyed and GitLab reports the result.
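The workflow above can be sketched as a GitLab CI pipeline. This is a minimal illustration, not our exact configuration: the stage names, runner tag, registry address, and deployment name are all hypothetical placeholders.

```yaml
# Hypothetical .gitlab-ci.yml sketch — image names, tags, and deployment names are illustrative.
stages:
  - test
  - build
  - deploy

run-tests:
  stage: test
  tags: [k8s-runner]          # executes in a GitLab Runner pod inside the cluster
  script:
    - make test

build-image:
  stage: build
  tags: [k8s-runner]
  script:
    # Push to the Alibaba Cloud registry; pulls over the VPC endpoint avoid public-bandwidth limits.
    - docker build -t registry-vpc.example.aliyuncs.com/app/web:$CI_COMMIT_SHORT_SHA .
    - docker push registry-vpc.example.aliyuncs.com/app/web:$CI_COMMIT_SHORT_SHA

deploy:
  stage: deploy
  tags: [k8s-runner]
  script:
    # A first deployment would also apply Namespace/imagePullSecret/PV/Service/Ingress manifests.
    - kubectl set image deployment/web web=registry-vpc.example.aliyuncs.com/app/web:$CI_COMMIT_SHORT_SHA
  only:
    - staging
    - production
```

Because the Runner pod lives only for the duration of the pipeline, it consumes cluster resources only while a job is running.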
Note: resource manifests do not include ConfigMap or Secret for security reasons; Rancher is used as the multi‑cluster management platform, and such security concerns are handled in its dashboard.
Lack of Monitoring and Alerting Platform
Build a reliable federated monitoring platform that simultaneously monitors multiple clusters and provides pre‑failure alerts.
Because we run several Kubernetes clusters, deploying a separate monitoring stack per cluster would be cumbersome to operate. Instead we adopt a federated approach with a unified visual interface, implementing three monitoring levels: OS, application, and business. Traffic monitoring targets Kong, using Grafana dashboard template 7424.
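A federated setup means each cluster runs its own Prometheus, and a central Prometheus scrapes aggregated series from them via the `/federate` endpoint. The sketch below shows such a federation job on the central instance; the match expressions and per-cluster endpoints are hypothetical placeholders.

```yaml
# Hypothetical scrape config on the central Prometheus; cluster addresses are placeholders.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true            # keep the labels set by the per-cluster Prometheus
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~"kubernetes-.*"}'   # OS- and application-level series
        - '{__name__=~"kong_.*"}'    # Kong traffic metrics for the Grafana dashboard
    static_configs:
      - targets:
          - 'prometheus.cluster-a.example.com'
          - 'prometheus.cluster-b.example.com'
```

The central instance then backs a single Grafana, giving one pane of glass across all clusters.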
Missing Logging System
As the business fully migrates to Kubernetes, a robust, filterable logging system is needed to simplify fault analysis.
Brief analysis: after the move to Kubernetes, log management becomes harder because each pod restart starts a fresh log stream and the earlier logs become hard to retrieve. The options are shipping logs to remote storage or mounting log files onto the host. We chose Elasticsearch as the backend for a centralized log collection system.
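One common way to realize this is a log shipper running as a DaemonSet that reads the container log files from each node's host path and forwards them, enriched with Kubernetes metadata, to Elasticsearch. The following Filebeat configuration is a sketch under that assumption; the Elasticsearch address and index name are placeholders.

```yaml
# Hypothetical Filebeat DaemonSet configuration; the ES service address is a placeholder.
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log   # host-mounted pod logs survive pod restarts
    processors:
      - add_kubernetes_metadata:    # attach namespace/pod/container labels for filtering
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: /var/log/containers/

setup.template.name: "k8s-logs"
setup.template.pattern: "k8s-logs-*"
output.elasticsearch:
  hosts: ["http://elasticsearch.logging.svc:9200"]
  index: "k8s-logs-%{+yyyy.MM.dd}"
```

With the Kubernetes metadata attached, logs can be filtered in Kibana by namespace, pod, or container, which is what makes fault analysis across restarts practical.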
Severe Lack of Operational Documentation
Establish a documentation hub centered on Yuque for operations, recording procedures, issues, scripts, etc., for easy reference.
Due to security concerns, documentation access is limited; nevertheless, thorough documentation of every operational step is essential.
Unclear Request Routing
Redesign cluster‑level traffic routing to provide integrated authentication, authorization, proxy, connection, protection, control, and observability, limiting fault blast radius.
Brief analysis: traffic that passes Kong gateway authentication enters its project's namespace (namespaces isolate projects from one another); microservices communicate through Istio, which handles service-to-service authentication and authorization; each service then talks to databases, storage, or conversion services as needed before returning a response.
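The in-namespace part of this design can be expressed with two Istio policies: one enforcing mutual TLS between services, and one allowing ingress only from the gateway. The namespace, policy names, and Kong service-account principal below are illustrative assumptions, not our actual manifests.

```yaml
# Hypothetical Istio policies for one project namespace; all names are illustrative.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: project-a
spec:
  mtls:
    mode: STRICT                 # service-to-service traffic must use mutual TLS
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-from-gateway
  namespace: project-a
spec:
  rules:
    - from:
        - source:
            # only workloads running as the Kong gateway's service account may call in
            principals: ["cluster.local/ns/kong/sa/kong-gateway"]
```

Scoping both policies to the namespace is what limits the blast radius: a compromised or misbehaving service in another project cannot reach into `project-a` directly.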
Conclusion
By constructing a Kubernetes‑centric CI/CD pipeline, a Prometheus‑based federated monitoring platform, an Elasticsearch‑based logging system, a Yuque‑based documentation center, and a Kong‑plus‑Istio integrated traffic management layer, we can achieve high availability and reliability for our clusters.
Efficient Ops
This public account, maintained by Xiaotianguo and friends, regularly publishes original technical articles. We focus on operations transformation and aim to accompany you throughout your operations career, growing together.