Stabilizing Unstable Kubernetes Clusters: CI/CD, Monitoring, Logging Blueprint
This article analyzes the root causes of a company's unstable Kubernetes clusters and presents a comprehensive solution covering a revamped CI/CD pipeline, federated monitoring and alerting, centralized logging, documentation practices, and clear traffic routing to achieve high reliability and stability.
Introduction
Our company's clusters are constantly on the brink of collapse. Over the past three months we identified the main instability factors:
Unstable release process
Lack of a monitoring platform (the most critical issue)
Missing logging system
Severe shortage of operational documentation
Unclear request routing
Overall, the primary problem is the absence of a trustworthy monitoring platform; secondary issues are unclear server roles and an unstable release process.
Solution
Unstable Release Process
Refactor the release workflow by fully containerizing services and building a Kubernetes‑centric CI/CD pipeline.
Release Process
The workflow is as follows:
Developers push code to the developer branch, which is kept up to date.
Merging developer into the target release branch sends a WeChat Work alert and spins up a GitLab Runner pod in the Kubernetes cluster.
The CI/CD stages run tests, build the image, and update the pods. An initial deployment may also create the namespace, image pull secret, persistent volume, deployment, service, and ingress.
Images are pushed to an Alibaba Cloud repository and pulled over the VPC, avoiding public bandwidth limits.
When the pipeline finishes, the runner pod is destroyed and GitLab reports the result.
Note: ConfigMaps and Secrets are excluded from the resource list for security reasons.
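A minimal sketch of what the deploy step invoked by the pipeline might look like. The registry path, tag scheme, and function names are assumptions rather than details from the article, and the kubectl calls are left as comments because first-run resources differ per service:

```shell
#!/bin/sh
# Hypothetical sketch of the kubernetes.sh deploy step; names and tag
# scheme are assumptions, not taken from the article.

# Derive an image tag from the branch name and short commit SHA.
image_tag() {
  branch="$1"
  sha="$2"
  case "$branch" in
    release*) echo "release-${sha}" ;;   # release branches get a uniform prefix
    *)        echo "${branch}-${sha}" ;;
  esac
}

# GitLab CI exposes CI_COMMIT_BRANCH / CI_COMMIT_SHORT_SHA; defaults are for local runs.
IMAGE="registry.example.com/demo/app:$(image_tag "${CI_COMMIT_BRANCH:-developer}" "${CI_COMMIT_SHORT_SHA:-abc1234}")"
echo "deploying $IMAGE"

# First deployment only (namespace, pull secret, PV, deployment, service, ingress):
#   kubectl apply -f k8s/            # resource list, minus ConfigMaps/Secrets
# Subsequent releases just roll the image:
#   kubectl -n demo set image deployment/app app="$IMAGE"
```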
Our code repository uses Rancher as a multi‑cluster management platform; security concerns are handled in Rancher’s dashboard.
Service Deployment Diagram
The diagram shows that Kong replaces Nginx for authentication, authorization, and proxying, with the SLB IP bound to Kong. Jobs 0-2 are test jobs, job 3 builds the image, and jobs 4-7 handle pod changes. Not all services require storage; that decision is made in kubernetes.sh. A unified CI template is recommended across all environments, and branch strategy should follow established best practices.
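The job layout described above (jobs 0-2 for tests, 3 for the build, 4-7 for pod changes) could be captured in a unified template roughly like the following sketch; the job names and variables are assumptions:

```yaml
# Hedged sketch of a shared .gitlab-ci.yml template; job and variable
# names are illustrative, not from the article.
stages: [test, build, deploy]

unit-test:
  stage: test
  script:
    - make test

build-image:
  stage: build
  script:
    - docker build -t "$IMAGE" .
    - docker push "$IMAGE"

update-pods:
  stage: deploy
  script:
    - sh kubernetes.sh "$IMAGE"   # creates first-run resources or rolls the image
```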
Lack of Monitoring and Alerting Platform
Build a trustworthy federated monitoring platform tailored to our cluster environment, enabling simultaneous monitoring of multiple clusters and pre‑failure alerts.
Monitoring & Alerting Diagram
The solution combines Prometheus, shell or Go scripts, and Sentry, with alerts delivered via WeChat Work or corporate email. The three colored lines in the diagram represent the three monitoring methods. Scripts handle backup alerts, certificate-expiry alerts, and intrusion detection. Prometheus is deployed from a custom resource list based on the Prometheus Operator, storing data on NAS. Sentry, though an error-tracking platform rather than a general log collector, handles business-level error monitoring here.
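As one concrete example of the script-based alerts, a certificate-expiry check might look like the sketch below. The 14-day threshold and the webhook delivery are assumptions, and the openssl lookup is left as a comment so the logic stays self-contained:

```shell
#!/bin/sh
# Sketch of a certificate-expiry alert; the 14-day threshold and the
# WeChat Work webhook delivery are assumptions.
THRESHOLD_DAYS=14

# Whole days between an expiry timestamp and "now" (both epoch seconds).
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

# $1: expiry epoch, $2: current epoch. In the real script the expiry would
# come from something like:
#   openssl s_client -connect "$host:443" </dev/null 2>/dev/null \
#     | openssl x509 -noout -enddate
check_cert() {
  left=$(days_left "$1" "$2")
  if [ "$left" -lt "$THRESHOLD_DAYS" ]; then
    echo "ALERT: cert expires in ${left}d"   # would POST to the alert webhook
  else
    echo "OK: ${left}d remaining"
  fi
}
```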
We adopt a federated monitoring approach instead of deploying separate platforms per cluster, providing a unified visual interface. Monitoring is implemented at three levels: operating system, application, and business. Traffic monitoring targets Kong directly.
Federated Monitoring Architecture
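The federation layout can be sketched as a minimal scrape config on the top-level Prometheus, which pulls aggregated series from each cluster's local instance; the cluster endpoints are assumptions:

```yaml
# Sketch of the global Prometheus federating per-cluster instances;
# target addresses are illustrative.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'          # pull every job's series from each child Prometheus
    static_configs:
      - targets:
          - 'prometheus.cluster-a.example:9090'
          - 'prometheus.cluster-b.example:9090'
```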
Missing Logging System
As the business fully migrates to Kubernetes, a robust logging system becomes essential because pod restarts make log retrieval difficult.
Logging System Diagram
After Kubernetes adoption, logs are fragmented across pod lifecycles. Options for long‑term storage include remote storage or host‑mounted logs. For visualization and analysis, we choose Elasticsearch to build a centralized log collection system.
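A node-level collector shipping container logs into Elasticsearch is one common way to realize this; the sketch below assumes Filebeat as the collector and illustrative host and index names, neither of which the article specifies:

```yaml
# Minimal sketch of a node-level log collector feeding Elasticsearch;
# the collector choice (Filebeat), hosts, and index name are assumptions.
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log   # survives pod restarts on the host

output.elasticsearch:
  hosts: ["elasticsearch.logging.svc:9200"]
  index: "k8s-logs-%{+yyyy.MM.dd}"
```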
Severe Lack of Operational Documentation
Establish a documentation hub centered on Yuque for all operational procedures, scripts, and issue records, ensuring easy access while respecting security constraints.
Documentation must capture core steps despite security restrictions; concise yet comprehensive records are vital for both operations and development.
Unclear Request Routing
Redesign traffic routing across clusters to provide integrated authentication, authorization, proxy, connection, protection, control, and observability, thereby limiting fault propagation.
Request Routing Diagram
External requests pass through a Kong gateway for authentication, then enter specific namespaces (project isolation). Microservices communicate via Istio for mutual TLS, interact with databases, access persistent volumes, or invoke conversion services as needed before responding.
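The mutual-TLS step between microservices can be enforced declaratively in Istio; a minimal sketch for one isolated project namespace (the namespace name is an assumption) would be:

```yaml
# Sketch: require mutual TLS for all workloads in one project namespace;
# the namespace name is illustrative.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: project-a
spec:
  mtls:
    mode: STRICT
```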
Conclusion
In summary, building a Kubernetes‑centric CI/CD pipeline, a Prometheus‑based federated monitoring and alerting platform, an Elasticsearch‑driven logging system, a Yuque‑based documentation center, and an integrated Kong‑Istio traffic management layer can ensure high availability and reliability for our clusters.
Overall architecture diagram:
Source: cnblogs.com/zisefeizhu/p/13692782.html
Open Source Linux