
Stabilizing Unstable Kubernetes Clusters: CI/CD, Monitoring, Logging Blueprint

This article analyzes the root causes of a company's unstable Kubernetes clusters and presents a comprehensive solution covering a revamped CI/CD pipeline, federated monitoring and alerting, centralized logging, documentation practices, and clear traffic routing to achieve high reliability and stability.


Introduction

Our company's clusters are constantly on the brink of collapse. Over the past three months we identified the main instability factors:

Unstable release process

Lack of a monitoring platform (the most critical issue)

Missing logging system

Severe shortage of operational documentation

Unclear request routing

Overall, the primary problem is the absence of a trustworthy monitoring platform; secondary issues are unclear server roles and an unstable release process.

Solution

Unstable Release Process

Refactor the release workflow by fully containerizing services and building a Kubernetes‑centric CI/CD pipeline.

Release Process

The workflow is as follows:

1. Developers push code to the `developer` branch, which is kept up to date, then merge it into the target release branch.
2. The merge triggers a WeChat Work notification and starts a GitLab Runner pod in the Kubernetes cluster.
3. The CI/CD stages run the tests, build the image, and update the pods; an initial deployment may also create a namespace, image pull secret, persistent volume, deployment, service, and ingress.
4. Images are pushed to an Alibaba Cloud registry and pulled over the VPC, avoiding public-bandwidth limits.
5. When the pipeline finishes, the runner pod is destroyed and GitLab reports the result.

Note: ConfigMaps and Secrets are excluded from the resource list for security reasons.
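The pipeline stages above can be sketched as a small shell script. The registry path, app name, and the way stages are wired together are illustrative assumptions, not our actual values; commands are echoed so the sketch stays side-effect free.

```shell
#!/bin/sh
# Sketch of the CI/CD stages: test, build, push, update pods.
# Registry path and app name are placeholders (assumptions).
REGISTRY="registry.cn-hangzhou.aliyuncs.com/demo"   # hypothetical Alibaba Cloud registry

# Compose the image reference from the app name and commit SHA.
image_ref() {
    app="$1"; sha="$2"
    echo "${REGISTRY}/${app}:${sha}"
}

run_tests() {
    # e.g. go test ./... or pytest, depending on the service
    echo "running test suite for $1"
}

build_and_push() {
    img="$(image_ref "$1" "$2")"
    echo "docker build -t ${img} ."
    echo "docker push ${img}"
}

update_pods() {
    img="$(image_ref "$1" "$2")"
    # Rolling update: point the deployment at the new image.
    echo "kubectl set image deployment/$1 $1=${img} -n $1"
}
```

In practice these functions would be invoked from the GitLab Runner pod, one per pipeline job, with the commit SHA supplied by the CI environment.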

We use Rancher as the multi-cluster management platform; security concerns are handled in Rancher's dashboard.

Service Deployment Diagram

The diagram shows that Kong replaces Nginx for authentication, authorization, and proxying, with the SLB IP bound to Kong. Jobs 0-2 are test jobs, job 3 is the build job, and jobs 4-7 are pod-change stages. Not all services require storage; that decision is made in `kubernetes.sh`. A unified CI template is recommended for all environments, and branch strategies should follow best practices.
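A minimal sketch of the kind of decisions a `kubernetes.sh` like this makes. The manifest filenames and the first-deploy/storage flags are assumptions; `kubectl` is replaced by `echo` so the sketch is a dry run.

```shell
#!/bin/sh
# Sketch of kubernetes.sh: choose which manifests to apply.
# Filenames and flags are illustrative assumptions.
KUBECTL="echo kubectl"   # echo instead of executing: dry-run sketch

deploy() {
    ns="$1"            # target namespace / project
    first_deploy="$2"  # "yes" on the very first rollout
    need_storage="$3"  # not every service needs a persistent volume

    if [ "$first_deploy" = "yes" ]; then
        $KUBECTL apply -f namespace.yaml
        $KUBECTL apply -f image-pull-secret.yaml -n "$ns"
        if [ "$need_storage" = "yes" ]; then
            $KUBECTL apply -f pv.yaml
            $KUBECTL apply -f pvc.yaml -n "$ns"
        fi
        $KUBECTL apply -f service.yaml -n "$ns"
        $KUBECTL apply -f ingress.yaml -n "$ns"
    fi
    # The deployment is applied on every run so image updates roll out.
    $KUBECTL apply -f deployment.yaml -n "$ns"
}
```

On a routine release only the deployment changes; the one-off resources are created once and left alone.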

Lack of Monitoring and Alerting Platform

Build a trustworthy federated monitoring platform tailored to our cluster environment, enabling simultaneous monitoring of multiple clusters and pre‑failure alerts.

Monitoring & Alerting Diagram

The solution combines Prometheus, shell or Go scripts, and Sentry. Alerts are delivered via WeChat Work or corporate email. The three colored lines in the diagram represent the three monitoring methods. Scripts handle backup alerts, certificate-expiry alerts, and intrusion detection. Prometheus is deployed from a custom resource list based on the Prometheus Operator, with data stored on NAS. Sentry, an error-tracking platform, covers business-level error monitoring.
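The script-based alerts can be as simple as a cron job. A sketch of a certificate-expiry check: the WeChat Work webhook URL is a placeholder, the curl call is only printed rather than executed, and in production the not-after date would come from `openssl x509 -enddate -noout` on the served certificate.

```shell
#!/bin/sh
# Sketch of a certificate-expiry alert, run from cron.
# Webhook URL is a placeholder (assumption); curl is printed, not run.
WEBHOOK="https://qyapi.weixin.qq.com/cgi-bin/webhook/send?key=PLACEHOLDER"

# Days from now until the given date string (GNU date).
days_left() {
    end=$(date -d "$1" +%s)
    now=$(date +%s)
    echo $(( (end - now) / 86400 ))
}

# Print an alert command when a cert is inside the warning window.
alert_if_expiring() {
    domain="$1"; not_after="$2"; window="${3:-14}"
    left=$(days_left "$not_after")
    if [ "$left" -le "$window" ]; then
        echo "curl -s -H 'Content-Type: application/json'" \
             "-d '{\"msgtype\":\"text\",\"text\":{\"content\":\"cert for ${domain} expires in ${left} days\"}}'" \
             "$WEBHOOK"
    fi
}
```

The backup and intrusion-detection scripts follow the same pattern: a check function plus a webhook notification.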

We adopt a federated monitoring approach instead of deploying separate platforms per cluster, providing a unified visual interface. Monitoring is implemented at three levels: operating system, application, and business. Traffic monitoring targets Kong directly.
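Federation works by having a central Prometheus scrape each member cluster's `/federate` endpoint. A helper that renders one such scrape job; the cluster label and `match[]` selectors are illustrative assumptions.

```shell
#!/bin/sh
# Render a federation scrape job for one member cluster.
# Cluster label and match[] selectors are illustrative assumptions.
render_federation_job() {
    cluster="$1"; addr="$2"
    cat <<EOF
- job_name: 'federate-${cluster}'
  honor_labels: true
  metrics_path: '/federate'
  params:
    'match[]':
      - '{job="kubernetes-nodes"}'
      - '{job="kubernetes-pods"}'
  static_configs:
    - targets: ['${addr}']
      labels:
        cluster: '${cluster}'
EOF
}
```

One rendered job per member cluster is appended to the central Prometheus scrape_configs; `honor_labels: true` keeps the member clusters' own labels intact, and the added `cluster` label distinguishes them in the unified view.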

Federated Monitoring Architecture

Missing Logging System

As the business fully migrates to Kubernetes, a robust logging system becomes essential because pod restarts make log retrieval difficult.

Logging System Diagram

After Kubernetes adoption, logs are fragmented across pod lifecycles. Options for long‑term storage include remote storage or host‑mounted logs. For visualization and analysis, we choose Elasticsearch to build a centralized log collection system.
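Once logs land in Elasticsearch, retrieving a crashed pod's output becomes a query instead of a race against the pod lifecycle. A sketch that builds such a query: the index pattern and the `kubernetes.pod.name` field are assumptions tied to a Beats-style shipper, and the curl command is printed rather than executed.

```shell
#!/bin/sh
# Build an Elasticsearch query for one pod's logs.
# Index pattern and field name are assumptions tied to the shipper.
ES_URL="http://elasticsearch:9200"   # placeholder address

logs_query_for_pod() {
    pod="$1"
    cat <<EOF
curl -s -H 'Content-Type: application/json' \\
  '${ES_URL}/filebeat-*/_search' -d '{
  "query": { "term": { "kubernetes.pod.name": "${pod}" } },
  "sort": [ { "@timestamp": "asc" } ],
  "size": 100
}'
EOF
}
```

Your shipper's field mapping may differ; the point is that log retrieval no longer depends on the pod still existing.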

Severe Lack of Operational Documentation

Establish a documentation hub centered on Yuque for all operational procedures, scripts, and issue records, ensuring easy access while respecting security constraints.

Documentation must capture core steps despite security restrictions; concise yet comprehensive records are vital for both operations and development.

Unclear Request Routing

Redesign traffic routing across clusters to provide integrated authentication, authorization, proxy, connection, protection, control, and observability, thereby limiting fault propagation.

Request Routing Diagram

External requests pass through a Kong gateway for authentication, then enter specific namespaces (project isolation). Microservices communicate via Istio for mutual TLS, interact with databases, access persistent volumes, or invoke conversion services as needed before responding.
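Mutual TLS between the microservices can be enforced per namespace with an Istio PeerAuthentication resource; a small helper rendering one, with the namespace name as input. STRICT mode rejects plaintext traffic inside the mesh, which reinforces the per-namespace project isolation.

```shell
#!/bin/sh
# Render a namespace-wide strict-mTLS policy (Istio PeerAuthentication).
render_mtls_policy() {
    ns="$1"
    cat <<EOF
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: ${ns}
spec:
  mtls:
    mode: STRICT
EOF
}
```

Usage: `render_mtls_policy project-a | kubectl apply -f -` applies the policy to one project namespace.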

Conclusion

In summary, building a Kubernetes‑centric CI/CD pipeline, a Prometheus‑based federated monitoring and alerting platform, an Elasticsearch‑driven logging system, a Yuque‑based documentation center, and an integrated Kong‑Istio traffic management layer can ensure high availability and reliability for our clusters.

Overall architecture diagram:

Source: cnblogs.com/zisefeizhu/p/13692782.html
Written by Open Source Linux: focused on sharing Linux/Unix content, covering fundamentals, system development, network programming, automation and operations, cloud computing, and related professional knowledge.
