How to Stabilize a Failing Kubernetes Cluster: CI/CD, Monitoring, Logging, and Docs
This article analyzes why a company's Kubernetes clusters were constantly on the brink of failure and presents a comprehensive solution covering CI/CD pipeline reconstruction, federated monitoring with Prometheus, centralized logging via Elasticsearch, documentation centralization, and clarified request routing to achieve high reliability.
Background
The company's clusters were repeatedly near collapse; a three‑month investigation identified five root causes: an unstable release process, lack of a monitoring platform (the most critical), missing logging system, severe shortage of operational documentation, and unclear request routing.
Solution Overview
1. Unstable Release Process
The release workflow is rebuilt around Kubernetes, with a CI/CD pipeline driven by GitLab-Runner pods. Developers commit to the developer branch and then merge into an environment-specific branch. The merge sends a WeChat notification and starts the CI/CD jobs, which run the test cases, build the container image, and update the pods.
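The flow described above can be sketched as a small planning function. This is a minimal illustration, not the company's pipeline definition: the branch names, the registry host `registry.example.com`, and the `send-wechat` and `run-tests` commands are all hypothetical placeholders; only `docker build`, `docker push`, and `kubectl set image` are real commands.

```python
from dataclasses import dataclass

# Hypothetical mapping from environment-specific branches to namespaces.
BRANCH_TO_ENV = {
    "test": "test",
    "staging": "staging",
    "master": "production",
}


@dataclass(frozen=True)
class Job:
    stage: str
    command: str


def plan_pipeline(branch: str, app: str, commit: str) -> list[Job]:
    """Return the ordered jobs for a merge into `branch`.

    Branches that no environment tracks (for example the developer
    branch) produce an empty plan: committing there does not deploy.
    """
    env = BRANCH_TO_ENV.get(branch)
    if env is None:
        return []
    # Tag the image with a short commit id so every release is traceable.
    image = f"registry.example.com/{app}:{commit[:8]}"
    return [
        Job("notify", f"send-wechat 'deploying {app} to {env}'"),  # placeholder
        Job("test", "run-tests"),                                   # placeholder
        Job("build", f"docker build -t {image} . && docker push {image}"),
        Job("deploy", f"kubectl -n {env} set image deployment/{app} {app}={image}"),
    ]


if __name__ == "__main__":
    for job in plan_pipeline("staging", "orders", "0123456789abcdef"):
        print(f"{job.stage:>7}: {job.command}")
```

Keeping the stage order explicit (notify, test, build, deploy) means a failed test stops the run before any image is built or any pod is touched.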
First‑time deployment steps include creating a namespace, image‑pull secret, PV (via StorageClass), Deployment, Service, and Ingress. Images are pushed to an Alibaba Cloud registry accessed via VPC, avoiding public bandwidth limits. Security‑sensitive resources such as ConfigMaps or Secrets are omitted from the manifest and managed through Rancher’s dashboard.
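As a rough sketch of the first-time deployment, the resources listed above can be generated as plain dictionaries and serialized to JSON, which `kubectl apply` accepts. The application name, port numbers, volume size, the StorageClass name `nas`, and the pull-secret name `registry-pull` are assumptions made for illustration. In line with the text, no Secret or ConfigMap is emitted: the image-pull secret is only referenced by name and is assumed to exist already.

```python
import json


def first_deploy_manifests(app: str, namespace: str, image: str, host: str) -> list[dict]:
    """Build the first-deployment resources, in the order they should be created."""
    labels = {"app": app}
    meta = {"name": app, "namespace": namespace}
    return [
        {"apiVersion": "v1", "kind": "Namespace", "metadata": {"name": namespace}},
        {   # The PV itself is provisioned dynamically by the StorageClass.
            "apiVersion": "v1", "kind": "PersistentVolumeClaim",
            "metadata": {"name": f"{app}-data", "namespace": namespace},
            "spec": {
                "storageClassName": "nas",  # hypothetical StorageClass name
                "accessModes": ["ReadWriteMany"],
                "resources": {"requests": {"storage": "10Gi"}},
            },
        },
        {
            "apiVersion": "apps/v1", "kind": "Deployment", "metadata": meta,
            "spec": {
                "selector": {"matchLabels": labels},
                "template": {
                    "metadata": {"labels": labels},
                    "spec": {
                        # Referenced by name only; credentials stay out of the repo.
                        "imagePullSecrets": [{"name": "registry-pull"}],
                        "containers": [{
                            "name": app, "image": image,
                            "ports": [{"containerPort": 8080}],
                            "volumeMounts": [{"name": "data", "mountPath": "/data"}],
                        }],
                        "volumes": [{
                            "name": "data",
                            "persistentVolumeClaim": {"claimName": f"{app}-data"},
                        }],
                    },
                },
            },
        },
        {
            "apiVersion": "v1", "kind": "Service", "metadata": meta,
            "spec": {"selector": labels, "ports": [{"port": 80, "targetPort": 8080}]},
        },
        {
            "apiVersion": "networking.k8s.io/v1", "kind": "Ingress", "metadata": meta,
            "spec": {"rules": [{"host": host, "http": {"paths": [{
                "path": "/", "pathType": "Prefix",
                "backend": {"service": {"name": app, "port": {"number": 80}}},
            }]}}]},
        },
    ]


def to_kubectl_json(manifests: list[dict]) -> str:
    """Wrap the resources in a v1 List so they can be applied in one call."""
    return json.dumps({"apiVersion": "v1", "kind": "List", "items": manifests}, indent=2)


if __name__ == "__main__":
    print(to_kubectl_json(first_deploy_manifests(
        "orders", "shop", "registry.example.com/orders:1", "orders.example.com")))
```

The output can be piped to `kubectl apply -f -`. Later releases only need to change the image, which is why the CI/CD job can update pods without re-running these steps.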
2. Lack of Monitoring & Alert Platform
A federated monitoring solution is built using Prometheus, custom shell/Go scripts, and Sentry for alerting via WeChat or email. The architecture separates three monitoring layers—operating‑system, application, and business—and consolidates alerts across multiple clusters into a single visual interface.
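To make the layering concrete, here is a minimal sketch of how alerts from several clusters could be folded into one view grouped by layer. The `Alert` fields, the cluster names, and the rule that sends critical alerts to WeChat and everything else to email are assumptions for illustration; the article only states that both channels are used.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three monitoring layers named in the text.
LAYERS = ("os", "application", "business")


@dataclass(frozen=True)
class Alert:
    cluster: str
    layer: str
    name: str
    severity: str  # "warning" or "critical" (assumed scale)


def consolidate(alerts: list[Alert]) -> dict[str, dict[str, list[Alert]]]:
    """Group alerts by layer, then by cluster, dropping exact duplicates."""
    view = {layer: defaultdict(list) for layer in LAYERS}
    for alert in sorted(set(alerts), key=lambda a: (a.cluster, a.name)):
        if alert.layer not in view:
            raise ValueError(f"unknown monitoring layer: {alert.layer!r}")
        view[alert.layer][alert.cluster].append(alert)
    return {layer: dict(by_cluster) for layer, by_cluster in view.items()}


def channel_for(alert: Alert) -> str:
    """Hypothetical routing rule: page on critical, mail the rest."""
    return "wechat" if alert.severity == "critical" else "email"


if __name__ == "__main__":
    incoming = [
        Alert("prod-a", "os", "DiskFull", "critical"),
        Alert("prod-a", "os", "DiskFull", "critical"),  # duplicate delivery
        Alert("prod-b", "os", "DiskFull", "warning"),
        Alert("prod-a", "business", "OrderFailures", "critical"),
    ]
    for layer, by_cluster in consolidate(incoming).items():
        for cluster, items in by_cluster.items():
            for a in items:
                print(layer, cluster, a.name, "->", channel_for(a))
```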
Prometheus is deployed via a customized Prometheus-Operator manifest, with its data stored on NAS. Sentry, although primarily an error-tracking tool rather than a metrics system, is used here for business-logic monitoring: it captures application crashes and exceptions.
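Assuming "federated" refers to Prometheus's built-in federation, in which a central server scrapes the `/federate` endpoint of each per-cluster server, the central scrape job could look roughly like the sketch below. The target hostnames and the selectors are hypothetical; `honor_labels`, `metrics_path`, `params`, and `static_configs` are standard Prometheus scrape settings. The result is printed as JSON, which is also valid YAML.

```python
import json
from urllib.parse import urlencode


def federation_scrape_config(targets: list[str], selectors: list[str]) -> dict:
    """Scrape job for a central Prometheus that pulls from per-cluster servers."""
    return {
        "job_name": "federate",
        "honor_labels": True,  # keep the labels set by the source cluster
        "metrics_path": "/federate",
        "params": {"match[]": selectors},
        "static_configs": [{"targets": targets}],
    }


def federate_url(base_url: str, selectors: list[str]) -> str:
    """Build the URL the central server would request, for manual checks."""
    query = urlencode([("match[]", s) for s in selectors])
    return f"{base_url.rstrip('/')}/federate?{query}"


if __name__ == "__main__":
    selectors = ['{job="node"}', '{__name__=~"job:.*"}']
    targets = ["prom-cluster-a.example:9090", "prom-cluster-b.example:9090"]
    print(json.dumps({"scrape_configs": [federation_scrape_config(targets, selectors)]}, indent=2))
    print(federate_url("http://prom-cluster-a.example:9090", selectors))
```

Restricting `match[]` to a few selectors keeps the central server small: it holds only the series needed for the cross-cluster view, while full-resolution data stays in each cluster.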
3. Missing Logging System
To make logs searchable and filterable, an Elasticsearch-based logging stack is introduced. It addresses a specific problem: when a pod restarts, its logs start over as a new stream and the earlier output is no longer visible, so the history needed to diagnose the restart is lost.
Options such as remote storage or host‑mounted logs were considered; Elasticsearch was chosen for its visualization and analysis capabilities.
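One way to see why a central index solves the restart problem is a sketch of the records being shipped. Each log line carries the namespace and application label alongside the pod name, so a query by application returns output from every pod instance, including ones that no longer exist. The field names and the index name `app-logs` are assumptions; the newline-delimited layout (an action line followed by a document line) is the format the Elasticsearch `_bulk` API expects.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def log_record(namespace: str, app: str, pod: str, message: str,
               ts: Optional[datetime] = None) -> dict:
    """Build one log document; `app` stays stable across pod restarts."""
    ts = ts or datetime.now(timezone.utc)
    return {
        "@timestamp": ts.isoformat(),
        "kubernetes": {"namespace": namespace, "app": app, "pod": pod},
        "message": message,
    }


def to_bulk_ndjson(index: str, records: list[dict]) -> str:
    """Serialize records as a _bulk request body (newline-delimited JSON)."""
    lines = []
    for record in records:
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps(record, sort_keys=True))
    # The bulk body must end with a newline; an empty batch yields no body.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    records = [
        log_record("shop", "orders", "orders-7d9f-abcde", "out of memory, exiting"),
        log_record("shop", "orders", "orders-7d9f-fghij", "started"),  # replacement pod
    ]
    print(to_bulk_ndjson("app-logs", records), end="")
```

The body would be sent with `POST /_bulk` and `Content-Type: application/x-ndjson`. In practice a log shipper running on each node usually does this work; the sketch only shows the shape of the data.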
4. Severe Lack of Operational Documentation
A documentation hub centered on Yuque is established to record operational procedures, scripts, and issue resolutions, ensuring that critical knowledge is preserved and easily searchable.
5. Unclear Request Routing
The traffic flow is redesigned so that authentication, authorization, proxying, connectivity, protection, traffic control, and observability are handled in one consistent path rather than piecemeal. Kong replaces Nginx as the API gateway, while Istio handles service-to-service security inside the cluster.
Requests enter through Kong and are routed to the appropriate namespace. From there they pass through Istio, which authenticates service-to-service calls, before reaching databases, PVs, or conversion services as needed.
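The request path can be modeled in a few lines to show where each check sits. This is a toy model, not Kong or Istio configuration: the route table, the longest-prefix matching rule, and the two boolean flags standing in for gateway authentication and mesh-level service authentication are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    path_prefix: str
    namespace: str
    service: str


# Hypothetical route table held by the gateway.
ROUTES = [
    Route("/api", "platform", "api-gateway"),
    Route("/api/orders", "shop", "orders"),
]


def resolve(path: str, routes: list[Route]) -> Optional[Route]:
    """Pick the route with the longest prefix that matches on a path boundary."""
    matches = [
        r for r in routes
        if path == r.path_prefix or path.startswith(r.path_prefix.rstrip("/") + "/")
    ]
    return max(matches, key=lambda r: len(r.path_prefix), default=None)


def trace(path: str, routes: list[Route], authenticated: bool, mesh_auth_ok: bool) -> list[str]:
    """Return the hops a request takes, stopping at the first rejection."""
    hops = ["kong"]
    if not authenticated:
        return hops + ["rejected:unauthenticated"]
    route = resolve(path, routes)
    if route is None:
        return hops + ["rejected:no-route"]
    hops.append(f"namespace:{route.namespace}")
    hops.append("istio")
    if not mesh_auth_ok:
        return hops + ["rejected:service-auth"]
    return hops + [f"service:{route.service}"]


if __name__ == "__main__":
    print(trace("/api/orders/42", ROUTES, authenticated=True, mesh_auth_ok=True))
    print(trace("/api/orders/42", ROUTES, authenticated=False, mesh_auth_ok=True))
```

Separating the two checks mirrors the division of labor in the text: the gateway decides whether an external caller may enter at all, and the mesh decides whether one internal service may call another.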
Conclusion
By constructing a Kubernetes‑centric CI/CD pipeline, a Prometheus‑based federated monitoring platform, an Elasticsearch logging system, a Yuque documentation center, and a Kong‑Istio integrated traffic management layer, the company can achieve high availability and stability for its clusters. The overall architecture diagram illustrates the modular composition of these components.
dbaplus Community