
Improving Cluster Stability: CI/CD, Monitoring, Logging, Documentation, and Traffic Management Solutions

The article analyzes the instability of a company's Kubernetes clusters, identifies root causes such as unstable release processes, lack of monitoring, logging, and documentation, and proposes comprehensive solutions including a Kubernetes‑centric CI/CD pipeline, federated Prometheus monitoring, Elasticsearch logging, centralized documentation, and integrated traffic management with Kong and Istio.

Sohu Tech Products

Preface

Our company's clusters are constantly on the brink of collapse. Over the past three months we identified five main reasons for instability: unstable release process, lack of a monitoring platform (the most critical), missing logging system, severe shortage of operational documentation, and unclear request routing.

Solution Overview

Unstable Release Process

Refactor the release workflow by fully containerizing services and building a Kubernetes‑centric CI/CD pipeline.

Release Process

The workflow is as follows: developers commit code to their development branch, keeping it up to date with the target branch. Merging into the target environment branch triggers an enterprise-WeChat notification and spins up a GitLab Runner pod in the Kubernetes cluster. The runner executes the CI/CD steps: running test cases, building the image, and updating pods. On first deployment it may also create namespaces, image-pull secrets, persistent volumes, deployments, services, and ingress resources. Images are pushed to and pulled from an Alibaba Cloud registry over VPC, avoiding public-network latency. When the job finishes, the runner pod is destroyed and GitLab reports the result.
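A minimal .gitlab-ci.yml skeleton for this flow might look like the following; the stage layout, script names, and the $ACR_REGISTRY variable are assumptions for illustration, not the company's actual pipeline:

```yaml
# Sketch of a three-stage pipeline run by the in-cluster GitLab Runner.
# $ACR_REGISTRY is a placeholder for the Alibaba Cloud registry address.
stages:
  - test
  - build
  - deploy

test:
  stage: test
  script:
    - ./run_tests.sh        # placeholder for the project's test cases

build-image:
  stage: build
  script:
    - docker build -t "$ACR_REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA" .
    - docker push "$ACR_REGISTRY/$CI_PROJECT_NAME:$CI_COMMIT_SHORT_SHA"

deploy:
  stage: deploy
  script:
    # On first deployment this can also create the namespace, image-pull
    # secret, PV, Deployment, Service, and Ingress, as described above.
    - ./kubernetes.sh
```

$CI_PROJECT_NAME and $CI_COMMIT_SHORT_SHA are standard GitLab CI predefined variables; everything else here is a hypothetical stand-in.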

Note: the resource manifests do not include ConfigMaps or Secrets for security reasons.

We use Rancher as our multi-cluster management platform; security configuration is handled by the operations team in the Rancher dashboard.

Service Deployment Logic Diagram

The diagram shows Kong replacing Nginx for authentication, authorization, and proxying, with the SLB IP bound to Kong. Jobs 0–2 are test jobs, job 3 is the build job, and jobs 4–7 form the pod-change stage. Not all services require storage; that decision is made in kubernetes.sh. A unified CI template is recommended across all environments, with branch strategies as described in the referenced blog post.
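The per-service storage decision could be sketched roughly as follows; the function name and per-service flag are hypothetical, not the real contents of kubernetes.sh:

```shell
#!/bin/sh
# Hypothetical sketch of the storage decision in kubernetes.sh: whether a
# service gets a PersistentVolumeClaim is decided per service before the
# Deployment is applied. Names and flags are illustrative only.

apply_storage_if_needed() {
  service="$1"
  needs_storage="$2"   # "true" or "false", set per service
  if [ "$needs_storage" = "true" ]; then
    echo "kubectl apply -f pvc-${service}.yaml"
  else
    echo "skip storage for ${service}"
  fi
}

apply_storage_if_needed api true      # a service that needs a volume
apply_storage_if_needed worker false  # a stateless service
```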

Lack of Monitoring and Alerting Platform

Build a reliable federated monitoring platform tailored to our cluster environment, enabling simultaneous monitoring of multiple clusters and proactive fault alerts.

Monitoring and Alerting Logic Diagram

The solution combines Prometheus, custom shell or Go scripts, and Sentry. Alerts are sent via enterprise WeChat or email. Three colored lines in the diagram represent three monitoring methods. Scripts handle backup alerts, certificate alerts, and security checks. Prometheus is deployed using a custom operator with data stored on NAS. Sentry, while primarily a log collection tool, is treated as a monitoring component for business‑level error aggregation.
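As an illustration of the script-based checks, a certificate-expiry alert might be sketched like this; the article only says such scripts exist, so the function names, threshold, and webhook call are assumptions (GNU date and openssl assumed available):

```shell
#!/bin/sh
# Hypothetical certificate-expiry check of the kind described above.

# Whole days between two unix timestamps ($1 = from, $2 = to).
days_between() {
  echo $(( ($2 - $1) / 86400 ))
}

# Days until the TLS certificate served on $1:443 expires (GNU date).
cert_days_left() {
  expiry=$(echo | openssl s_client -servername "$1" -connect "$1:443" 2>/dev/null \
    | openssl x509 -noout -enddate | cut -d= -f2)
  days_between "$(date +%s)" "$(date -d "$expiry" +%s)"
}

# Usage sketch: alert via a placeholder enterprise-WeChat webhook when
# fewer than 30 days remain.
# [ "$(cert_days_left example.com)" -lt 30 ] \
#   && curl -s "$WECHAT_WEBHOOK" -d '{"text":"certificate expiring soon"}'
```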

We employ a federated monitoring approach rather than deploying separate platforms per cluster.

Federated Monitoring Platform Logic Diagram

Because we have multiple Kubernetes clusters, a single federated monitoring system with a unified UI simplifies management. The platform provides three monitoring levels: operating system, application, and business. Traffic monitoring targets Kong using Grafana dashboard template 7424.
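A federation setup of this kind is typically configured on the central Prometheus with a /federate scrape job; the cluster addresses and match expression below are placeholders, not our actual configuration:

```yaml
# Central Prometheus: pull aggregated series from per-cluster instances.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 30s
    honor_labels: true           # keep the originating cluster's labels
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'          # placeholder: federate all jobs
    static_configs:
      - targets:
          - 'prometheus.cluster-a.example:9090'
          - 'prometheus.cluster-b.example:9090'
```

In practice the match[] selector would be narrowed to the series the unified UI actually needs, since federating everything is expensive.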

Missing Logging System

As the business fully migrates to Kubernetes, the need for a robust logging system grows. Kubernetes makes it difficult to retrieve logs from terminated pods.

Logging System Logic Diagram

We propose using Elasticsearch to collect and store logs, enabling long‑term retention, visualization, and analysis. Various methods (remote storage, host‑mounted logs) can be employed, but Elasticsearch offers the best balance of searchability and scalability.
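One common way to ship container logs into Elasticsearch is a node-level shipper such as Filebeat running as a DaemonSet; this is one option among the methods mentioned above, and the Elasticsearch address is a placeholder:

```yaml
# Minimal Filebeat sketch: tail container logs on each node and send
# them to an in-cluster Elasticsearch service (placeholder address).
filebeat.inputs:
  - type: container
    paths:
      - /var/log/containers/*.log

output.elasticsearch:
  hosts: ["elasticsearch.logging.svc:9200"]
```

Because logs leave the node as soon as they are written, they remain searchable in Elasticsearch even after the originating pod is terminated.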

Severe Lack of Operational Documentation

Establish a documentation hub centered on Yuque for all operational materials, scripts, and procedures, ensuring easy access while maintaining security controls.

Documentation should be concise yet contain core steps; every operation must be recorded for both personal reference and team knowledge sharing.

Unclear Request Routing

Redesign traffic routing across clusters to provide integrated authentication, authorization, proxying, connection, protection, control, and observability, thereby limiting fault propagation.

Request Routing Logic Diagram

Clients access the site via Kong gateway, which authenticates and forwards requests to the appropriate namespace. Microservices communicate through Istio for mutual TLS, while database and storage interactions are routed to the respective resources.
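Mesh-wide mutual TLS between microservices can be enforced in Istio with a PeerAuthentication policy; this is a standard Istio resource, shown here as an illustrative sketch rather than our exact configuration:

```yaml
# Require mTLS for all workload-to-workload traffic in the mesh.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # mesh-wide when applied in the root namespace
spec:
  mtls:
    mode: STRICT
```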

Conclusion

By implementing a Kubernetes‑centric CI/CD release pipeline, a Prometheus‑based federated monitoring and alerting platform, an Elasticsearch logging system, a Yuque‑based documentation center, and integrated traffic management with Kong and Istio, the cluster can achieve high availability and reliability.

Overall Architecture Diagram

The diagram, though complex, can be understood by tracing colored lines that represent different modules. The proposed solution should stabilize the clusters; further enhancements may include adding Redis caching, Kafka or RQ middleware where needed.

Tags: monitoring, CI/CD, operations, Kubernetes, DevOps, logging
Written by

Sohu Tech Products

A knowledge-sharing platform for Sohu's technology products. As a leading Chinese internet brand with media, video, search, and gaming services and over 700 million users, Sohu continuously drives tech innovation and practice. We’ll share practical insights and tech news here.
