Kubernetes Architecture, Multi‑Cluster Management, and Application Lifecycle Practices at Yiche
This article describes Yiche's adoption of Kubernetes, detailing its master‑node architecture, multi‑cluster strategies, custom container management platform, CI/CD pipeline, monitoring with Prometheus, logging, auditing, health checks, and future plans to streamline traffic and control planes.
Kubernetes Features
Container orchestration speeds up application delivery, keeps deployments lightweight, and enables elastic scaling. Kubernetes, built on Google’s experience running containers, is an open‑source platform ready for production use. Yiche combines its own business characteristics with Kubernetes to find a containerization path that suits it.
1.1 Kubernetes Architecture
Master components (master node):
kube‑apiserver – the entry point of the control plane; all other components communicate through it, and it is the component that reads from and writes to etcd.
kube‑controller‑manager – runs various controllers to drive the cluster toward the desired state.
kube‑scheduler – filters and scores nodes, then assigns each Pod to a suitable node.
etcd – distributed key‑value store that holds all cluster data.
Node components (worker node):
kube‑proxy – implements Service access and load balancing to Pods, here using kernel IPVS.
kubelet – manages the lifecycle of Pods on the node.
Docker – the container runtime that runs the containers.
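The scheduler's two phases (filter out nodes that cannot fit the Pod, then score the rest) can be illustrated with a toy sketch. This is not kube‑scheduler's actual algorithm: the `Node` type, its fields, and the "most resources left over" scoring policy are assumptions chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu_millis: int  # unallocated CPU in millicores
    free_mem_mib: int     # unallocated memory in MiB


def pick_node(nodes: List[Node], cpu_millis: int, mem_mib: int) -> Optional[str]:
    """Return the name of the chosen node for a Pod request, or None if none fits."""
    # Filter phase: keep only nodes with enough free CPU and memory.
    feasible = [
        n for n in nodes
        if n.free_cpu_millis >= cpu_millis and n.free_mem_mib >= mem_mib
    ]
    if not feasible:
        return None

    # Score phase (hypothetical policy): prefer the node with the most CPU
    # left after placement, breaking ties by leftover memory.
    def score(n: Node):
        return (n.free_cpu_millis - cpu_millis, n.free_mem_mib - mem_mib)

    return max(feasible, key=score).name
```

The real scheduler combines many filter and score plugins (affinity, taints, topology spread, and so on); the sketch only shows the overall shape of the decision.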
Multi‑Cluster Management
2.1 Considerations
• Avoid a single point of failure ("do not put all eggs in one basket").
• Hybrid architecture: public cloud (Tencent Cloud TKE) + self‑built data‑center cloud.
2.1.1 Egg‑basket principle
Kubernetes is designed and tested for up to 110 Pods per node and 5,000 nodes per cluster; these are recommended limits rather than hard caps. Yiche is still well below them, but the impact of a single cluster failure grows with node count, so a multi‑cluster design is used to isolate faults.
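A rough way to see the fault‑isolation argument is to compute how much capacity disappears when one cluster fails. The sketch below assumes capacity is proportional to node count and that a cluster fails as a whole; the function name and the node counts in the usage note are hypothetical, not Yiche's figures.

```python
from typing import List


def capacity_lost_fraction(cluster_nodes: List[int], failed_index: int) -> float:
    """Fraction of total node capacity lost if one whole cluster fails.

    cluster_nodes holds the node count of each cluster (illustrative values).
    """
    total = sum(cluster_nodes)
    if total <= 0:
        raise ValueError("total node count must be positive")
    return cluster_nodes[failed_index] / total
```

With a single 300‑node cluster, `capacity_lost_fraction([300], 0)` is 1.0: one failure takes out everything. Splitting the same nodes into three equal clusters, `capacity_lost_fraction([100, 100, 100], 0)`, reduces the loss to one third.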
2.1.2 On‑cloud and off‑cloud deployment
Yiche runs a self‑built Kubernetes cluster alongside Tencent Cloud TKE. During traffic spikes, the public‑cloud side provides rapid elastic scaling, expanding capacity several‑fold.
Each environment hosts a business cluster with three master nodes; users can choose the deployment target, and cross‑cluster migration is supported.
Each cluster exposes a VIP that connects to the company‑wide Layer‑7 load balancer; users select the cluster VIP as the upstream.
2.2 Multi‑Cluster Network Mode
Yiche initially used Calico for container networking, but service discovery and direct Pod‑to‑Pod communication caused problems during migration. Kube‑OVN was then adopted to expose Pod IPs, which makes them routable across clusters without extra gateways or proxies.
2.3 Container Management Platform
Manual kubectl context switching is error‑prone. Yiche uses a self‑developed container management platform together with Rancher to centralize cluster state, avoid context mistakes, and provide a unified UI.
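One way a platform can avoid context mistakes is to make every operation name its target cluster explicitly instead of relying on an implicit "current context". The sketch below is a minimal illustration of that idea, not Yiche's platform or Rancher's API; `Cluster`, `ClusterRegistry`, and the endpoint in the usage note are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Cluster:
    name: str
    api_server: str  # hypothetical API endpoint for this cluster


class ClusterRegistry:
    """Central lookup of clusters; there is deliberately no 'current' cluster."""

    def __init__(self) -> None:
        self._clusters: Dict[str, Cluster] = {}

    def register(self, cluster: Cluster) -> None:
        if cluster.name in self._clusters:
            raise ValueError(f"cluster {cluster.name!r} is already registered")
        self._clusters[cluster.name] = cluster

    def resolve(self, name: str) -> Cluster:
        # Fail loudly on unknown names rather than falling back to a default.
        try:
            return self._clusters[name]
        except KeyError:
            known = sorted(self._clusters)
            raise KeyError(f"unknown cluster {name!r}; known: {known}") from None
```

A caller would register, say, `Cluster("idc-prod", "https://idc.example.internal:6443")` once and then pass the cluster name with every request, so a typo raises an error instead of silently acting on the wrong cluster.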
Custom features include container‑IP lookup, label management for nodes and applications, and automated node onboarding.
Application Lifecycle Management
3.1 CI/CD Pipeline
Initially developers wrote raw YAML files, which became cumbersome as the number of applications grew. Yiche built a CI/CD pipeline using the Kubernetes client‑go SDK for the CD stage, enabling centralized lifecycle management.
Traefik is deployed as a DaemonSet to serve as the Layer‑7 entry point; its IPs are integrated into the company load balancer VIP, allowing seamless scaling without manual upstream updates.
3.2 Application Configuration
Instead of hand‑written YAML, applications are defined as workloads in the platform UI (see Fig. 3) and released to each cluster via the pipeline.
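Replacing hand‑written YAML usually means rendering manifests from a small set of workload fields. The sketch below shows that rendering step for an `apps/v1` Deployment. It is an illustration in Python only: Yiche's CD stage uses client‑go, and the function name, parameters, and `app` label convention here are assumptions.

```python
from typing import Any, Dict


def render_deployment(name: str, image: str, replicas: int, port: int,
                      namespace: str = "default") -> Dict[str, Any]:
    """Build an apps/v1 Deployment manifest from a few workload fields."""
    if replicas < 0:
        raise ValueError("replicas must not be negative")
    labels = {"app": name}  # assumed labeling convention
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "namespace": namespace, "labels": dict(labels)},
        "spec": {
            "replicas": replicas,
            # The selector must match the Pod template labels.
            "selector": {"matchLabels": dict(labels)},
            "template": {
                "metadata": {"labels": dict(labels)},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": port}],
                    }],
                },
            },
        },
    }
```

The resulting dictionary can be serialized with `json.dumps` and submitted to each target cluster's API server by whatever client the pipeline uses.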
After deployment, instance management shows runtime status (Fig. 4).
3.3 Alerting and Log Collection
3.3.1 Alerting
Prometheus and the Prometheus Operator are used to collect Kubernetes metrics. Key CRDs include:
ServiceMonitor – selects Service endpoints via labels for metric scraping.
PrometheusRule – defines alerting rules and recording (aggregation) rules.
Prometheus – declares a Prometheus deployment, which the Operator runs as a StatefulSet; it pushes data to a central monitoring system.
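To make the ServiceMonitor description concrete, the sketch below builds a minimal ServiceMonitor manifest as a Python dictionary. The `apiVersion`, `kind`, and `spec` fields follow the Prometheus Operator CRD; the helper function, its parameters, and the example values in the usage note are hypothetical.

```python
from typing import Any, Dict


def service_monitor(name: str, namespace: str, match_labels: Dict[str, str],
                    port_name: str, interval: str = "30s",
                    path: str = "/metrics") -> Dict[str, Any]:
    """Build a ServiceMonitor that scrapes Services selected by label."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Services carrying these labels are selected for scraping.
            "selector": {"matchLabels": dict(match_labels)},
            # 'port' refers to the *name* of a port on the selected Service.
            "endpoints": [{"port": port_name, "interval": interval, "path": path}],
        },
    }
```

For example, `service_monitor("web", "monitoring", {"app": "web"}, "metrics")` selects Services labeled `app=web` and scrapes their port named `metrics` every 30 seconds.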
Each cluster runs its own Prometheus instance. When the number of time series exceeds roughly 3 million, memory usage spikes, so Yiche drops unimportant series and shortens retention periods.
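Before trimming series, it helps to know which metric names contribute the most of them. The sketch below counts series per metric name from a list of label sets, using Prometheus's `__name__` label. How the label sets are obtained is out of scope, and the function name is an assumption; this is not a description of Yiche's tooling.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def top_cardinality(series: Iterable[Dict[str, str]],
                    limit: int) -> List[Tuple[str, int]]:
    """Return the `limit` metric names with the most series, largest first.

    Each element of `series` is one time series' label set; the metric name
    is stored under the `__name__` label.
    """
    counts = Counter(labels.get("__name__", "<unnamed>") for labels in series)
    return counts.most_common(limit)
```

Metric names at the top of this list are the first candidates for dropping or for removing high‑cardinality labels.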
3.3.2 Log Collection
The platform provides “kubectl exec” and “kubectl logs”‑like interfaces. An ARK log system, deployed as a DaemonSet agent on each node, collects container logs according to rules from a configuration center and forwards them to a central log service for storage, visualization, analysis, filtering, and alerting.
Auditing, Event Recording, and Health Checks
4.1 Auditing
User actions inside containers are recorded for audit purposes (Fig. 8).
4.2 Event Persistence and Health Checks
Because Kubernetes keeps events in etcd for only about one hour by default, Yiche stores events in Elasticsearch for long‑term analysis.
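Persisting events amounts to converting each Kubernetes Event object into a document and sending it to Elasticsearch. The sketch below builds a request body in the Elasticsearch bulk (newline‑delimited JSON) format from Event dictionaries. The selection of fields and the index name in the usage note are assumptions; how Yiche actually ships events is not described in the source.

```python
import json
from typing import Any, Dict, Iterable


def events_to_bulk(events: Iterable[Dict[str, Any]], index: str) -> str:
    """Convert Kubernetes Event objects into an Elasticsearch bulk body."""
    lines = []
    for ev in events:
        meta = ev.get("metadata", {})
        obj = ev.get("involvedObject", {})
        action: Dict[str, Any] = {"index": {"_index": index}}
        # Reusing the Event UID as the document ID makes re-sends idempotent.
        if meta.get("uid"):
            action["index"]["_id"] = meta["uid"]
        doc = {
            "namespace": meta.get("namespace"),
            "kind": obj.get("kind"),
            "name": obj.get("name"),
            "type": ev.get("type"),
            "reason": ev.get("reason"),
            "message": ev.get("message"),
            "count": ev.get("count"),
            "lastTimestamp": ev.get("lastTimestamp"),
        }
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # The bulk format requires every line, including the last, to end in "\n".
    return "".join(line + "\n" for line in lines)
```

The returned string would be sent to the `_bulk` endpoint with content type `application/x-ndjson`, for example with an index name such as `k8s-events`.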
Health‑check components report the status of resources, workloads, and configurations, providing anomaly descriptions, severity, root cause, impact, and remediation suggestions.
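The report fields listed above map naturally onto a small record type. The sketch below shows one hypothetical check (ready replicas versus desired replicas) producing such a record; the `Finding` type, the severity names, and the wording of the messages are all assumptions, not Yiche's health‑check component.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Finding:
    resource: str
    description: str
    severity: str      # assumed levels: "warning" or "critical"
    root_cause: str
    impact: str
    remediation: str


def check_replicas(name: str, desired: int, ready: int) -> Optional[Finding]:
    """Return a Finding if a workload has fewer ready replicas than desired."""
    if ready >= desired:
        return None  # healthy: nothing to report
    severity = "critical" if ready == 0 else "warning"
    return Finding(
        resource=name,
        description=f"{ready}/{desired} replicas ready",
        severity=severity,
        # A real checker would inspect Pod status and events to find the cause.
        root_cause="not determined by this sketch",
        impact="service unavailable" if ready == 0 else "reduced capacity",
        remediation="inspect Pod events and container logs for the failing replicas",
    )
```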
Future Outlook
5.1 Shortening the External‑to‑Internal Access Path
Current flow: external → company Layer‑7 load balancer → cluster VIP → Traefik → service, which adds latency and failure points. Two proposals:
• Remove the intermediate VIP; let Traefik nodes auto‑register to the upstream load balancer when scaling.
• Eliminate both VIP and Traefik, using an Ingress controller that directly registers Pod IPs to the Layer‑7 load balancer, separating data and control planes.
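Both proposals rely on keeping the load balancer's upstream list in sync with the set of live backends (Traefik nodes in the first case, Pod IPs in the second). The core of such a sync loop is a set difference, sketched below. The function name and address format are assumptions, and the calls that would actually update the load balancer are omitted because the source does not describe its API.

```python
from typing import Iterable, Set, Tuple


def diff_upstreams(registered: Iterable[str],
                   live: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Compute which backends to add to and remove from the load balancer.

    registered: addresses currently configured as upstreams.
    live: addresses of backends that are currently running and ready.
    """
    registered_set, live_set = set(registered), set(live)
    to_add = live_set - registered_set      # new backends not yet registered
    to_remove = registered_set - live_set   # stale entries for gone backends
    return to_add, to_remove
```

A controller would run this on every change notification (or periodically), apply the additions and removals, and thereby remove the need for manual upstream updates when scaling.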
5.2 Multi‑Cluster Management Model
At present, applications are dispatched to a cluster chosen by the user, and migration requires manual steps. The goal is a top‑level control plane that is responsible only for scheduling deployments across clusters. Users would interact with control‑plane objects rather than with individual clusters, which improves resilience and user experience.
Conclusion
The article outlines Yiche’s current Kubernetes deployment, challenges arising from growing cluster scale, and ongoing explorations of technologies that better serve users.
Yiche Technology
Official account of Yiche Technology, regularly sharing the team's technical practices and insights.
