
Kubernetes Architecture, Multi‑Cluster Management, and Application Lifecycle Practices at Yiche

This article describes Yiche's adoption of Kubernetes, detailing its master‑node architecture, multi‑cluster strategies, custom container management platform, CI/CD pipeline, monitoring with Prometheus, logging, auditing, health checks, and future plans to streamline traffic and control planes.

Yiche Technology

Kubernetes Features

Container orchestration accelerates application delivery, keeps deployments lightweight, and enables elastic scaling. Kubernetes, which builds on Google's experience running containers, is an open-source orchestration platform that is ready for production use. Yiche has combined Kubernetes with its own business characteristics to work out a containerization path that suits it.

1.1 Kubernetes Architecture

Master components (master node):

kube-apiserver – exposes the Kubernetes API; all other components communicate through it, and it is the only component that reads from and writes to etcd directly.

kube‑controller‑manager – runs various controllers to drive the cluster toward the desired state.

kube-scheduler – filters out unsuitable nodes, scores the rest, and assigns each Pod to the best-scoring node.

etcd – distributed key‑value store that holds all cluster data.
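
The scheduler's filter-then-score behaviour can be illustrated with a toy model. This is a minimal sketch, not the real kube-scheduler: the Node fields and the scoring formula are assumptions chosen for illustration, and the real scheduler uses a configurable set of plugins.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    free_cpu_millis: int  # CPU not yet allocated on the node
    free_mem_mib: int     # memory not yet allocated on the node


def pick_node(nodes, cpu_millis, mem_mib):
    """Return the name of the best node for a Pod, or None if none fits."""
    # Filter phase: drop nodes that cannot hold the Pod's requests.
    feasible = [n for n in nodes
                if n.free_cpu_millis >= cpu_millis and n.free_mem_mib >= mem_mib]
    if not feasible:
        return None

    # Score phase (toy formula): prefer the node with the most headroom left.
    def score(n):
        return (n.free_cpu_millis - cpu_millis) + (n.free_mem_mib - mem_mib)

    return max(feasible, key=score).name


nodes = [Node("node-a", 500, 1024), Node("node-b", 4000, 8192), Node("node-c", 2000, 2048)]
print(pick_node(nodes, cpu_millis=1000, mem_mib=2048))  # node-b
```

node-a is filtered out because it cannot satisfy the CPU request; node-b wins the scoring round because it keeps the most spare capacity.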

Node components (worker node):

kube-proxy – provides access to Pods through Services, using kernel IPVS for load balancing.

kubelet – manages the lifecycle of Pods on the node.

Docker – the container runtime; it creates, runs, and stops the containers.

Multi‑Cluster Management

2.1 Considerations

• Avoid a single point of failure ("do not put all eggs in one basket").

• Hybrid architecture: public cloud (Tencent Cloud TKE) + self‑built data‑center cloud.

2.1.1 Egg‑basket principle

Upstream guidance recommends no more than about 100 Pods per node (110 in recent releases) and supports clusters of up to 5,000 nodes. Yiche does not currently hit these limits, but as node counts grow, so does the impact of any single cluster failure; a multi-cluster design isolates faults.

2.1.2 Hybrid cloud (on-cloud and off-cloud)

Yiche runs a self‑built Kubernetes cluster alongside Tencent Cloud TKE. During traffic spikes, the cloud side provides rapid elastic scaling, expanding capacity several‑fold.

Each environment hosts a business cluster with three master nodes; users choose the deployment target, and applications can be migrated between clusters.

Each cluster exposes a VIP that connects to the company‑wide Layer‑7 load balancer; users select the cluster VIP as the upstream.
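
One way to picture burst scaling into the cloud is a fill-first policy: keep replicas on the self-built cluster until it is full, then send the remainder to TKE. The sketch below assumes such a policy purely for illustration; the article only states that users choose the target cluster, and the cluster names and capacity figure here are hypothetical.

```python
def split_replicas(desired, onprem_capacity):
    """Fill the self-built cluster first; send the remainder to the cloud cluster."""
    if desired < 0 or onprem_capacity < 0:
        raise ValueError("replica counts must be non-negative")
    onprem = min(desired, onprem_capacity)
    return {"self-built": onprem, "tke": desired - onprem}


print(split_replicas(10, onprem_capacity=30))  # normal load stays on-premises
print(split_replicas(90, onprem_capacity=30))  # a spike overflows to the cloud
```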

2.2 Multi‑Cluster Network Mode

Calico was used at first for container networking, but during migration, applications that relied on service discovery and direct Pod-to-Pod communication ran into problems. Yiche adopted Kube-OVN, which exposes Pod IPs directly, so that Pod IPs are routable across clusters without extra gateways or proxies.
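
Routing Pod IPs directly between clusters only works if the clusters' Pod address ranges do not collide. The following sketch checks for that using Python's standard ipaddress module; the cluster names and CIDRs are made up for illustration and are not Yiche's actual ranges.

```python
import ipaddress
from itertools import combinations


def overlapping_pod_cidrs(cluster_cidrs):
    """Return pairs of cluster names whose Pod CIDRs overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cluster_cidrs.items()}
    return [(a, b) for a, b in combinations(sorted(nets), 2)
            if nets[a].overlaps(nets[b])]


clusters = {
    "self-built": "10.16.0.0/16",
    "tke": "10.32.0.0/16",
    "staging": "10.16.128.0/17",  # deliberately overlaps with "self-built"
}
print(overlapping_pod_cidrs(clusters))  # [('self-built', 'staging')]
```

A check like this could run before a new cluster is onboarded, so that an address conflict is caught before any routes are advertised.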

2.3 Container Management Platform

Manual kubectl context switching is error‑prone. Yiche uses a self‑developed container management platform together with Rancher to centralize cluster state, avoid context mistakes, and provide a unified UI.

Custom features include container‑IP lookup, label management for nodes and applications, and automated node onboarding.
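
The context-switching risk comes from kubectl acting on whichever context happens to be current. A platform can avoid it by naming the target cluster on every call. The sketch below builds such commands with kubectl's --context flag; the context names are hypothetical, and this is not a description of how Yiche's platform is implemented.

```python
import shlex

KNOWN_CONTEXTS = {"self-built-prod", "tke-prod"}  # hypothetical context names


def kubectl_command(context, *args):
    """Build a kubectl argv that names its target cluster explicitly."""
    if context not in KNOWN_CONTEXTS:
        raise ValueError(f"unknown cluster context: {context!r}")
    return ["kubectl", "--context", context, *args]


cmd = kubectl_command("tke-prod", "get", "pods", "-n", "default")
print(shlex.join(cmd))  # kubectl --context tke-prod get pods -n default
```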

Application Lifecycle Management

3.1 CI/CD Pipeline

Initially developers wrote raw YAML files, which became cumbersome as the number of applications grew. Yiche built a CI/CD pipeline using the Kubernetes client‑go SDK for the CD stage, enabling centralized lifecycle management.
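
To show what replacing hand-written YAML can look like, here is a minimal sketch that renders an apps/v1 Deployment from a few parameters. Yiche's pipeline uses the client-go SDK in Go; Python is used here only to keep the example short, and the application name, image, and port are placeholders.

```python
import json


def render_deployment(name, image, replicas, port):
    """Build an apps/v1 Deployment manifest as a plain dict."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            "replicas": replicas,
            # The selector must match the Pod template labels.
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "ports": [{"containerPort": port}],
                    }],
                },
            },
        },
    }


manifest = render_deployment("demo-api", "registry.example.com/demo-api:1.0.0", 3, 8080)
print(json.dumps(manifest, indent=2))
```

Generating the manifest from one function keeps the selector and template labels consistent, which is a common source of mistakes in hand-edited files.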

Traefik is deployed as a DaemonSet to serve as the Layer‑7 entry point; its IPs are integrated into the company load balancer VIP, allowing seamless scaling without manual upstream updates.

3.2 Application Configuration

Instead of hand‑written YAML, applications are defined as workloads in the platform UI (see Fig. 3) and released to each cluster via the pipeline.

After deployment, instance management shows runtime status (Fig. 4).

3.3 Alerting and Log Collection

3.3.1 Alerting

Prometheus and the Prometheus Operator are used to collect Kubernetes metrics. Key CRDs include:

ServiceMonitor – selects Service endpoints via labels for metric scraping.

PrometheusRule – defines alerting rules and recording (aggregation) rules.

Prometheus – declares a Prometheus deployment, which the Operator runs as a StatefulSet; it pushes data to a central monitoring system.
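
As an illustration of the first CRD, the sketch below builds a ServiceMonitor object as a plain dict. The apiVersion, kind, selector, and endpoints fields follow the Prometheus Operator's schema; the names, namespace, label, and port are placeholders rather than values from Yiche's clusters.

```python
import json


def service_monitor(name, namespace, app_label, port_name, interval="30s"):
    """Build a ServiceMonitor that scrapes Services labelled app=<app_label>."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Label selector that picks the Services to scrape.
            "selector": {"matchLabels": {"app": app_label}},
            # Named Service port that serves the metrics endpoint.
            "endpoints": [{"port": port_name, "interval": interval}],
        },
    }


sm = service_monitor("demo-api", "monitoring", "demo-api", "metrics")
print(json.dumps(sm, indent=2))
```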

Each cluster runs its own Prometheus instance. Once an instance holds more than about 3 million time series, its memory usage spikes, so Yiche drops unimportant series and shortens the retention period.
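
Deciding which series to trim usually starts with counting series per metric name. The sketch below does that over an in-memory list of label sets; the sample data and the keep list are invented, and in practice the label sets would come from Prometheus itself.

```python
from collections import Counter


def series_per_metric(series):
    """Count time series per metric name; each series is a dict of labels."""
    return Counter(labels["__name__"] for labels in series)


def drop_candidates(series, keep):
    """Metrics not on the keep list, largest series count first."""
    counts = series_per_metric(series)
    return [(metric, n) for metric, n in counts.most_common() if metric not in keep]


series = [
    {"__name__": "http_requests_total", "pod": "a"},
    {"__name__": "http_requests_total", "pod": "b"},
    {"__name__": "debug_cache_bytes", "pod": "a", "shard": "0"},
    {"__name__": "debug_cache_bytes", "pod": "a", "shard": "1"},
    {"__name__": "debug_cache_bytes", "pod": "b", "shard": "0"},
    {"__name__": "up", "pod": "a"},
]
print(drop_candidates(series, keep={"http_requests_total", "up"}))
```

Metrics that multiply across an extra label (here the hypothetical "shard") tend to dominate the count and are the first candidates for dropping.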

3.3.2 Log Collection

The platform provides “kubectl exec” and “kubectl logs”‑like interfaces. An ARK log system, deployed as a DaemonSet agent on each node, collects container logs according to rules from a configuration center and forwards them to a central log service for storage, visualization, analysis, filtering, and alerting.
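
Rule-driven collection can be pictured as matching each container log file against a list of rules. The article does not describe ARK's rule schema, so the rule fields, topic names, and paths below are entirely hypothetical; the sketch only shows the matching idea.

```python
from fnmatch import fnmatch

# Hypothetical rule format; the real ARK rule schema is not described in the article.
RULES = [
    {"namespace": "prod", "app": "order-*", "path": "/var/log/app/*.log", "topic": "orders"},
    {"namespace": "prod", "app": "*", "path": "/var/log/app/error.log", "topic": "errors"},
]


def topics_for(namespace, app, path, rules=RULES):
    """Return the topics of every rule that matches this container log file."""
    return [r["topic"] for r in rules
            if r["namespace"] == namespace
            and fnmatch(app, r["app"])
            and fnmatch(path, r["path"])]


print(topics_for("prod", "order-api", "/var/log/app/error.log"))
print(topics_for("prod", "cart", "/var/log/app/access.log"))
```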

Auditing, Event Recording, and Health Checks

4.1 Auditing

User actions inside containers are recorded for audit purposes (Fig. 8).

4.2 Event Persistence and Health Checks

Because events are kept in etcd for only about one hour by default, Yiche stores them in Elasticsearch for long-term analysis.
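
Events are typically shipped to Elasticsearch in batches through its bulk API, which takes newline-delimited JSON: one action line followed by one document line per event. The sketch below only builds that request body; the index name and the event fields are assumptions, and no request is sent.

```python
import json


def to_bulk_ndjson(events, index="k8s-events"):
    """Serialize events as an Elasticsearch _bulk request body (NDJSON)."""
    lines = []
    for ev in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps({                                # document line
            "cluster": ev["cluster"],
            "namespace": ev["namespace"],
            "reason": ev["reason"],
            "message": ev["message"],
            "timestamp": ev["timestamp"],
        }))
    if not lines:
        return ""
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


events = [{
    "cluster": "self-built", "namespace": "prod", "reason": "BackOff",
    "message": "Back-off restarting failed container",
    "timestamp": "2024-01-01T00:00:00Z",
}]
print(to_bulk_ndjson(events), end="")
```

Adding a cluster field to each document is one way to keep events from several clusters in a single index and still filter by origin.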

Health‑check components report the status of resources, workloads, and configurations, providing anomaly descriptions, severity, root cause, impact, and remediation suggestions.
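
A health-check result with the fields listed above can be modelled as a small record. The sketch below pairs such a record with one simple check (ready replicas versus desired replicas); the severity levels and the wording of the messages are assumptions, not Yiche's actual rules.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    description: str
    severity: str
    root_cause: str
    impact: str
    remediation: str


def check_deployment(name, desired, ready):
    """Return a Finding when a workload has fewer ready replicas than desired."""
    if ready >= desired:
        return None
    return Finding(
        description=f"{name}: {ready}/{desired} replicas ready",
        severity="critical" if ready == 0 else "warning",
        root_cause="unknown; inspect Pod events and container logs",
        impact="no capacity" if ready == 0 else "reduced capacity",
        remediation="check image pull, probes, and resource requests",
    )


print(check_deployment("demo-api", desired=3, ready=1))
```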

Future Outlook

5.1 Shortening the External‑to‑Internal Access Path

Current flow: external → company Layer‑7 load balancer → cluster VIP → Traefik → service, which adds latency and failure points. Two proposals:

• Remove the intermediate VIP and let Traefik nodes register themselves with the upstream load balancer automatically when scaling.

• Eliminate both the VIP and Traefik by using an Ingress controller that registers Pod IPs directly with the Layer-7 load balancer, separating the data plane from the control plane.
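
Registering Pod IPs directly comes down to keeping the load balancer's backend list in sync with the set of ready Pods. A minimal sketch of that reconciliation step follows; the IP addresses are placeholders, and how the changes are pushed to the load balancer is left out because the article does not describe its API.

```python
def upstream_changes(registered, ready_pod_ips):
    """Compute which backends to add to or remove from the load balancer."""
    registered, ready = set(registered), set(ready_pod_ips)
    return {
        "add": sorted(ready - registered),      # ready Pods not yet registered
        "remove": sorted(registered - ready),   # backends whose Pod is gone
    }


changes = upstream_changes(
    registered=["10.16.0.5", "10.16.0.6"],
    ready_pod_ips=["10.16.0.6", "10.16.0.7"],
)
print(changes)  # {'add': ['10.16.0.7'], 'remove': ['10.16.0.5']}
```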

5.2 Multi‑Cluster Management Model

At present, each application is dispatched to a cluster the user selects, and migration requires manual steps. The goal is a top-level control plane that handles only the scheduling of deployments across clusters and leaves the clusters themselves untouched. Users would interact with objects in that control plane rather than with individual clusters, which improves resilience and user experience.
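
What a top-level scheduler might decide can be sketched as a placement function that spreads replicas over healthy clusters by weight and skips unhealthy ones. The weights, health flags, and cluster names are hypothetical, and this is one possible policy rather than Yiche's planned design.

```python
def place(replicas, clusters):
    """Spread replicas over healthy clusters in proportion to their weights."""
    healthy = {name: c["weight"] for name, c in clusters.items()
               if c["healthy"] and c["weight"] > 0}
    if not healthy:
        raise RuntimeError("no healthy cluster available")
    total = sum(healthy.values())
    plan = {name: replicas * w // total for name, w in healthy.items()}
    # Hand out replicas lost to integer division, heaviest cluster first.
    leftover = replicas - sum(plan.values())
    for name in sorted(healthy, key=lambda n: (-healthy[n], n))[:leftover]:
        plan[name] += 1
    return plan


clusters = {
    "self-built": {"weight": 2, "healthy": True},
    "tke": {"weight": 1, "healthy": True},
}
print(place(10, clusters))  # {'self-built': 7, 'tke': 3}

clusters["self-built"]["healthy"] = False
print(place(10, clusters))  # {'tke': 10}
```

When a cluster is marked unhealthy, recomputing the plan moves its share to the remaining clusters, which is the kind of migration that currently requires manual steps.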

Conclusion

The article outlines Yiche’s current Kubernetes deployment, challenges arising from growing cluster scale, and ongoing explorations of technologies that better serve users.
