Using Tencent Cloud EKS Virtual Nodes to Solve CronJob Isolation and Scheduling Challenges

By offloading thousands of short-lived CronJob pods to Tencent Cloud EKS serverless virtual nodes, Zuoyebang isolated them from online services, eliminated IP waste, achieved millisecond-level parallel scheduling and sub-3-second startup, freed 10% of cluster resources, and cut the cost of scheduled tasks by roughly 70%, all while markedly improving cluster stability.

Tencent Cloud Developer

Background: As Zuoyebang moved its services to cloud-native containers, its clusters grew and mixed deployment became more complex. Large numbers of CronJob pods share the same production cluster with online services, which causes a growing number of cluster-level problems.

Problem 1 – Node Stability: Thousands of pods are created and deleted every minute, which produces a massive number of cgroup entries. Reading memory cgroup statistics (/sys/fs/cgroup/memory/memory.stat) becomes slow, the CPU spends much of its time in kernel mode, and network latency spikes. The kernel's memcg_stat_show function has to traverse a huge memcg tree, which makes the delay severe.
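The symptom described above can be checked on a node with a short script. The following is a minimal sketch, assuming a cgroup v1 layout with the memory controller mounted at /sys/fs/cgroup/memory; the helper names `time_read` and `count_cgroups` are illustrative and not part of any tool mentioned in the article.

```python
import os
import time


def time_read(path: str, repeats: int = 5) -> float:
    """Return the slowest wall-clock time (seconds) of `repeats` full reads of `path`."""
    worst = 0.0
    for _ in range(repeats):
        start = time.perf_counter()
        with open(path, "rb") as f:
            f.read()
        worst = max(worst, time.perf_counter() - start)
    return worst


def count_cgroups(root: str) -> int:
    """Count directories below `root`; in cgroupfs each directory is one cgroup."""
    return sum(len(dirs) for _, dirs, _ in os.walk(root))


if __name__ == "__main__":
    root = "/sys/fs/cgroup/memory"  # assumed cgroup v1 mount point
    stat = os.path.join(root, "memory.stat")
    if os.path.exists(stat):
        print(f"memory cgroups: {count_cgroups(root)}")
        print(f"slowest memory.stat read: {time_read(stat) * 1000:.1f} ms")
    else:
        print("cgroup v1 memory controller not mounted here")
```

Running this periodically on a busy node and on a quiet one would show whether read latency tracks the number of memory cgroups.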

Problem 2 – Resource Utilization: With TKE VPC-CNI networking, the number of pod IPs per node is limited. Almost half of those IPs are reserved for CronJob pods, which wastes IPs and lowers overall resource utilization. Because CronJob pods are short-lived, many of the resources reserved for them also sit idle.

Two further issues: scheduling is serial, so placing thousands of jobs at midnight is slow; and cgroup isolation is incomplete, so compute-intensive CronJob pods interfere with online services.

Solution – EKS Virtual Nodes: Tencent Cloud Elastic Kubernetes Service (EKS) provides serverless virtual nodes that can be added to an existing TKE cluster. Pods scheduled on virtual nodes have the same network connectivity as normal pods but run in isolated VMs. This removes the need to reserve resources in advance and enables pay-as-you-go usage.

All CronJob workloads are dispatched to virtual nodes, achieving isolation from online services while still allowing inter‑service communication.
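One common way to steer a workload onto a particular kind of node is a node selector combined with a toleration in the pod template. The sketch below builds such a `batch/v1` CronJob manifest as a Python dictionary. The label `example.com/node-type` and the taint key `example.com/virtual-node` are placeholders invented for this illustration; the real label and taint on a virtual node depend on the provider and should be read from the node itself. The article does not state which mechanism Zuoyebang used.

```python
import json

# Assumed placeholder values; replace with the label and taint found on the real virtual node.
VIRTUAL_NODE_LABEL = {"example.com/node-type": "virtual"}
VIRTUAL_NODE_TAINT_KEY = "example.com/virtual-node"


def cronjob_on_virtual_node(name: str, schedule: str, image: str, command: list) -> dict:
    """Build a batch/v1 CronJob whose pods select and tolerate the virtual node."""
    pod_spec = {
        "restartPolicy": "Never",
        # Only nodes carrying this label are eligible.
        "nodeSelector": dict(VIRTUAL_NODE_LABEL),
        # Allow the pod onto a node that is tainted to repel ordinary workloads.
        "tolerations": [
            {"key": VIRTUAL_NODE_TAINT_KEY, "operator": "Exists", "effect": "NoSchedule"}
        ],
        "containers": [{"name": name, "image": image, "command": command}],
    }
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "concurrencyPolicy": "Forbid",
            "jobTemplate": {"spec": {"template": {"spec": pod_spec}}},
        },
    }


if __name__ == "__main__":
    manifest = cronjob_on_virtual_node(
        "nightly-report", "0 0 * * *", "busybox:1.36", ["sh", "-c", "date"]
    )
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON manifests as well as YAML.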

Task Scheduler: To overcome the serial behavior of the default scheduler, a custom task scheduler was built that schedules CronJob pods onto virtual nodes in parallel batches. It achieves millisecond-level scheduling for large batches of pods and falls back to standard TKE nodes when necessary.
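The article does not publish the scheduler's code, so the following is only a rough sketch of the idea: place a batch of pods concurrently, prefer the virtual node, and fall back to a standard node if that fails. The node names and the `bind` callback are hypothetical; in a real scheduler `bind` would call the Kubernetes binding API.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional, Tuple

# Hypothetical node names used only for this illustration.
VIRTUAL_NODE = "virtual-node"
STANDARD_NODE = "standard-node"


def schedule_batch(
    pods: Iterable[str],
    bind: Callable[[str, str], bool],
    workers: int = 32,
) -> Dict[str, Optional[str]]:
    """Place every pod concurrently; try the virtual node first, then fall back.

    `bind(pod, node)` is a caller-supplied function returning True on success.
    Returns a mapping of pod name to the node it landed on (None if both failed).
    """

    def place(pod: str) -> Tuple[str, Optional[str]]:
        for node in (VIRTUAL_NODE, STANDARD_NODE):
            if bind(pod, node):
                return pod, node
        return pod, None

    # Pods are placed in parallel by a thread pool instead of one after another.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(place, pods))
```

A thread pool suits this sketch because binding is I/O-bound (an API call per pod), so the wall-clock time for a batch is governed by the slowest call rather than the sum of all calls.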

Operational Alignment: Differences between virtual and standard nodes were smoothed over so that the migration is transparent to developers. Log collection is unified through the EKS-provided log agent, which forwards container stdout to a Kafka topic. Monitoring is unified by having Prometheus scrape the metrics interfaces exported by the pod sandboxes, so pods on virtual nodes report the same CPU, memory, disk, and network metrics as pods on standard nodes.

Performance Boost: Pods on virtual nodes start within seconds. The two main sources of latency were optimized: image pulls are accelerated with EKS image caching, and pod creation and initialization on virtual nodes were made faster. As a result, time-critical jobs start with less than 3 seconds of variance.

Results: By separating CronJob and online workloads with TKE plus EKS virtual nodes, cluster stability improved, 10% of cluster resources were freed, and the cost of scheduled tasks dropped by about 70%. The solution also delivered millisecond-level pod scheduling and second-level pod startup, meeting strict timing requirements.
