How Serverless Kubernetes Virtual Nodes Cut Costs and Boost Scalability
Zhang's team at Zuoyebang details their journey to serverless Kubernetes virtual nodes, explaining how elastic scaling, fine-grained scheduling, and cost‑effective resource utilization transformed high‑peak online services, while addressing challenges in scheduling, observability, performance, and multi‑cloud resilience.
Background
Zuoyebang's backend technology is moving toward cloud‑native architecture, aiming to improve resource utilization by running more application instances on fewer compute nodes. Serverless offers elastic scaling, strong isolation, pay‑per‑use billing, and automated operations, which reduce delivery time, risk, infrastructure cost, and labor.
Serverless adoption has been a core focus, with two main approaches: function computing and Kubernetes Serverless virtual nodes.
Kubernetes Serverless virtual nodes provide the same experience as running on physical machines, allowing seamless migration and scheduling, exemplified by Alibaba Cloud ECI.
Because Zuoyebang's business requires fine‑grained resource management, the team developed a custom solution on top of cloud‑provider capabilities.
2020‑2022 Milestones
2020: Compute‑intensive workloads were shifted to Serverless virtual nodes to leverage their strong isolation.
2021: Scheduled (cron) jobs were moved to Serverless virtual nodes, replacing node scale‑out for short‑lived tasks and improving resource efficiency.
2022: Core online services, which are latency‑sensitive, were migrated to Serverless virtual nodes, achieving significant cost savings during peak traffic while maintaining performance parity with physical servers.
1. Kubernetes Serverless Virtual Nodes
Virtual nodes are not physical machines but a scheduling capability that allows pods from a standard Kubernetes cluster to run on external resources. Pods on virtual nodes retain the same security isolation, network isolation, and connectivity as on bare‑metal servers, while offering on‑demand, pay‑per‑use provisioning.
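The opt‑in typically happens in the pod spec itself. Below is a minimal sketch assuming the common virtual‑kubelet conventions (a `type: virtual-kubelet` node label and a `virtual-kubelet.io/provider` taint); the exact label and taint keys vary by cloud provider:

```python
def virtual_node_pod(name: str, image: str) -> dict:
    """Build a pod manifest that opts in to running on a virtual node.

    The nodeSelector label and toleration key follow common
    virtual-kubelet conventions; providers may use different values.
    """
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "containers": [{"name": name, "image": image}],
            # Select a virtual-kubelet node instead of a physical one.
            "nodeSelector": {"type": "virtual-kubelet"},
            # Tolerate the taint virtual-kubelet places on its node.
            "tolerations": [{
                "key": "virtual-kubelet.io/provider",
                "operator": "Exists",
                "effect": "NoSchedule",
            }],
        },
    }
```

Without the toleration, the scheduler will never place the pod on the tainted virtual node, which is what makes offloading an explicit, per‑workload decision.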
2. Cost Advantages
Most of Zuoyebang's services are containerized. Online traffic exhibits short, intense peaks (≈4 hours per day), with average server load around 60% during peaks and 10% off‑peak. Assume a baseline cost of C per hour for an owned server and 1.5C per hour for equivalent Serverless capacity. Keeping roughly 70% of the peak‑sized fleet on owned servers and renting the remaining 30% from Serverless only during the 4 peak hours cuts the total daily cost from 24C (a fully owned, peak‑sized fleet running around the clock) to 0.7 × 24C + 0.3 × 4 × 1.5C = 18.6C, a 22.5% reduction.
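The arithmetic can be reproduced in a few lines. Note that the 30% offload fraction is not stated explicitly in the article; it is the split implied by the 24C → 18.6C figures:

```python
def daily_cost(serverless_multiplier: float = 1.5,
               peak_hours: int = 4,
               offload_fraction: float = 0.3) -> float:
    """Daily cost, in units of C, when a fraction of peak capacity
    runs on Serverless only during peak hours and the rest stays on
    owned servers around the clock."""
    owned = (1 - offload_fraction) * 24            # owned fleet, 24h/day
    serverless = offload_fraction * peak_hours * serverless_multiplier
    return owned + serverless

all_owned = 24.0          # fully owned, peak-sized fleet: 24 C per day
mixed = daily_cost()      # ~18.6 C per day
saving = 1 - mixed / all_owned  # ~22.5 %
```

The break‑even point is worth noticing: even at a 1.5× hourly premium, Serverless wins whenever the offloaded capacity is needed for well under 16 hours a day (24 / 1.5), and a 4‑hour peak is far below that.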
Problems and Solutions
Key challenges include scheduling and control, observability, and performance/stability.
1. Scheduling & Control
Two main issues: deciding which pods should be placed on virtual nodes during scale‑out, and ensuring pods on virtual nodes are preferentially terminated during scale‑in. Existing Kubernetes versions lack these capabilities.
Scale‑out strategy: when physical node utilization reaches a configured threshold, overflow pods are scheduled to Serverless virtual nodes, adding capacity without overcommitting the owned fleet while preserving the cost benefits of the hybrid model.
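The placement rule reduces to a simple overflow decision; a toy version (the threshold value here is illustrative, the article notes it can be computed automatically or set manually):

```python
def place_pod(physical_utilization: float, threshold: float = 0.6) -> str:
    """Decide where a new pod goes during scale-out.

    Below the utilization threshold, prefer owned physical nodes;
    at or above it, overflow to Serverless virtual nodes.
    """
    return "physical" if physical_utilization < threshold else "virtual"
```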
Scale‑in strategy: Pods on virtual nodes are annotated so that the kube‑controller‑manager prioritizes them for termination, reducing cost by shrinking higher‑priced Serverless resources first.
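Upstream Kubernetes later added a similar mechanism: the `controller.kubernetes.io/pod-deletion-cost` annotation (beta since 1.22), which the ReplicaSet controller consults when picking scale‑in victims. A sketch of that selection order, using the upstream annotation as a stand‑in for the team's custom one:

```python
from typing import Dict, List

# Upstream annotation; pods with lower cost are removed first.
DELETION_COST = "controller.kubernetes.io/pod-deletion-cost"

def scale_in_order(pods: List[Dict]) -> List[str]:
    """Return pod names in the order a cost-aware controller would
    remove them: lowest deletion cost first, so virtual-node pods
    annotated with a negative cost go before physical-node pods
    (which default to cost 0)."""
    def cost(pod: Dict) -> int:
        annotations = pod.get("metadata", {}).get("annotations", {})
        return int(annotations.get(DELETION_COST, 0))
    return [p["metadata"]["name"] for p in sorted(pods, key=cost)]
```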
The DevOps platform supports automatic threshold calculation, manual adjustments, integration with HPA and Cron‑HPA, one‑click node isolation for failure scenarios, and multi‑cloud scheduling.
2. Observability
Custom monitoring, logging, and tracing services are used. Virtual nodes expose standard kubelet metrics, allowing seamless Prometheus integration. Logs are collected via CRD and forwarded to Kafka, then processed by an internal log consumer. For distributed tracing, the Jaeger client skips the agent on virtual nodes and sends data directly to the collector.
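Many Jaeger clients can already be steered between the agent and the collector through standard environment variables; a sketch of that per‑node switch (the collector URL and agent host are placeholders, and not every client language honors every variable):

```python
from typing import Dict

def jaeger_env(on_virtual_node: bool,
               collector_url: str = "http://jaeger-collector:14268/api/traces",
               agent_host: str = "localhost") -> Dict[str, str]:
    """Environment for a Jaeger client.

    On virtual nodes there is no node-local agent DaemonSet, so report
    straight to the collector; elsewhere keep the usual agent path.
    """
    if on_virtual_node:
        # JAEGER_ENDPOINT makes supporting clients bypass the agent
        # and POST spans to the collector's HTTP port (14268).
        return {"JAEGER_ENDPOINT": collector_url}
    # Default path: UDP to a node-local agent on port 6831.
    return {"JAEGER_AGENT_HOST": agent_host, "JAEGER_AGENT_PORT": "6831"}
```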
3. Performance, Stability & Other Issues
Virtual nodes may have performance differences due to underlying hardware and virtualization overhead, requiring thorough testing for latency‑sensitive workloads.
During peak scaling, cloud provider inventory limits can cause resource shortages; automatic instance type upgrades mitigate this.
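A minimal sketch of such a fallback, with hypothetical instance‑type names and a hand‑written upgrade path (the real mapping would come from the provider's catalog):

```python
from typing import Dict, Optional

def provision(requested_type: str,
              available: Dict[str, int],
              upgrade_path: Dict[str, str]) -> Optional[str]:
    """Try the requested instance type; on inventory shortage, walk an
    upgrade path to a larger type. Returns the type actually used,
    or None if every type along the path is sold out."""
    t: Optional[str] = requested_type
    while t is not None:
        if available.get(t, 0) > 0:
            available[t] -= 1   # claim one instance of this type
            return t
        t = upgrade_path.get(t)  # fall back to the next larger type
    return None
```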
Debugging is harder because virtual nodes run in the provider’s pool; core dumps are now auto‑uploaded to OSS for analysis.
Scale and Benefits
The solution is production‑ready, supporting nearly ten thousand cores of core online services during peak periods on Kubernetes Serverless virtual nodes. As traffic grows, the scale on virtual nodes will expand further, delivering substantial resource cost savings.