How Zuoyebang Cut Costs by 22% with Kubernetes Serverless Virtual Nodes
Zuoyebang’s shift to a cloud‑native architecture uses Alibaba Cloud’s Kubernetes Serverless virtual nodes to absorb peak traffic elastically, for a theoretical cost reduction of 22.5%. Along the way the team addressed scheduling, observability, and performance challenges with a custom scheduler, bridged monitoring, and careful testing.
Background
Zuoyebang’s backend technology stack is moving toward cloud‑native, with better resource utilization as a core goal. Serverless offers elastic scaling, strong isolation, pay‑as‑you‑go billing, and automated operations, which translate into faster delivery, lower risk, and reduced infrastructure and labor costs. The long‑term Serverless strategy covers two options: function compute and Kubernetes Serverless virtual nodes (e.g., Alibaba Cloud ECI).
In 2020 the team began experimenting by moving some compute‑intensive jobs to Serverless virtual nodes to isolate workloads. In 2021, scheduled tasks were shifted to these nodes to replace manual scaling for short‑lived jobs, improving resource usage. By 2022, core online services—highly latency‑sensitive and with pronounced traffic peaks—were migrated, demanding that performance and stability on virtual nodes match that of physical servers.
Kubernetes Serverless Virtual Nodes
A virtual node is not a physical machine but a scheduling capability that allows pods in a standard Kubernetes cluster to be placed on resources outside the cluster’s own servers. Pods on virtual nodes retain the same security isolation, network isolation, and connectivity as on bare‑metal servers, while benefiting from on‑demand provisioning and usage‑based billing.
Cost Advantage
Most of Zuoyebang’s services are containerized, and online traffic exhibits short, intense peaks (about 4 hours per day). During peaks, server utilization reaches ~60%; off‑peak it drops to ~10%. This pattern suits Serverless elasticity. Let C be the hourly cost of an owned fleet sized for peak, so a full day costs 24C, and assume the same capacity on Serverless costs 1.5C per hour. A simple calculation then shows the savings:
Total cost with only owned servers: C × 24 = 24C
Keep 70% of owned servers and add Serverless for 30% of peak capacity during the 4 peak hours: C × 24 × 0.7 + 1.5C × 4 × 0.3 = 16.8C + 1.8C = 18.6C
Theoretical savings over a full day: (24C − 18.6C) / 24C = 22.5%. Elastically scheduling peak workloads onto Serverless can therefore lower resource costs substantially.
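The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the article's simplified model: the function name and parameters are illustrative, and costs are expressed in units of C rather than real prices.

```python
def blended_daily_cost(owned_fraction, burst_fraction, peak_hours,
                       serverless_multiplier, hours_per_day=24):
    """Daily cost, in units of C, of a mixed owned + Serverless fleet.

    owned_fraction        share of the peak-sized fleet kept as owned servers
    burst_fraction        share of peak capacity served by Serverless
    peak_hours            hours per day the Serverless capacity is running
    serverless_multiplier hourly Serverless price relative to owned servers
    """
    owned = hours_per_day * owned_fraction
    burst = serverless_multiplier * peak_hours * burst_fraction
    return owned + burst


baseline = 24.0  # owned servers only, 24 hours at cost C
blended = blended_daily_cost(0.7, 0.3, peak_hours=4, serverless_multiplier=1.5)
savings = (baseline - blended) / baseline

print(f"{blended:.1f}C per day, {savings:.1%} saved")
```

Changing the peak duration or the Serverless price multiplier shows how sensitive the savings are to those two assumptions.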
Problems and Solutions
Scheduling and Control Issues
The scheduler must decide (1) which pods to place on virtual nodes during scale‑up and (2) which pods to evict from virtual nodes first during scale‑down. Existing Kubernetes versions lack built‑in support for these policies.
Scale‑up strategy: The infrastructure team defines a threshold for the maximum number of pod replicas that can run on physical nodes based on observed peak demand. Pods exceeding this threshold are automatically scheduled to virtual nodes by a custom scheduler.
Scale‑down strategy: Pods on virtual nodes receive a custom annotation. The kube-controller-manager is patched to remove pods carrying this annotation first, so that the cheaper owned, reserved‑instance capacity stays in use and the pay‑as‑you‑go capacity is released first.
The custom scheduler also integrates with the DevOps platform, allowing operators to set thresholds manually, combine with HPA/cron‑HPA, and perform one‑click isolation of virtual nodes in failure scenarios.
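The two policies above can be illustrated with a small, self-contained model. It is not the team's scheduler or their kube-controller-manager patch; the `Pod` class, the `target` field (standing in for the custom annotation), and all function names are hypothetical, chosen only to show the ordering logic.

```python
from dataclasses import dataclass

PHYSICAL = "physical-node"
VIRTUAL = "virtual-node"


@dataclass
class Pod:
    name: str
    target: str  # stands in for the custom annotation on virtual-node pods


def place_replicas(service, desired, physical_threshold):
    """Scale-up: replicas beyond the threshold go to virtual nodes."""
    pods = []
    for i in range(desired):
        target = PHYSICAL if i < physical_threshold else VIRTUAL
        pods.append(Pod(f"{service}-{i}", target))
    return pods


def scale_down(pods, desired):
    """Scale-down: remove virtual-node pods before physical-node pods."""
    to_remove = max(len(pods) - desired, 0)
    # sorted() is stable, so pods keep their relative order within each group.
    ordered = sorted(pods, key=lambda p: p.target != VIRTUAL)
    victims = {p.name for p in ordered[:to_remove]}
    return [p for p in pods if p.name not in victims]


pods = place_replicas("api", desired=5, physical_threshold=3)
remaining = scale_down(pods, desired=3)
print([f"{p.name}@{p.target}" for p in remaining])
```

In this toy run, the two replicas above the threshold land on virtual nodes and are the first to go when the replica count drops back to three.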
Observability Issues
Monitoring, logging, and tracing services are self‑built. Because virtual nodes run monitoring agents provided by the cloud vendor, the team had to bridge data back to their own observability stack.
Monitoring: Virtual nodes expose standard kubelet metrics, allowing seamless Prometheus scraping of CPU, memory, disk, and network usage.
Logging: A CRD‑based configuration forwards logs from virtual nodes to a central Kafka pipeline, where a custom consumer normalizes logs from both cloud‑provider and self‑hosted nodes.
Distributed Tracing: Since a daemonset‑based Jaeger agent cannot run on virtual nodes, the Jaeger client is modified to detect the environment via an environment variable and send traces directly to the Jaeger collector.
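The tracing change described above amounts to choosing a reporting target at startup. The sketch below shows one way such a switch could look. The environment variable names `ON_VIRTUAL_NODE`, `TRACE_COLLECTOR_URL`, and `NODE_IP` are assumptions for illustration and are not taken from the article or from the Jaeger client; the ports are Jaeger's usual defaults (6831/UDP for the agent, 14268/HTTP for the collector).

```python
import os

DEFAULT_COLLECTOR_URL = "http://jaeger-collector:14268/api/traces"


def trace_endpoint(env=None):
    """Pick where spans are sent, based on the runtime environment.

    On a virtual node there is no daemonset agent, so spans go straight
    to the collector. Elsewhere they go to the agent on the local node.
    """
    env = os.environ if env is None else env
    if env.get("ON_VIRTUAL_NODE") == "true":  # hypothetical variable
        return env.get("TRACE_COLLECTOR_URL", DEFAULT_COLLECTOR_URL)
    host = env.get("NODE_IP", "localhost")  # hypothetical variable
    return f"udp://{host}:6831"


print(trace_endpoint({"ON_VIRTUAL_NODE": "true"}))
print(trace_endpoint({"NODE_IP": "10.0.0.7"}))
```

Passing the environment as a dictionary keeps the decision easy to test without touching the real process environment.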
Performance, Stability, and Other Concerns
Performance variance: Different underlying hardware and virtualization overhead may cause latency differences; latency‑sensitive services must be benchmarked before migration.
Cloud inventory risk: During massive scale‑up, specific instance types may be unavailable; the system falls back to the next larger spec (e.g., 2c2G → 2c4G).
Debugging difficulty: Virtual nodes are managed by the cloud provider, limiting direct access to system logs or core dumps. Alibaba Cloud ECI now supports automatic core‑dump upload to OSS to mitigate this.
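The inventory fallback mentioned above (2c2G → 2c4G) can be sketched as a walk up an ordered list of specs. The ladder contents beyond the article's example, the `in_stock` callback, and the function name are hypothetical; a real system would query the cloud provider's inventory or react to a failed provisioning request.

```python
# (vCPU, memory in G), ordered from smallest to largest. Only the first two
# entries come from the article's example; the third is illustrative.
SPEC_LADDER = [(2, 2), (2, 4), (2, 8)]


def pick_spec(requested, in_stock, ladder=SPEC_LADDER):
    """Return the requested spec, or the next larger one that is available.

    in_stock is a callable taking a spec and returning True if it can be
    provisioned. Returns None when nothing at or above the request is left.
    """
    start = ladder.index(requested)
    for spec in ladder[start:]:
        if in_stock(spec):
            return spec
    return None


sold_out = {(2, 2)}
chosen = pick_spec((2, 2), in_stock=lambda spec: spec not in sold_out)
print(chosen)
```

Falling back to a larger spec costs more per pod, so it is worth recording how often the fallback fires.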
Scale and Benefits
The solution is now running in production, serving core online traffic on nearly ten thousand CPU cores of Alibaba Cloud ACK + ECI virtual nodes. As business volume grows, the footprint on Serverless virtual nodes will expand further, delivering ongoing cost savings.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.