Scaling Cloud‑Native Containers at DeWu: Multi‑Cluster Management and Cost Optimization

This article details DeWu's cloud‑native transformation since August 2021, covering multi‑cluster federation, application profiling, custom scheduling plugins, resource pre‑reservation, co‑location of online and offline workloads, cost‑saving hardware choices, multi‑cloud strategy, and the development of the KubeAI platform for AI scenarios.

dbaplus Community

Apr 1, 2024

Introduction

DeWu App's rapid growth required an efficient cloud‑native infrastructure. Starting in August 2021, the team pursued high availability, observability, and operational efficiency while keeping costs under control. This article summarizes the solutions and practices applied during that transformation.

Cloud‑Native Application Management

Management Model

The team adopted an OAM‑style abstraction: an “application cluster” maps to a Kruise CloneSet, each Pod is an instance of that cluster, and “application routing” maps to Ingress/Service. Configuration and feature layers are rendered with Helm into Kubernetes resources, which simplifies CI/CD and middleware management.

Sidecar containers also handle permission management, mirroring ECS user login rights.

Multi‑Cluster Management

The team implemented cluster federation (Karmada/KubeAdmiral) so that the failure of a single cluster does not take services down. Host clusters use PropagationPolicies and OverridePolicies to control workload distribution, while member clusters run a custom MCS‑Controller and MCS‑Validator to keep Service/Endpoint objects consistent across clusters.
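As an illustration of how a host cluster might describe workload distribution, the sketch below builds a Karmada‑style PropagationPolicy as a plain Python dictionary. The field layout follows the public Karmada `policy.karmada.io/v1alpha1` API; the helper name, application name, namespace, and cluster names are hypothetical, and the source does not show DeWu's actual policies.

```python
import json
from typing import List


def propagation_policy(app: str, namespace: str, clusters: List[str]) -> dict:
    """Build a PropagationPolicy that sends one CloneSet to the given member clusters.

    Hypothetical helper: only the manifest layout comes from Karmada's public API.
    """
    if not clusters:
        raise ValueError("at least one member cluster is required")
    return {
        "apiVersion": "policy.karmada.io/v1alpha1",
        "kind": "PropagationPolicy",
        "metadata": {"name": f"{app}-propagation", "namespace": namespace},
        "spec": {
            # Select the workload to distribute (a Kruise CloneSet in this model).
            "resourceSelectors": [
                {"apiVersion": "apps.kruise.io/v1alpha1", "kind": "CloneSet", "name": app}
            ],
            # Restrict placement to an explicit list of member clusters.
            "placement": {"clusterAffinity": {"clusterNames": list(clusters)}},
        },
    }


if __name__ == "__main__":
    policy = propagation_policy("order-service", "prod", ["member-a", "member-b"])
    print(json.dumps(policy, indent=2))
```

Listing two or more member clusters is what gives the workload a second home if one cluster fails. Per‑cluster differences such as image registries or replica counts would go into a separate OverridePolicy, which is not modeled here.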

Container Scheduling Optimization and Co‑Location

Application Profiling

Historical resource usage is collected via Prometheus. A custom KubeRM service computes a profile value (Pod Request = utilization / safety water‑mark) for CPU, memory, and GPU. These values guide resource specifications for new workloads.
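The profile formula can be sketched in a few lines. The code below is a minimal illustration, not KubeRM's implementation: the source gives only the formula, so taking "utilization" as a nearest‑rank percentile of historical samples, the function name, and the default percentile are all assumptions.

```python
import math
from typing import Sequence


def profile_request(usage_samples: Sequence[float],
                    safety_watermark: float,
                    percentile: float = 0.95) -> float:
    """Pod Request = utilization / safety water-mark.

    Assumption: utilization is the nearest-rank percentile of the samples.
    """
    if not usage_samples:
        raise ValueError("usage_samples must not be empty")
    if not 0 < safety_watermark <= 1:
        raise ValueError("safety_watermark must be in (0, 1]")
    if not 0 < percentile <= 1:
        raise ValueError("percentile must be in (0, 1]")
    ordered = sorted(usage_samples)
    # Nearest-rank percentile: smallest value covering `percentile` of samples.
    rank = max(1, math.ceil(percentile * len(ordered)))
    utilization = ordered[rank - 1]
    return utilization / safety_watermark


if __name__ == "__main__":
    cpu_cores = [1.0, 1.2, 1.5, 2.0]  # hypothetical historical CPU usage
    print(profile_request(cpu_cores, safety_watermark=0.5, percentile=1.0))
```

With a peak usage of 2.0 cores and a water‑mark of 0.5, the sketch yields a Request of 4.0 cores, so the pod is expected to run at about half of what it requests.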

Profiles are applied automatically to P3/P4 services.

For other services, profiles are offered as recommendations that the owner can accept.

Different resource pools can enforce distinct activation strategies.

GPU memory profiles are only recommended, never auto‑applied.
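The activation rules above can be read as a small decision function. This is a sketch under stated assumptions: the function name, the string labels, and the `pool_auto_apply` flag (standing in for per‑pool activation strategies) are hypothetical.

```python
AUTO_APPLY_LEVELS = {"P3", "P4"}  # service levels whose profiles are applied automatically


def profile_action(service_level: str, resource: str, pool_auto_apply: bool = True) -> str:
    """Return 'auto_apply' or 'recommend' for one service/resource pair."""
    # GPU memory profiles are only ever recommended.
    if resource == "gpu_memory":
        return "recommend"
    # A resource pool may opt out of automatic application (hypothetical flag).
    if not pool_auto_apply:
        return "recommend"
    return "auto_apply" if service_level in AUTO_APPLY_LEVELS else "recommend"


if __name__ == "__main__":
    for level, resource in [("P3", "cpu"), ("P1", "cpu"), ("P4", "gpu_memory")]:
        print(level, resource, profile_action(level, resource))
```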

Billing depends on adoption: services that accept the profile are billed on their Request, while services that do not are billed on their Limit. This encourages users to adopt the recommended values.
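A short arithmetic sketch shows why this pricing rule nudges adoption. The function names and the unit price are hypothetical; the source states only that profiled workloads are billed on Request and others on Limit.

```python
def billable_cores(request_cores: float, limit_cores: float, profile_adopted: bool) -> float:
    """Billed quantity: Request if the profile was adopted, otherwise Limit."""
    return request_cores if profile_adopted else limit_cores


def monthly_cost(request_cores: float, limit_cores: float,
                 profile_adopted: bool, price_per_core: float) -> float:
    """Cost for one pod at a hypothetical flat price per core per month."""
    return billable_cores(request_cores, limit_cores, profile_adopted) * price_per_core


if __name__ == "__main__":
    # Same pod (Request 2 cores, Limit 4 cores), with and without the profile.
    print(monthly_cost(2, 4, profile_adopted=True, price_per_core=10.0))
    print(monthly_cost(2, 4, profile_adopted=False, price_per_core=10.0))
```

Because Limit is normally at least as large as Request, a service that declines the profile never pays less than one that accepts it.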

Resource Pre‑Reservation

A custom scheduler plugin defines reservation intents via CRDs, preventing high‑priority pods from being blocked by frequent updates or burst scaling.
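The idea of reserving capacity can be sketched as a fit check that hides reserved resources from everyone except their owner. This is a simplified model, not the plugin itself: the `Reservation` and `Node` classes and their fields are hypothetical, and a real plugin would also consume or expire a reservation once its pod is bound.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Reservation:
    owner: str   # hypothetical: application the capacity is held for
    cpu: float   # reserved CPU cores


@dataclass
class Node:
    allocatable_cpu: float
    used_cpu: float = 0.0
    reservations: List[Reservation] = field(default_factory=list)

    def available_for(self, app: str) -> float:
        """Free CPU as seen by `app`: capacity reserved for others is hidden."""
        held_for_others = sum(r.cpu for r in self.reservations if r.owner != app)
        return self.allocatable_cpu - self.used_cpu - held_for_others


def fits(node: Node, app: str, cpu_request: float) -> bool:
    """Filter step: can a pod of `app` with this CPU request land on `node`?"""
    return cpu_request <= node.available_for(app)


if __name__ == "__main__":
    node = Node(allocatable_cpu=16, used_cpu=8, reservations=[Reservation("pay", 6)])
    print(fits(node, "pay", 6))    # the owner sees its reserved capacity
    print(fits(node, "batch", 4))  # other apps see only the unreserved remainder
```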

Balanced Scheduling

The team implemented four scheduler plugins:

CoolDownHotNode: lowers the priority of nodes that have recently received pods, to avoid hot spots.

HybridUnschedulable: blocks pods that use elastic resources from being scheduled on certain nodes.

NodeBalance: balances each node’s CPU request against its profile value.

NodeInfoRt: incorporates real‑time scoring data into scheduling decisions.
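Two of these plugins can be sketched as a filter step followed by a score step, which is how the Kubernetes scheduling framework structures plugins. The code is a toy model under stated assumptions: the `NodeState` fields, the linear cool‑down penalty, and the window cap are hypothetical, and NodeBalance and NodeInfoRt are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_SCORE = 100  # node scores in the Kubernetes scheduling framework range from 0 to 100


@dataclass
class NodeState:
    name: str
    recent_pods: int      # pods bound in the recent window (hypothetical field)
    blocks_elastic: bool  # node refuses pods that use elastic resources


def passes_filter(node: NodeState, pod_uses_elastic: bool) -> bool:
    """HybridUnschedulable-style filter: reject elastic pods on marked nodes."""
    return not (pod_uses_elastic and node.blocks_elastic)


def cooldown_score(node: NodeState, window_cap: int = 10) -> int:
    """CoolDownHotNode-style score: fewer recent bindings give a higher score."""
    recent = min(node.recent_pods, window_cap)
    return MAX_SCORE * (window_cap - recent) // window_cap


def pick_node(nodes: List[NodeState], pod_uses_elastic: bool) -> Optional[str]:
    """Filter first, then choose the highest-scoring remaining node."""
    candidates = [n for n in nodes if passes_filter(n, pod_uses_elastic)]
    if not candidates:
        return None
    return max(candidates, key=cooldown_score).name


if __name__ == "__main__":
    nodes = [NodeState("a", 8, False), NodeState("b", 1, True), NodeState("c", 3, False)]
    print(pick_node(nodes, pod_uses_elastic=True))
    print(pick_node(nodes, pod_uses_elastic=False))
```

In this example an elastic pod lands on node `c`, because `b` is filtered out and `a` is still "hot", while a regular pod lands on `b`, the coolest node.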

Real‑Time Co‑Location

Online services are co‑located with Flink offline tasks using dedicated BE‑CPU/BE‑Memory resources and CPU binding strategies (LSX, LSR, LS, BE). A binding table defines these four application types and their CPU core allocation policies.

Offline Co‑Location (Phase 2)

Phase 2 introduced “OT” resources, which over‑commit BE resources for AI training and data‑processing tasks. Safeguards include a host safety water‑mark, CPU‑group priority (offline tasks always rank below online ones), isolated disks, and night‑time auto‑scaling that frees memory for offline workloads.
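The host safety water‑mark can be illustrated with one line of arithmetic. The formula below is an assumption, since the source does not define how the water‑mark is applied: it treats the water‑mark as the highest host CPU utilization that offline tasks may push the machine to.

```python
def offline_cpu_budget(capacity_cores: float,
                       online_usage_cores: float,
                       safety_watermark: float) -> float:
    """CPU that offline tasks may use without pushing the host past the water-mark.

    Assumed model: budget = capacity * water-mark - online usage, floored at zero.
    """
    if not 0 < safety_watermark <= 1:
        raise ValueError("safety_watermark must be in (0, 1]")
    return max(0.0, capacity_cores * safety_watermark - online_usage_cores)


if __name__ == "__main__":
    # Hypothetical 64-core host with a 75% water-mark.
    print(offline_cpu_budget(64, online_usage_cores=20, safety_watermark=0.75))
    print(offline_cpu_budget(64, online_usage_cores=60, safety_watermark=0.75))
```

When online usage is 20 cores the budget is 28 cores; when online usage climbs to 60 cores the budget drops to zero, which in this model is the point at which offline tasks would have to be throttled or evicted.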

Elastic Scaling

The team developed the KubeAutoScaler component to unify HPA, VPA, and scheduled scaling policies. It works with the profiling system to scale down low‑traffic services at night, releasing resources for offline tasks. GPU services use a Queue‑Proxy sidecar to trigger scaling based on traffic thresholds, with an Activator handling cold‑start scaling.
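How metric‑driven and scheduled scaling might combine can be sketched as follows. The replica formula is the standard Kubernetes HPA one, ceil(currentReplicas × currentMetric / targetMetric), without its tolerance band. Everything else is an assumption: the night window, the idea of lowering the replica floor at night, and all names are hypothetical and do not describe KubeAutoScaler's actual design.

```python
import math
from datetime import time


def hpa_desired(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """Standard HPA formula: ceil(currentReplicas * currentMetric / targetMetric)."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    return math.ceil(current_replicas * current_metric / target_metric)


def desired_replicas(now: time, current_replicas: int,
                     current_metric: float, target_metric: float,
                     day_min: int, night_min: int, max_replicas: int,
                     night_start: time = time(1, 0),
                     night_end: time = time(6, 0)) -> int:
    """Clamp the HPA result; at night a lower floor lets quiet services shrink."""
    floor = night_min if night_start <= now < night_end else day_min
    desired = hpa_desired(current_replicas, current_metric, target_metric)
    return max(floor, min(desired, max_replicas))


if __name__ == "__main__":
    # Low traffic: 10% CPU against a 60% target, currently 4 replicas.
    print(desired_replicas(time(14, 0), 4, 10, 60, day_min=4, night_min=1, max_replicas=10))
    print(desired_replicas(time(3, 0), 4, 10, 60, day_min=4, night_min=1, max_replicas=10))
```

During the day the service stays at its floor of 4 replicas even though traffic is low; inside the night window the same metrics allow it to shrink to 1 replica, which releases capacity for offline workloads.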

Resource and Cost Governance

Machine Model Replacement

Inference was moved from V100 GPUs to more cost‑effective A10 GPUs, cutting inference cost by about 20% and improving CPU performance. CPU‑intensive services were migrated from Intel to AMD CPUs, reducing CPU cost by about 14%.

Resource Pool Management

The team controlled redundancy based on release cycles, merged clusters by region and purpose, consolidated similar resource pools, and cleaned up fragmentation through pod re‑scheduling and host re‑allocation.

Workload Specification Governance

Resource specifications were standardized: CPU workloads use predefined CPU‑memory ratios and GPU workloads use CU units, with differential billing to align cost with actual usage.

Self‑Built Products

The team developed the KubeAI platform to host model training, reducing reliance on external cloud services and enabling unified management of AI workloads.

Multi‑Cloud Strategy

The team adopted a multi‑cloud approach to mitigate GPU shortages, improve bargaining power, and meet compliance requirements. Considerations include cross‑region service access, middleware dependencies, and data‑transfer costs.

Cloud‑Native AI Scenario

KubeAI provides end‑to‑end model development, training, inference, and version management, and now offers AIGC/GPT services to accelerate business outcomes.

Outlook

Future work includes further containerizing middleware, refining co‑location and elastic capacity solutions, enhancing Kubernetes stability, and expanding multi‑cloud capabilities to keep the infrastructure flexible and robust as the business scales.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
