
How Vivo Built a Hybrid‑Cloud AI Platform with Kubernetes and ACK

This article details how vivo AI's research institute created a hybrid‑cloud AI computing platform by integrating on‑premise bare‑metal servers with Alibaba Cloud ACK, using Kubernetes, Calico, and Terway to achieve elastic GPU resources, advanced storage features, and cost‑effective scaling for deep‑learning workloads.

Alibaba Cloud Native

Background

Hybrid cloud combines private and public cloud resources; major providers offer solutions such as AWS Outposts, Google Anthos, and Alibaba Cloud ACK. Vivo AI's platform needed elastic compute capacity and advanced cloud features, which led the team to a hybrid‑cloud strategy.

Solution Selection

Three implementation options were evaluated. The team chose the low‑cost option, which preserved the existing resource‑request process and allowed scaling on the order of hours.

The overall architecture places the Kubernetes control plane in the on‑premise data center, while worker nodes span bare‑metal servers and Alibaba Cloud VMs connected via a dedicated line. The VTraining platform can use cloud VMs transparently without modification.

Hybrid cloud architecture diagram

Implementation Details

Cluster Registration

The on‑premise cluster is registered with Alibaba Cloud. For registration to work, the VPC CIDR must not conflict with the cluster's Service CIDR. An ACK Agent is then deployed to maintain a long‑lived TLS 1.2 connection between the data center and Alibaba Cloud, securely forwarding console requests to the Kubernetes apiserver.
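The CIDR requirement can be checked before registration. The following is a minimal sketch using Python's standard `ipaddress` module; the function name and the example ranges are illustrative assumptions, not values from the article.

```python
import ipaddress


def cidrs_conflict(vpc_cidr: str, service_cidr: str) -> bool:
    """Return True when the two CIDR ranges share at least one address."""
    vpc = ipaddress.ip_network(vpc_cidr, strict=True)
    svc = ipaddress.ip_network(service_cidr, strict=True)
    return vpc.overlaps(svc)


# Hypothetical ranges, chosen only to show both outcomes.
print(cidrs_conflict("192.168.0.0/16", "172.16.0.0/16"))
print(cidrs_conflict("10.0.0.0/8", "10.96.0.0/12"))
```

With `strict=True`, a range that has host bits set (for example `10.0.0.1/8`) raises `ValueError`, which surfaces typos early rather than silently normalizing them.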

Container Network Configuration

The platform uses Calico on bare‑metal nodes and Terway on cloud nodes. Node labels combined with nodeAffinity rules keep workloads on the appropriate node type. Three issues were resolved:

Missing /opt/cni/bin directory – fixed by changing the daemonset hostPath type to DirectoryOrCreate.

Loopback plugin not found – added an InitContainer to deploy the loopback plugin for Terway.

IP range conflicts – updated the Terway vswitches field with the missing zone’s pod virtual switch information.
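The first fix above can be sketched as a small manifest patch. This is an illustration only: the manifest below is a hypothetical, heavily trimmed DaemonSet, and the helper name is an assumption. The only detail taken from the article is the `/opt/cni/bin` path and the `DirectoryOrCreate` hostPath type, which is a standard Kubernetes hostPath type that creates the directory when it is missing.

```python
import copy
import json


def ensure_dir_created(daemonset: dict, path: str = "/opt/cni/bin") -> dict:
    """Return a copy of the manifest whose hostPath volume for `path`
    uses the DirectoryOrCreate type."""
    patched = copy.deepcopy(daemonset)
    volumes = patched["spec"]["template"]["spec"].get("volumes", [])
    for volume in volumes:
        host_path = volume.get("hostPath")
        if host_path and host_path.get("path") == path:
            host_path["type"] = "DirectoryOrCreate"
    return patched


# Hypothetical manifest; the names are illustrative, not the real CNI DaemonSet.
example_ds = {
    "apiVersion": "apps/v1",
    "kind": "DaemonSet",
    "metadata": {"name": "example-cni"},
    "spec": {"template": {"spec": {"volumes": [
        {"name": "cni-bin", "hostPath": {"path": "/opt/cni/bin", "type": "Directory"}},
        {"name": "cni-conf", "hostPath": {"path": "/etc/cni/net.d"}},
    ]}}},
}

patched = ensure_dir_created(example_ds)
print(json.dumps(patched["spec"]["template"]["spec"]["volumes"], indent=2))
```

Only the volume whose path matches is changed, and the input manifest is left untouched so the patch can be diffed against the original.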

Adding Cloud Nodes

Cloud VMs are provisioned via the internal cloud platform, initialized and joined to the cluster through the VContainer automation platform, then labeled for cloud‑specific scheduling.
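Once nodes carry a cloud‑specific label, a workload can target them with a required nodeAffinity rule. The sketch below builds the `affinity` fragment of a pod spec as a Python dict. The label key `example.com/node-type` and its values are hypothetical, since the article does not name the labels it uses.

```python
import json
from typing import Sequence


def required_node_affinity(label_key: str, allowed_values: Sequence[str]) -> dict:
    """Build a pod-spec `affinity` fragment that restricts scheduling to
    nodes whose label `label_key` has one of `allowed_values`."""
    return {
        "nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [
                    {
                        "matchExpressions": [
                            {
                                "key": label_key,
                                "operator": "In",
                                "values": list(allowed_values),
                            }
                        ]
                    }
                ]
            }
        }
    }


# Hypothetical label; a node would first be labeled with something like
#   kubectl label node <node-name> example.com/node-type=cloud
cloud_only = required_node_affinity("example.com/node-type", ["cloud"])
print(json.dumps(cloud_only, indent=2))
```

The same helper with a different value (for example a bare‑metal label) yields the complementary rule for workloads that must stay on premises.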

Reducing Dedicated Line Load

Monitor cloud VM network usage and coordinate with the network team.

Throttle egress bandwidth on VM eth0 using tc.

Pre‑load training data onto VM data disks to avoid repeated pulls from the on‑premise storage cluster.
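The article says only that egress on eth0 is throttled with tc; it does not give the queueing discipline or the limits. As one possible approach, the sketch below assembles a token bucket filter (tbf) command. The choice of tbf and the rate, burst, and latency values are all assumptions for illustration. The function returns the command without running it, so it can be reviewed before being applied with root privileges.

```python
from typing import List


def tbf_egress_limit_cmd(dev: str, rate_mbit: int,
                         burst_kb: int = 256, latency_ms: int = 50) -> List[str]:
    """Build a `tc` command that caps egress on `dev` with a token bucket filter.

    All numeric defaults are illustrative assumptions, not values from the article.
    """
    if rate_mbit <= 0 or burst_kb <= 0 or latency_ms <= 0:
        raise ValueError("rate, burst, and latency must be positive")
    return [
        "tc", "qdisc", "replace", "dev", dev, "root", "tbf",
        "rate", f"{rate_mbit}mbit",
        "burst", f"{burst_kb}kb",
        "latency", f"{latency_ms}ms",
    ]


cmd = tbf_egress_limit_cmd("eth0", 500)
print(" ".join(cmd))
```

Using `replace` rather than `add` keeps the command idempotent: rerunning it updates the existing root qdisc instead of failing because one is already present.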

Results

When multiple business units required large‑scale GPU compute for deep‑learning training, the hybrid‑cloud solution added dozens of GPU cloud VMs to the cluster, delivering the same user experience as on‑premise resources while significantly lowering costs compared with purchasing physical machines.

Future Work

Enable AI online services to deploy on cloud VMs for burst compute needs.

Establish a streamlined request, release, and renewal workflow to improve cross‑team collaboration.

Measure and assess cloud VM cost and utilization to encourage efficient resource usage.

Automate the entire VM provisioning and cluster‑join process.

Explore advanced cloud features to boost performance of large‑scale distributed training.
