Vivo’s Cloud‑Native Container Practices: High‑Availability, Automation, and Platform Evolution

Vivo’s cloud‑native journey, detailed from its 2018 machine‑learning pilot to a large‑scale container ecosystem, showcases how high‑availability design, automated multi‑cluster operations, CI/CD pipelines, and unified traffic ingress have dramatically improved efficiency, reduced costs, and enabled rapid, scalable AI‑driven services across the business.

vivo Internet Technology

Based on Pan Liangbiao’s talk at the 2022 Vivo Developer Conference, this article summarizes Vivo’s exploration and implementation of cloud‑native container technologies, focusing on high‑availability, automated operations, platform upgrades, and ecosystem integration.

Since 2018, Vivo has built a one‑stop cloud‑native machine‑learning platform on top of containers. It serves as the algorithm middle platform, providing shared data management, model training, and model deployment services for advertising, recommendation, and search. The success of this pilot led to a strategic upgrade toward a large‑scale, cost‑effective, cloud‑native container ecosystem.

1. Container Technology and Cloud‑Native Concepts

Containers have evolved from Unix chroot (1979) through four stages: emergence, rapid growth, commercial exploration, and expansion. Compared with virtual machines, containers offer lower overhead, faster startup, better resource utilization, and superior scalability.

There are two main definitions of cloud native. Pivotal's centers on DevOps, continuous delivery, microservices, and containers; CNCF's is backed by the core projects the foundation hosts, such as Kubernetes and Prometheus. The core technologies are containers, microservices, and service meshes, and the core principles are immutable infrastructure and declarative APIs.
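The two principles work together: a declarative API records the desired state, and a controller repeatedly converges the actual state toward it, replacing (never patching) instances that drift. The sketch below is a toy reconciler under that assumption; `Deployment`, `reconcile`, and `apply_actions` are hypothetical names, not Kubernetes types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Deployment:
    """Hypothetical desired-state record: what should run, not how to get there."""
    name: str
    image: str
    replicas: int


def reconcile(desired: Deployment, running: list[str]) -> tuple[list[int], int]:
    """Compare running instances (one image tag each) with the desired state.

    Returns (indexes of instances to delete, number of instances to create).
    Immutable infrastructure: a stale instance is replaced, never modified in place.
    """
    stale = [i for i, image in enumerate(running) if image != desired.image]
    current = [i for i, image in enumerate(running) if image == desired.image]
    surplus = current[desired.replicas:]  # up-to-date instances beyond the declared count
    to_create = max(desired.replicas - len(current), 0)
    return stale + surplus, to_create


def apply_actions(desired: Deployment, running: list[str]) -> list[str]:
    """Execute one reconciliation pass and return the new set of running images."""
    to_delete, to_create = reconcile(desired, running)
    doomed = set(to_delete)
    kept = [image for i, image in enumerate(running) if i not in doomed]
    return kept + [desired.image] * to_create
```

Because `reconcile` only compares states, running it again after convergence yields no actions, which is what makes declarative APIs safe to retry.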

2. Value Analysis

From efficiency, cost, and quality perspectives, cloud‑native and containers provide:

Efficiency: rapid continuous delivery, portable images, elastic scaling.

Cost: on‑demand resource allocation, high scheduler utilization, reduced fragmentation.

Quality: observability, self‑healing, manageable clusters.
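Elastic scaling is the easiest of these benefits to make concrete. The Kubernetes Horizontal Pod Autoscaler documents its core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue), skipping changes when the ratio is within a tolerance (0.1 by default). The sketch below reproduces that rule; the article does not say which autoscaling mechanism Vivo uses, and the `min_replicas`/`max_replicas` defaults are illustrative assumptions.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float, target_metric: float,
                     min_replicas: int = 1, max_replicas: int = 100,
                     tolerance: float = 0.1) -> int:
    """HPA-style replica calculation for a single metric (e.g. average CPU %)."""
    ratio = current_metric / target_metric
    # Ignore small deviations so the workload does not flap between sizes.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 replicas averaging 90% CPU against a 50% target scale to ceil(4 × 1.8) = 8 replicas.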

3. Vivo’s Container Exploration and Practice

3.1 Pilot Exploration

Starting in 2018, Vivo built a cloud‑native machine‑learning platform on containers, delivering end‑to‑end capabilities for recommendation, advertising, and search. The platform offers five advantages: full‑scenario coverage, short queue times (P99 < 45 min), low cost (CPU utilization > 45%), high efficiency (830 million training samples per hour), and superior results (training success rate > 95%).
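A P99 queue time below 45 minutes means that 99% of training jobs waited at most 45 minutes before starting. The article does not say how the percentile is computed; the sketch below assumes the common nearest‑rank method and uses a hypothetical `percentile_nearest_rank` helper.

```python
import math


def percentile_nearest_rank(samples: list[float], p: float) -> float:
    """Return the smallest sample such that at least p% of samples are <= it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank into the sorted list
    return ordered[rank - 1]
```

Usage: `percentile_nearest_rank(queue_minutes, 99) < 45` would check the stated target against a list of per‑job queue times.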

3.2 Value Mining

Containers helped cut costs by raising CPU utilization from about 25% to the industry‑typical 40‑50%, and improved efficiency in middleware upgrades, migrations, testing, handling traffic spikes, and keeping global deployments consistent.
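Why utilization drives cost: for a fixed CPU demand, the number of hosts needed is inversely proportional to average utilization, so moving from 25% to 45% removes roughly 1 − 25/45 ≈ 44% of the hosts. The sketch below works through that arithmetic; the 2,500‑core demand and 64‑core hosts are hypothetical numbers, not Vivo's data, and it assumes the workload itself stays constant.

```python
import math


def hosts_needed(cpu_demand_cores: float, cores_per_host: int, utilization: float) -> int:
    """Hosts required to serve a fixed CPU demand at a given average utilization."""
    return math.ceil(cpu_demand_cores / (cores_per_host * utilization))


def host_savings(old_utilization: float, new_utilization: float) -> float:
    """Fraction of hosts freed when utilization rises and demand is unchanged."""
    return 1 - old_utilization / new_utilization


before = hosts_needed(2500, 64, 0.25)  # hypothetical demand and host size
after = hosts_needed(2500, 64, 0.45)
print(before, after, f"{host_savings(0.25, 0.45):.0%}")
```

Running it prints `157 87 44%`: the same hypothetical workload fits on 70 fewer hosts.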

3.3 Strategic Upgrade

Vivo upgraded its internal strategy to build a first‑class container ecosystem based on cloud‑native principles, adding unified traffic ingress, container operation platforms, naming services, and monitoring.

3.4 Challenges

Key operational challenges include rapid growth in cluster scale (10k+ hosts and 10k+ instances), standardizing operations, mounting monitoring pressure, and upgrading Kubernetes versions without disruption. Platform challenges include changing container IPs, ecosystem compatibility, entrenched user habits, and quantifying operational benefits.

3.5 Best Practices

3.5.1 High Availability: fault prevention (process tooling, disaster recovery, infrastructure), fault detection (monitoring dashboards, inspections), and fault recovery (playbooks, post‑mortems).
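Routine inspections catch problems before they become incidents. As an illustration, the sketch below checks node records against the expected values of real Kubernetes node condition types (`Ready` should be `True`; `MemoryPressure`, `DiskPressure`, and `PIDPressure` should be `False`). The trimmed input shape and the `inspect_nodes` helper are assumptions for the example; a real Node object nests these under `metadata.name` and `status.conditions`.

```python
# Expected status per Kubernetes node condition type on a healthy node.
EXPECTED = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
}


def inspect_nodes(nodes: list[dict]) -> dict[str, list[str]]:
    """Return {node_name: [problems]} for every node that fails the inspection.

    Each node is a trimmed dict: {"name": str, "conditions": [{"type", "status"}]}.
    A missing condition is reported as "Unknown".
    """
    findings = {}
    for node in nodes:
        statuses = {c["type"]: c["status"] for c in node.get("conditions", [])}
        problems = [
            f"{ctype}={statuses.get(ctype, 'Unknown')} (expected {want})"
            for ctype, want in EXPECTED.items()
            if statuses.get(ctype, "Unknown") != want
        ]
        if problems:
            findings[node["name"]] = problems
    return findings
```

The findings dict can feed a dashboard or alert, which links the detection step back to the recovery playbooks.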

3.5.2 Automated Operations: a multi‑cluster management platform with standardized configuration, GUI‑based ("white‑screen") operations in place of command‑line access, and audit logs.
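Standardized configuration and audit logs fit naturally together: every change that brings a cluster back to the baseline is recorded with who made it and what changed. The sketch below assumes a hypothetical `BASELINE` dict and in‑memory cluster configs; the keys and values are illustrative, not Vivo's actual settings.

```python
from datetime import datetime, timezone

# Hypothetical configuration baseline shared by all clusters.
BASELINE = {"container_runtime": "containerd", "max_pods": 110, "log_retention_days": 7}


def drift(config: dict) -> dict:
    """Return {key: (current_value, baseline_value)} for every non-compliant key."""
    return {k: (config.get(k), v) for k, v in BASELINE.items() if config.get(k) != v}


def standardize(clusters: dict[str, dict], operator: str, audit_log: list[dict]) -> list[dict]:
    """Bring every cluster to BASELINE, appending one audit record per change."""
    for name, config in clusters.items():
        for key, (old, new) in drift(config).items():  # drift is computed before mutating
            config[key] = new
            audit_log.append({
                "time": datetime.now(timezone.utc).isoformat(),
                "operator": operator,
                "cluster": name,
                "key": key,
                "old": old,
                "new": new,
            })
    return audit_log
```

Because compliant keys produce no drift, repeated runs are idempotent and the audit log only grows when something actually changes.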

3.5.3 Architecture Upgrade: a four‑layer architecture consisting of the container + Kubernetes base, IaaS integration, platform services (online, middleware, big data, AI training), and business enablement.

3.5.4 Capability Enhancements: OpenKruise workload extensions, lossless service releases, Harbor image security, Dragonfly2 image acceleration, fixed‑IP support, and Karmada multi‑cluster management.
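A lossless release depends on ordering: an instance must leave the load‑balancing pool before it stops, then finish its in‑flight requests. In Kubernetes this is usually built from readiness probes, `preStop` hooks, and `terminationGracePeriodSeconds`; the article does not describe Vivo's exact mechanism. The sketch below is a hypothetical toy model of that ordering in one process.

```python
import threading
import time


class GracefulServer:
    """Toy instance that shuts down without dropping in-flight requests."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._idle = threading.Condition(self._lock)
        self._in_flight = 0
        self.registered = True  # visible to service discovery / load balancer
        self.accepting = True

    def begin_request(self) -> bool:
        with self._lock:
            if not self.accepting:
                return False  # caller should retry on another instance
            self._in_flight += 1
            return True

    def end_request(self) -> None:
        with self._lock:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._idle.notify_all()

    def shutdown(self, deregister_delay: float = 0.0, drain_timeout: float = 5.0) -> bool:
        """Return True if all in-flight requests finished before the timeout."""
        # 1. Leave the pool first so no new traffic is routed here.
        self.registered = False
        # 2. Allow time for routing changes to propagate.
        time.sleep(deregister_delay)
        # 3. Stop accepting, then wait (lock released while waiting) for the drain.
        with self._lock:
            self.accepting = False
            return self._idle.wait_for(lambda: self._in_flight == 0, timeout=drain_timeout)
```

If the drain times out, a real system would log the dropped requests; the grace period should exceed the expected longest request.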

3.5.5 CI/CD Integration: a Jenkins + Spinnaker pipeline covering code checkout, build, security scanning, image push, and API‑driven deployment.

3.5.6 Unified Traffic Ingress: migration from Nginx to APISIX to handle the massive traffic and frequent IP churn that containerization brings.
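IP churn is why a dynamic gateway helps: when pods restart with new IPs, the upstream node list must be updated without reloading the gateway. APISIX upstream objects accept a `nodes` map of `"host:port": weight` and can be updated through the Admin API (for example `PUT /apisix/admin/upstreams/{id}`). The sketch below only builds and compares that body from a list of ready pod endpoints; how Vivo syncs endpoints is not described, and `build_upstream`/`upstream_changed` are hypothetical helpers.

```python
def build_upstream(endpoints, lb_type: str = "roundrobin", weight: int = 1) -> dict:
    """Build an APISIX-style upstream body from (ip, port) pairs of ready pods."""
    nodes = {f"{ip}:{port}": weight for ip, port in sorted(set(endpoints))}
    if not nodes:
        # Publishing an empty upstream would blackhole all traffic.
        raise ValueError("refusing to publish an empty upstream")
    return {"type": lb_type, "nodes": nodes}


def upstream_changed(old: dict, new: dict) -> bool:
    """Only push an update when the node set or weights actually differ."""
    return old.get("nodes") != new.get("nodes")
```

Comparing before pushing keeps Admin API traffic proportional to real endpoint changes rather than to the sync interval.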

3.6 Outcomes

The product capability matrix now spans four layers (basic services, core capabilities, platform CI/CD, and the business layer) and supports 600+ online services, 500+ algorithm services, 20+ big‑data clusters, and extensive AI training workloads.

3.7 Summary

The reflections span four dimensions: finding value, defining strategy, building platforms, and seeking breakthroughs. The overarching message is that technology should serve the business through cost‑optimal, efficient solutions.

4. Future Outlook

Vivo envisions three directions: full containerization, embracing cloud native, and co‑locating online and offline workloads. The goal is “write once, run everywhere”, with extreme efficiency and cost‑optimal operations.
