How Kaola Achieved Rapid Cloud‑Native Migration: Strategies, Challenges, and Lessons
This article details Kaola's cloud‑native transformation from 2019 to 2024, covering product integration, permission and messaging schemes, RPC migration, SchedulerX scaling, environment isolation, high‑availability components, Infrastructure‑as‑Code practices, automation strategies, and the measurable performance and cost benefits realized.
Background
Kaola’s cloud‑native migration started in October 2019 with a four‑month deadline. Because speed was the primary goal, adopting existing cloud‑native products directly, rather than rebuilding equivalents in‑house, was the most practical path.
Cloud‑Native Definition
Cloud‑native is a technology stack and methodology that embraces containers, continuous delivery, orchestration, open‑source, and micro‑service principles. It requires changes in architecture, development, and operations to produce portable applications for public, private, or hybrid clouds.
Alibaba Cloud provides native middleware such as RocketMQ, Kafka, ARMS, MSE, AHAS, PTS, and Function Compute, which form the foundation for Kaola’s migration.
Migration Phases
Phase 1 (Oct 2019 – Mar 2020) : Integrated databases, Redis, and ASI with minimal code changes.
Phase 2 : Adopted MSE for one‑stop micro‑service governance, eliminating frequent crashes.
Phase 3 : Aligned with group platforms to support major sales events (e.g., Double 11).
Access Strategy
Permission Scheme
Initially, all developers shared a single RAM sub‑account for simplicity, which left permissions coarse‑grained. To tighten security, Kaola issued RAM‑based STS tokens, attached an additional policy to each token on a per‑user basis, and integrated the tokens with SchedulerX via RoleSessionName to enforce per‑application access.
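The per‑user token idea can be sketched as a request builder. This is a minimal sketch, not Kaola's implementation: the parameter names (RoleArn, RoleSessionName, Policy, DurationSeconds) follow the STS AssumeRole API, but the "user@app" session‑name encoding, the action name, the resource pattern, and the character rule checked by the regular expression are assumptions made for illustration. No network call is made.

```python
import json
import re

# Assumed RoleSessionName rule: 2-64 characters from letters, digits, and . @ - _
_SESSION_NAME = re.compile(r"^[A-Za-z0-9.@\-_]{2,64}$")


def build_assume_role_request(role_arn, user, app, actions):
    """Build AssumeRole parameters scoped to one user and one application."""
    # Hypothetical encoding: carry both user and application in the session name
    # so a downstream service can tell who is calling and for which application.
    session_name = f"{user}@{app}"
    if not _SESSION_NAME.match(session_name):
        raise ValueError(f"invalid RoleSessionName: {session_name!r}")

    # Session policy that narrows the role's permissions for this token.
    # The resource pattern below is a placeholder, not a real service ARN.
    policy = {
        "Version": "1",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": sorted(actions),
                "Resource": f"acs:example:*:*:app/{app}",
            }
        ],
    }
    return {
        "RoleArn": role_arn,
        "RoleSessionName": session_name,
        "Policy": json.dumps(policy, separators=(",", ":")),
        "DurationSeconds": 3600,
    }
```

A caller would pass the returned dictionary to whichever STS client it uses. Keeping the builder free of SDK calls makes the scoping logic easy to unit test.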
Message Scheme
Kaola migrated its existing Kafka and RabbitMQ ecosystem to RocketMQ, which offered drop‑in compatibility, higher performance, message tracing, and message query capabilities.
RPC Scheme
Kaola replaced its custom Dubbo branch (Dubbok) and Nvwa registry with the group’s HSF + ConfigServer stack, using Alibaba Cloud EDAS extensions for Dubbo compatibility. After a month of functional testing and performance tuning, the new stack handled Double 11 traffic without issues.
SchedulerX Scheme
To migrate >13 000 scheduled tasks from the legacy kschedule platform, Kaola built a synchronization tool and a cloud‑native control platform to transfer task definitions and permissions, automating the migration and reducing manual effort.
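A synchronization tool of this kind typically needs to be safe to rerun. The sketch below shows one way to plan an idempotent sync; the LegacyTask fields and the target job shape are hypothetical and do not reflect the actual kschedule or SchedulerX schemas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegacyTask:
    """Hypothetical shape of a task exported from the legacy platform."""
    task_id: int
    app: str
    cron: str
    handler: str
    owners: tuple


def to_target_job(task):
    """Map a legacy task to a hypothetical target job definition."""
    return {
        "name": f"{task.app}.{task.handler}",
        "app": task.app,
        "cron": task.cron,
        "handler": task.handler,
        "owners": sorted(task.owners),  # permissions travel with the task
        "legacy_id": task.task_id,      # stable key that makes reruns idempotent
    }


def plan_sync(legacy_tasks, existing_jobs):
    """Return (to_create, to_update); jobs that already match are skipped."""
    existing = {job["legacy_id"]: job for job in existing_jobs}
    to_create, to_update = [], []
    for task in legacy_tasks:
        job = to_target_job(task)
        current = existing.get(task.task_id)
        if current is None:
            to_create.append(job)
        elif current != job:
            to_update.append(job)
    return to_create, to_update
```

Running plan_sync a second time against the jobs it just created yields two empty lists, so the tool can be rerun after a partial failure without duplicating tasks.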
Environment Isolation
Kaola used a “main‑branch + project‑environment” routing strategy, later mirrored by Alibaba Cloud SCM routing. SCM plugins were ported to Dubbok, enabling seamless environment switching without code changes.
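The routing rule behind a "main‑branch + project‑environment" setup can be sketched as follows: a request tagged with a project environment prefers instances deployed in that environment and falls back to the shared main‑branch instances for every service the project did not redeploy. The Instance type, the "main" tag, and the function name are assumptions for illustration, not the SCM or Dubbok API.

```python
from dataclasses import dataclass

BASELINE = "main"  # assumed tag for the shared main-branch environment


@dataclass(frozen=True)
class Instance:
    service: str
    address: str
    env: str


def pick_instances(instances, service, request_env):
    """Prefer instances in the request's project environment, else the baseline."""
    candidates = [i for i in instances if i.service == service]
    tagged = [i for i in candidates if i.env == request_env]
    if tagged:
        return tagged
    # The project did not deploy this service, so reuse the main-branch copy.
    return [i for i in candidates if i.env == BASELINE]
```

With this rule, a project environment only needs to deploy the services it changes, and the environment tag must be propagated along the call chain so that downstream hops make the same decision.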
High‑Availability Components
AHAS replaced the legacy NFC component. It can be integrated through annotations, APIs, and filters, or injected via JavaAgent with no code changes. AHAS added system‑load protection, circuit breaking, real‑time monitoring, and DingTalk alerts.
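For readers unfamiliar with circuit breaking, the following is a generic, minimal sketch of the pattern. It is not the AHAS API; the class name, thresholds, and the injectable clock are assumptions chosen to keep the example small and testable.

```python
import time


class CircuitBreaker:
    """Count-based breaker: open after N consecutive failures, retry after a delay."""

    def __init__(self, max_failures=3, reset_after=5.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()  # open: fail fast without touching the dependency
            # Half-open: let one trial call through. The failure count is kept,
            # so a single failure reopens the circuit immediately.
            self.opened_at = None
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0  # any success closes the circuit fully
        return result
```

Production components usually trip on error ratio or slow‑call ratio over a sliding window rather than on a simple consecutive‑failure count.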
Infrastructure as Code (IaC)
Key Practices
Build‑Deploy System : Integrated AppStack and IaC with GitOps; all build, deployment, and static configuration are stored in source repositories for version consistency.
Lightweight Containers : Refactored container images to match group standards, separating admin and application containers and adopting lightweight containers for better isolation.
CPU‑Share Mode : Switched from CPU‑set (fixed CPU binding) to CPU‑share, allowing containers to use any CPU within the same NUMA node, improving stability and utilization.
Image‑Config Separation : Decoupled container images from static and release configurations, enabling image reuse across environments and simplifying rollbacks.
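To illustrate why the CPU‑share item in the list above can improve utilization, here is a simplified model of share‑based scheduling. It is a weighted max‑min fair split, which approximates how relative CPU weights behave under contention; it is not the kernel scheduler, and the container names and numbers are made up.

```python
def allocate(total_cores, containers):
    """Split cores by weight, never giving a container more than it demands.

    containers maps name -> (shares, demand_cores).
    """
    alloc = {name: 0.0 for name in containers}
    active = dict(containers)
    remaining = float(total_cores)
    while active and remaining > 1e-9:
        total_shares = sum(shares for shares, _ in active.values())
        # Containers whose remaining demand fits inside their weighted share.
        satisfied = {}
        for name, (shares, demand) in active.items():
            fair = remaining * shares / total_shares
            need = demand - alloc[name]
            if need <= fair:
                satisfied[name] = need
        if not satisfied:
            # Everyone wants more than their share: split strictly by weight.
            for name, (shares, _) in active.items():
                alloc[name] += remaining * shares / total_shares
            break
        # Hand satisfied containers what they need; leftovers go to the rest.
        for name, need in satisfied.items():
            alloc[name] += need
            remaining -= need
            del active[name]
    return alloc
```

On 8 cores with two equally weighted containers demanding 6 and 1 cores, this model gives the busy container all 6 cores it asks for, whereas a fixed 4 + 4 CPU‑set binding would cap it at 4 while 3 cores sit idle. When both containers are saturated, each still receives 4 cores.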
Implementation Strategy
Automation : Used service.cue templates to generate configuration and environment definitions, reducing migration time to ~1 minute per application.
Support Model : Early phases employed a dedicated on‑site support team with daily training; later phases shifted to a self‑service model with API‑driven pipeline creation.
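The template‑driven automation above can be approximated as a shared base definition plus small per‑environment overrides. The sketch below imitates that idea with plain dictionary merging; it is not CUE, and the field names, registry host, and environment names are hypothetical.

```python
import copy


def deep_merge(base, override):
    """Return a new dict where override wins, merging nested dicts recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render_environments(app, template, overrides):
    """Generate one full definition per environment from a shared template."""
    rendered = {}
    for env, override in overrides.items():
        definition = deep_merge(template, override)
        definition["app"] = app
        definition["env"] = env
        rendered[env] = definition
    return rendered


# Hypothetical template: the image is shared, only configuration varies by environment.
TEMPLATE = {
    "image": "registry.example.com/base-app:1.0",
    "resources": {"cpu": 2, "memory_gb": 4},
    "replicas": 1,
}
OVERRIDES = {
    "test": {},
    "prod": {"replicas": 8, "resources": {"cpu": 4}},
}
```

Calling render_environments("order-service", TEMPLATE, OVERRIDES) yields a "test" definition identical to the template and a "prod" definition that changes only the replica count and CPU. The image stays the same across environments, which is the point of the image and configuration separation described earlier.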
Results
No major incidents during peak events (618, Double 11) despite extensive migration.
Full alignment with group deployment standards, achieving seamless inter‑service communication.
Stateful container deployments saved ~100 seconds per batch; CPU‑share reduced CPU usage to <55 % during peak load.
Decommissioned 250 servers; capacity scaling for major events completed in <0.5 person‑day.
Enhanced cloud product capabilities and resolved security/accounting issues.
Future Direction
Kaola plans to explore Service Mesh to further abstract distributed system complexities, accepting the associated performance overhead to achieve language‑agnostic, infrastructure‑transparent services.
Alibaba Cloud Native
