How Qunar Mastered Cloud‑Native Containerization: Lessons, Challenges, and Solutions
This article chronicles Qunar's multi‑year journey to adopt cloud‑native containerization, detailing the timeline, architectural redesign, CI/CD overhaul, middleware adaptations, migration strategies, encountered issues, and future plans for stability, resource efficiency, and service‑mesh implementation.
Background
In recent years, cloud‑native and container technologies have matured and gained broad adoption. Qunar began containerizing in earnest at the end of 2020 to leverage the scalability, elasticity, portability, and resilience of cloud‑native architectures.
Qunar Containerization Development Timeline
2014‑2015: Teams experimented with Docker and Docker‑Compose for integration environments, but limited orchestration prevented wider adoption.
2015‑2017: Ops migrated Elasticsearch clusters from Mesos to Kubernetes, improving operational efficiency.
2018‑2019: MySQL containers were introduced; Docker on host allowed a MySQL instance to be provisioned in about 10 seconds.
2020‑2021: Over 300 P1/P2 applications were containerized, with a goal to complete all services by the end of 2021.
Implementation and Practices
The containerization effort required adaptations across the Portal platform, middleware, ops infrastructure and monitoring. The resulting architecture matrix includes:
Portal: PaaS entry point providing CI/CD, resource management, self‑service operations and application authorization.
Ops tools: watcher, bistoury (Java online debug), qtrace, Loki/ELK for observability.
Middleware: MQ, configuration center, distributed scheduler (qschedule), Dubbo, MySQL SDK, etc.
Virtualized clusters: underlying Kubernetes and OpenStack.
Noah: test‑environment management platform supporting mixed KVM/container deployment.
CI/CD Process Refactoring
Key changes include consolidating runtime configuration, whitelist and release parameters into a unified declarative application profile, centralizing authorization through a single entry point, and adopting KubeSphere as the multi‑cluster solution.
Application profile: unified declarative configuration for container releases (a minimal sketch follows this list).
Authorization system: automated, single‑point authorization.
K8s multi‑cluster: KubeSphere selected after performance evaluation.
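To make the "application profile" idea concrete, here is a minimal sketch of what such a unified declarative profile might look like, expressed as Go types with YAML tags. Every field name here is an illustrative assumption, not Qunar's actual schema:

```go
// Package profile sketches a unified declarative application profile
// that consolidates runtime, whitelist, and release settings which were
// previously scattered across separate systems.
package profile

// AppProfile is a hypothetical shape for the per-application profile.
type AppProfile struct {
	AppCode string `yaml:"appCode"` // application identifier
	Runtime struct {
		Image  string            `yaml:"image"`  // container base image
		CPU    string            `yaml:"cpu"`    // e.g. "2"
		Memory string            `yaml:"memory"` // e.g. "4Gi"
		Env    map[string]string `yaml:"env"`    // runtime environment variables
	} `yaml:"runtime"`
	Whitelist []string `yaml:"whitelist"` // IP/port whitelist entries
	Release   struct {
		Replicas  int    `yaml:"replicas"`  // desired pod count
		HealthURL string `yaml:"healthUrl"` // readiness-check endpoint
		Strategy  string `yaml:"strategy"`  // e.g. "RollingUpdate"
	} `yaml:"release"`
}
```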
Middleware Adaptation
Because container IPs change frequently, all public components and middleware were modified to accept dynamic addresses.
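On the client side, this implies moving from a cached static address list to a subscription that replaces the list on every registry event. A minimal sketch in Go, where the Registry interface and its Subscribe method are illustrative assumptions rather than Qunar's actual middleware SDK:

```go
// Package discovery sketches the registry-subscription pattern the
// middleware changes imply: clients never cache a fixed IP list.
package discovery

import "sync"

// Registry pushes address-change events for a service (assumed API).
type Registry interface {
	Subscribe(service string, onChange func(addrs []string))
}

// DynamicEndpoints keeps the provider list current as container IPs
// come and go.
type DynamicEndpoints struct {
	mu    sync.RWMutex
	addrs []string
}

// Watch subscribes to the registry and replaces the address list on
// every change, instead of patching a static configuration.
func Watch(r Registry, service string) *DynamicEndpoints {
	d := &DynamicEndpoints{}
	r.Subscribe(service, func(addrs []string) {
		d.mu.Lock()
		d.addrs = append([]string(nil), addrs...) // replace wholesale
		d.mu.Unlock()
	})
	return d
}

// Addrs returns a snapshot of the currently registered provider IPs.
func (d *DynamicEndpoints) Addrs() []string {
	d.mu.RLock()
	defer d.mu.RUnlock()
	return append([]string(nil), d.addrs...)
}
```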
Application Smooth Migration Design
Guidelines and automated tests were defined to ensure stateless applications without post_offline hooks or warm‑up steps could be migrated quickly.
Pre‑conditions: applications must be stateless and have neither post_offline hooks nor warm‑up URLs (see the eligibility sketch after this list).
Test‑environment validation: automatic SDK upgrade and pom file modification during compilation, with failure notifications.
Online validation: initial release without traffic, followed by automated testing before traffic is enabled.
Mixed KVM/container deployment: keep both alive for a period, then phase out KVM after verification.
Full online release: decommission KVM after confirming service stability.
Observation: monitor for a period before reclaiming KVM resources.
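As referenced in the pre‑conditions item above, here is a minimal sketch of the kind of eligibility gate such guidelines imply. The AppMeta fields are assumptions about what the deployment platform records per application:

```go
// Package migration sketches the fast-path eligibility check.
package migration

import "fmt"

// AppMeta is a hypothetical record of per-app deployment metadata.
type AppMeta struct {
	Stateless       bool   // app keeps no local state
	PostOfflineHook string // non-empty if a post_offline hook is configured
	WarmupURL       string // non-empty if a warm-up URL is configured
}

// CanMigrate returns nil when an app satisfies the fast-path
// containerization pre-conditions, or an explanatory error otherwise.
func CanMigrate(m AppMeta) error {
	if !m.Stateless {
		return fmt.Errorf("app is stateful; migrate manually")
	}
	if m.PostOfflineHook != "" {
		return fmt.Errorf("post_offline hook %q must be removed first", m.PostOfflineHook)
	}
	if m.WarmupURL != "" {
		return fmt.Errorf("warm-up URL %q is not supported on the fast path", m.WarmupURL)
	}
	return nil
}
```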
Issues Encountered During Containerization
Compatibility with legacy KVM hooks
Kubernetes natively provides only postStart and preStop hooks, which could not express the custom preStart and preOnline hooks that the existing KVM workflows required.
preStart hook: injected into the container entrypoint to execute user‑defined scripts.
preOnline hook: implemented via a postStart hook that polls health status and runs the user script once the checkurl passes.
Pod composition was redesigned accordingly.
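A minimal sketch of this pod design using client-go types: preStart runs inside the injected entrypoint before the application process, and preOnline is driven from a postStart hook that waits for the health check to pass. Script paths, ports, and the image are illustrative assumptions (and, as the next issue shows, the long-running postStart hook was later reworked):

```go
// Package hooks sketches how the two KVM hooks can be emulated on
// Kubernetes, per the design described above.
package hooks

import corev1 "k8s.io/api/core/v1"

func appContainer() corev1.Container {
	return corev1.Container{
		Name:  "app",
		Image: "registry.example.com/app:latest", // hypothetical image
		// preStart: the injected entrypoint runs the user script
		// before exec-ing the application process.
		Command: []string{"/bin/sh", "-c",
			"/hooks/pre_start.sh && exec /app/start.sh"},
		Lifecycle: &corev1.Lifecycle{
			// preOnline: poll the health URL, then run the user script
			// once the instance reports healthy.
			PostStart: &corev1.LifecycleHandler{
				Exec: &corev1.ExecAction{
					Command: []string{"/bin/sh", "-c",
						"until wget -qO- http://127.0.0.1:8080/checkurl; do sleep 2; done; /hooks/pre_online.sh"},
				},
			},
		},
	}
}
```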
Missing real‑time stdout/stderr during deployment
The K8s API did not expose stdout/stderr in real time while the long-running postStart hook executed, so deployments appeared to hang and timed out. Removing the long-running logic from the postStart hook and moving it into a sidecar that shares a volume with the app container restored real-time visibility.
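A minimal sketch of the reworked pod: deploy output is written to a shared emptyDir volume and a sidecar streams it to stdout, so the normal K8s log API sees it in real time. Container names, images, and paths are illustrative assumptions:

```go
// Package hooks sketches the shared-volume log sidecar pattern.
package hooks

import corev1 "k8s.io/api/core/v1"

func podSpecWithLogSidecar() corev1.PodSpec {
	shared := corev1.VolumeMount{Name: "deploy-logs", MountPath: "/var/deploy"}
	return corev1.PodSpec{
		Volumes: []corev1.Volume{{
			Name:         "deploy-logs",
			VolumeSource: corev1.VolumeSource{EmptyDir: &corev1.EmptyDirVolumeSource{}},
		}},
		Containers: []corev1.Container{
			{
				Name:  "app",
				Image: "registry.example.com/app:latest", // hypothetical
				// The deploy/startup script writes its progress to the
				// shared volume instead of blocking inside postStart.
				Command:      []string{"/bin/sh", "-c", "/app/start.sh >> /var/deploy/deploy.log 2>&1"},
				VolumeMounts: []corev1.VolumeMount{shared},
			},
			{
				Name:  "deploy-log-sidecar",
				Image: "busybox",
				// Streaming the file to stdout makes it visible through
				// kubectl logs / the K8s log API in real time.
				Command:      []string{"/bin/sh", "-c", "touch /var/deploy/deploy.log && tail -F /var/deploy/deploy.log"},
				VolumeMounts: []corev1.VolumeMount{shared},
			},
		},
	}
}
```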
Concurrent image pull timeouts
When more than 50 pods pulled images from Harbor simultaneously, some pulls timed out because of Harbor's throughput limits. A P2P distribution layer (Dragonfly in front of Harbor) was adopted to spread the pull load.
Authorization service overload
Highly concurrent ACL authorization calls during pod initialization overwhelmed the authorization service. Mitigations included capping init-container retries, rate-limiting per application and per IP, and whitelisting commonly used ports.
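A minimal sketch of per-app/per-IP rate limiting on the authorization-service side, using golang.org/x/time/rate; the key scheme and limit values are illustrative assumptions, not Qunar's actual configuration:

```go
// Package acl sketches a keyed token-bucket limiter for authorization
// calls issued by pods during initialization.
package acl

import (
	"sync"

	"golang.org/x/time/rate"
)

// Limiter holds one token bucket per app+IP pair.
type Limiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit
	burst    int
}

func NewLimiter(rps float64, burst int) *Limiter {
	return &Limiter{limiters: map[string]*rate.Limiter{}, rps: rate.Limit(rps), burst: burst}
}

// Allow reports whether one more authorization call is admitted for the
// given app+IP pair; denied pods should back off and retry rather than
// hammering the service during init.
func (l *Limiter) Allow(appCode, ip string) bool {
	key := appCode + "/" + ip
	l.mu.Lock()
	lim, ok := l.limiters[key]
	if !ok {
		lim = rate.NewLimiter(l.rps, l.burst)
		l.limiters[key] = lim
	}
	l.mu.Unlock()
	return lim.Allow()
}
```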
Java Debug in Container Environments
Developers trigger remote debugging via the Noah platform, which injects JVM debug flags and restarts the application. In containers, however, the default liveness probe killed the pod whenever the JVM was suspended at a breakpoint. The fix was twofold: adjust the probe expression to tolerate an active debug session, and expose the debug port through a socat sidecar.
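A minimal sketch of the two-part fix using client-go types: the liveness probe skips the health check while a debug session is active, and a socat sidecar exposes the JVM debug port. The sentinel file, ports, and images are illustrative assumptions:

```go
// Package debug sketches a pod layout that survives JVM breakpoints.
package debug

import corev1 "k8s.io/api/core/v1"

func debugPodSpec() corev1.PodSpec {
	return corev1.PodSpec{
		Containers: []corev1.Container{
			{
				Name:  "app",
				Image: "registry.example.com/app:latest", // hypothetical
				LivenessProbe: &corev1.Probe{
					ProbeHandler: corev1.ProbeHandler{
						Exec: &corev1.ExecAction{
							// Skip the health check while a debug session
							// is active so a paused JVM is not restarted.
							Command: []string{"/bin/sh", "-c",
								"test -f /tmp/debugging || wget -qO- http://127.0.0.1:8080/checkurl"},
						},
					},
					PeriodSeconds:    10,
					FailureThreshold: 3,
				},
			},
			{
				Name:  "debug-proxy",
				Image: "alpine/socat",
				// The JVM binds its debug port (5005) to localhost only;
				// the sidecar exposes it externally on 15005.
				Command: []string{"socat",
					"TCP-LISTEN:15005,fork,reuseaddr", "TCP:127.0.0.1:5005"},
			},
		},
	}
}
```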
Future Outlook
Improve observability coverage to enhance APM and troubleshooting efficiency.
Apply chaos engineering to discover and eliminate stability blind spots.
Increase resource utilization through elastic scaling based on business metrics and intelligent request adjustments.
Deploy a service‑mesh built on Istio and MOSN to further increase infrastructure agility.
