Design and Implementation of Ctrip's Virtual Cloud Desktop System Based on OpenStack
This article presents Ctrip's deployment of a virtual cloud desktop system for its call center, detailing the OpenStack‑based architecture, advantages over traditional PCs, challenges encountered, the evolution to a decoupled design, resource over‑commit strategies, networking issues, and the operational tools and automated testing that ensure stability.
Ctrip's call center serves tens of thousands of agents 24/7, and to improve operational efficiency it replaced traditional desktop PCs with a virtual cloud desktop solution. The solution uses thin clients that connect to virtual machines managed by OpenStack, with QEMU/KVM virtualization and a customized SPICE protocol.
The cloud desktop brings several benefits: rapid automated provisioning (a VM can be delivered in five minutes), faster fault handling (issues resolved within five minutes), centralized management via an automation portal, lower power consumption, and overall reduced carbon footprint.
Initially, the architecture tightly coupled business logic with OpenStack Nova, using Keystone for authentication and Horizon for management. This tight coupling imposed several limitations: OpenStack upgrades were difficult, every desktop user required a Keystone account, and the system depended on third‑party remote desktop protocols.
To overcome these issues, a new architecture was introduced that decouples business logic from OpenStack. A VMPool maintains pools of pre‑configured VM specifications, while an Allocator service matches user requests to available VMs based on LDAP‑derived user attributes. Management moved to a custom IT‑operations portal, replacing Horizon.
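The pool-plus-allocator idea can be sketched as follows. This is a minimal illustration, not Ctrip's actual code: the names `VMPool`, `Allocator`, and `SPEC_BY_DEPARTMENT`, and the mapping from an LDAP-derived department to a VM specification, are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from collections import deque

@dataclass
class VMPool:
    """A pool of pre-built, ready-to-assign VMs of one specification."""
    spec: str                       # e.g. "2c4g-win7" (illustrative name)
    free: deque = field(default_factory=deque)

    def take(self):
        # Hand out the oldest pre-built VM, or None if the pool ran dry.
        return self.free.popleft() if self.free else None

# Hypothetical mapping derived from LDAP user attributes.
SPEC_BY_DEPARTMENT = {
    "call-center": "2c4g-win7",
    "back-office": "4c8g-win10",
}

class Allocator:
    """Matches a user's department to a pool and pulls a free VM from it."""
    def __init__(self, pools):
        self.pools = {p.spec: p for p in pools}

    def allocate(self, department):
        spec = SPEC_BY_DEPARTMENT.get(department)
        pool = self.pools.get(spec)
        return pool.take() if pool else None

pool = VMPool("2c4g-win7", deque(["vm-101", "vm-102"]))
alloc = Allocator([pool])
print(alloc.allocate("call-center"))  # vm-101
```

Because pools are filled ahead of demand, an allocation is just a dictionary lookup and a dequeue, which is what makes five-minute delivery feasible.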
During large‑scale deployment, Ctrip selected stable versions of KVM, QEMU, OpenVSwitch, kernel, and libvirt after extensive 24/7 automated testing. Resource over‑commit ratios were tuned (memory ~1:1.2, CPU up to 1:2, careful I/O limits) to avoid OOM crashes and performance degradation.
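The capacity arithmetic behind those ratios is straightforward. The sketch below is illustrative only: the host figures and the flavor are made-up examples, and the function name `max_vms` is an assumption, but the ratios (memory ~1:1.2, CPU up to 1:2) come from the text.

```python
def max_vms(host_cores, host_mem_gb, vm_vcpus, vm_mem_gb,
            cpu_ratio=2.0, mem_ratio=1.2):
    """How many VMs of one flavor fit on a host under over-commit ratios."""
    by_cpu = int(host_cores * cpu_ratio // vm_vcpus)
    by_mem = int(host_mem_gb * mem_ratio // vm_mem_gb)
    # The binding constraint wins; exceeding the memory bound risks OOM kills.
    return min(by_cpu, by_mem)

# Example host: 32 physical cores, 256 GB RAM; flavor: 2 vCPU / 4 GB.
print(max_vms(32, 256, 2, 4))  # 32 (CPU-bound: 64/2 vCPUs vs 76 by memory)
```

Keeping the memory ratio conservative (1:1.2 rather than the 1:2 used for CPU) reflects the point in the text: CPU contention degrades gracefully, while memory exhaustion crashes VMs outright.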
Network challenges included multiple dnsmasq instances causing DHCP lease renewal failures, ordering problems between libvirt and OpenVSwitch on host reboot leading to VM network loss, and RabbitMQ long‑connection drops mitigated by enabling TCP keep‑alive.
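The keep-alive mitigation for dropped long-lived RabbitMQ connections can be shown at the socket level. This is a generic sketch, not the article's actual configuration: the timer values are examples, and the `TCP_KEEPIDLE`/`TCP_KEEPINTVL`/`TCP_KEEPCNT` options are Linux-specific, hence the `hasattr` guards.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, probes=5):
    """Turn on TCP keep-alive so idle AMQP connections send probes
    instead of being silently dropped by middleboxes."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):    # seconds idle before first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):     # failed probes before giving up
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))  # non-zero when set
```

In practice the same effect is usually achieved by setting the client library's heartbeat option or the kernel's `net.ipv4.tcp_keepalive_*` sysctls; the socket options above are the mechanism both rely on.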
Operational stability is reinforced by a dual‑SaltStack based maintenance system, a portal for visual monitoring, automated software installation, asset tracking for thin clients, and extensive business‑level monitoring (e.g., active user input events). Automated testing runs 24/7 in a dedicated lab, supplemented by CI‑driven unit and integration tests.
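The "active user input" business check mentioned above can be sketched as a simple staleness scan. Everything here is hypothetical: the threshold, the host names, and the function `flag_stale_sessions` are made up for illustration; the real system presumably feeds such events into its monitoring portal.

```python
import time

IDLE_THRESHOLD_S = 30 * 60  # example threshold: 30 minutes with no input

def flag_stale_sessions(last_input_ts, now=None):
    """Return desktops whose last keyboard/mouse event is older than the
    threshold while a session is open, i.e. candidates for investigation."""
    now = now if now is not None else time.time()
    return [host for host, ts in last_input_ts.items()
            if now - ts > IDLE_THRESHOLD_S]

now = 1_000_000
events = {"tc-01": now - 100, "tc-02": now - 3600}
print(flag_stale_sessions(events, now))  # ['tc-02']
```

Monitoring input events rather than VM liveness catches a failure mode that host-level checks miss: a VM that is up but whose SPICE session has silently frozen for the agent.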
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.