
Adopting a Shrink‑Then‑Expand Deployment Model to Improve Release Efficiency in a Large‑Scale Travel Platform

This article analyzes the release‑time bottlenecks of a core travel platform after the post‑COVID traffic surge and presents a shrink‑then‑expand deployment strategy combined with physical‑machine container deployment, evaluating several open‑source solutions and demonstrating significant improvements in release speed, resource cost, and system stability.

Qunar Tech Salon

After the COVID‑19 pandemic, the travel industry saw rapid traffic growth, which doubled the release time of the platform's core applications and severely hurt development efficiency. To address this, the team explored a shrink‑then‑expand deployment model that isolates each pod on a dedicated physical machine, reducing release duration and avoiding pod eviction during peak periods.

Vertical scaling of pod instances and assigning each pod to its own physical host lowered release time and eliminated eviction issues.

Switching the release model from expand‑then‑shrink to shrink‑then‑expand removed the need for extra buffer machines, cutting resource costs.

The problem analysis identified two issues during high‑traffic releases: doubled release duration, plus two side effects, namely upstream interface timeouts caused by increased metadata and saturated resource pools (Redis, MySQL). Increasing the batch count mitigated concurrency pressure but introduced additional retry delays of about five minutes per failed pod.
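The trade‑off above can be made concrete with a rough release‑time model. This sketch is purely illustrative: the per‑batch duration is an assumed parameter, and only the roughly five‑minute retry delay per failed pod comes from the article.

```python
import math

def estimate_release_minutes(pods, batch_size, minutes_per_batch,
                             failed_pods, retry_minutes=5):
    """Rough model: batches roll out sequentially, and each failed pod
    adds a fixed retry delay (about five minutes, per the article).
    minutes_per_batch is an illustrative assumption."""
    batches = math.ceil(pods / batch_size)
    return batches * minutes_per_batch + failed_pods * retry_minutes
```

Smaller batches ease concurrency pressure on pre‑heat and resource pools, but the extra sequential batches plus any retries lengthen the overall release, which is the tension the team was trying to resolve.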

Detailed pod lifecycle analysis showed that most failures occurred during the pre‑heat stage, either due to high‑concurrency pre‑heat timeouts or resource pressure causing Kubernetes eviction. The team considered exclusive pod placement on physical machines, but resource waste was a concern.

Solution Design

Multiple deployment options were evaluated:

KVM Deployment: No extra development cost, but slow scaling due to manual approval.

Container Deployment – Expand‑Then‑Shrink: Fast rollout using existing pipelines, but higher resource cost.

Container Deployment – Shrink‑Then‑Expand: Leverages HPA for elastic scaling, balancing cost and speed, though it requires additional development effort.

Argo CD Semi‑Automatic Deployment: Quick rollout, but requires new operational knowledge and lacks one‑click rollback.

After discussion, the team adopted Argo CD as a temporary solution while implementing the shrink‑then‑expand mode in the container platform.
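In outline, the two container release modes differ only in the ordering of scale operations. The following is a minimal sketch against a toy stand‑in for the container platform (the `Deployment` class and its methods are illustrative assumptions, not the platform's actual API):

```python
class Deployment:
    """Toy stand-in for the container platform (illustrative only)."""
    def __init__(self, replicas):
        self.counts = {"old": replicas, "new": 0}
        self.peak = replicas  # max pods alive at any one time

    def scale(self, version, delta):
        self.counts[version] += delta
        self.peak = max(self.peak, sum(self.counts.values()))

def expand_then_shrink(d):
    # Bring up all new-version pods first, then retire the old ones.
    # Requires buffer capacity for old + new pods simultaneously.
    d.scale("new", d.counts["old"])
    d.scale("old", -d.counts["old"])

def shrink_then_expand(d, batch):
    # Retire a batch of old pods, then reuse the freed resources for
    # new pods. No buffer machines needed, but available instances
    # must be watched so they never drop to zero.
    while d.counts["old"] > 0:
        n = min(batch, d.counts["old"])
        d.scale("old", -n)
        d.scale("new", n)
```

The sketch makes the cost difference visible: expand‑then‑shrink momentarily doubles the pod footprint, while shrink‑then‑expand never exceeds the steady‑state replica count.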

Open‑Source Tool Research

The team compared Argo CD, Argo Rollouts, Kruise Rollouts, and KubeVela for supporting the shrink‑then‑expand pattern.

Argo Rollouts: Provides progressive delivery (blue‑green, canary) and integrates well with Argo CD, but focuses on delivery rather than full‑stack platform features.

Kruise Rollouts: Kubernetes‑native solution with advanced rolling‑update strategies; community activity is modest.

KubeVela: Offers a high‑level application platform with OAM abstraction, supporting shrink‑then‑expand, but adds complexity for simple use cases.

The investigation concluded that KubeVela already implements shrink‑then‑expand, providing useful references, while the final implementation would use a dual‑deployment approach estimated to require 23 person‑days.

Key Design Points of the Shrink‑Then‑Expand Model

Critical checks were added at each stage of the release (pre‑release, during release, post‑release): the number of available instances must never drop to zero, upstream service lists are validated after scaling, and pod readiness is monitored via the pod_ready_total metric, with alerts triggered when availability falls below 50%.
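A minimal sketch of such an availability gate, assuming ready and desired counts are read from a metrics source exposing pod_ready_total (the function shape and threshold default are illustrative; only the never‑zero rule and the 50% alert line come from the article):

```python
def availability_gate(ready, desired, alert_threshold=0.5):
    """Decide whether the next release batch may proceed, and whether
    to raise an alert. `ready` would come from pod_ready_total;
    `desired` is the expected instance count for the application."""
    if desired == 0:
        return False, True          # nothing should be running: block and alert
    proceed = ready > 0             # never let available instances hit zero
    alert = (ready / desired) < alert_threshold
    return proceed, alert
```

In practice a gate like this would run before each batch of the shrink‑then‑expand cycle, pausing the release rather than letting it push availability below the floor.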

Practical Challenges and Solutions

Resource release delay after shrinking caused scheduling failures; the solution added a verification step with a one‑minute timeout before proceeding to expansion.
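That verification step amounts to polling with a deadline. The one‑minute timeout matches the article; the resource‑check callback and polling interval are assumptions for the sketch:

```python
import time

def wait_for_resources_released(check_released, timeout_s=60, interval_s=2):
    """Poll until the shrunk pods' resources are actually freed on the
    physical machines, or give up after the timeout so the release
    fails fast instead of hitting scheduling errors during expansion.
    `check_released` is a hypothetical callable returning True once
    the scheduler sees the capacity as free."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_released():
            return True
        time.sleep(interval_s)
    return False
```

The expansion phase only starts when this returns True; a False return aborts the batch with a clear error rather than a confusing scheduling failure.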

During peak periods, disabling HPA led to capacity shortages; the remedy involved splitting the core application into multiple environments, pre‑evaluating traffic, adjusting batch counts, and exploring dual‑deployment HPA activation.
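Pre‑evaluating traffic to adjust batch counts can be sketched as a simple capacity calculation. All figures and the per‑pod capacity notion here are illustrative assumptions, not the team's actual sizing method:

```python
import math

def max_batch_size(replicas, peak_qps, pod_capacity_qps):
    """How many pods can be taken down at once while the remaining
    pods still cover the pre-evaluated peak traffic. With HPA disabled
    during the release, this headroom is all the safety margin there is."""
    needed = math.ceil(peak_qps / pod_capacity_qps)
    return max(replicas - needed, 0)
```

A result of zero signals that the current window has no headroom at all, which is exactly the capacity‑shortage situation that motivated splitting the core application across environments and exploring dual‑deployment HPA activation.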

Results

The new approach yielded substantial improvements:

Release Efficiency: 70% faster release and rollback for 11 core applications.

Resource Cost: 32.5% CPU reduction.

System Pressure: Instance count reduced by an order of magnitude, stabilizing downstream dependencies.

Conclusion and Future Work

The shrink‑then‑expand strategy combined with physical‑machine container deployment effectively solved release‑time bottlenecks but highlighted capacity‑assessment risks. Planned enhancements include a release guardian for real‑time risk assessment and enabling HPA during releases to handle sudden traffic spikes.

cloud-native, operations, deployment, container, release-strategy
Written by Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
