Adopting a Shrink‑Then‑Expand Deployment Model to Improve Release Efficiency in a Large‑Scale Travel Platform
This article analyzes the release‑time bottlenecks of a core travel platform after the post‑COVID traffic surge and presents a shrink‑then‑expand deployment strategy combined with physical‑machine container deployment, evaluating several open‑source solutions and demonstrating significant improvements in release speed, resource cost, and system stability.
After the COVID‑19 pandemic, the travel industry experienced a rapid growth in traffic, causing the core application of the platform to double its release time and severely affect development efficiency. To address this, the team explored a shrink‑then‑expand deployment model that isolates each pod on a dedicated physical machine, reducing release duration and avoiding pod eviction during peak periods.
Vertically scaling pod instances and giving each pod its own physical host reduced release time and eliminated the eviction issues.
Switching the release model from expand‑then‑shrink to shrink‑then‑expand removed the need for extra buffer machines, cutting resource costs.
The problem analysis identified the main phenomenon during high‑traffic releases: release duration doubled, accompanied by two side effects, namely upstream interface timeouts caused by metadata growth and saturated resource pools (Redis, MySQL). Increasing the batch count reduced per‑batch concurrency but added roughly five minutes of retry delay for each failed pod.
Detailed pod lifecycle analysis showed that most failures occurred during the pre‑heat stage, either due to high‑concurrency pre‑heat timeouts or resource pressure causing Kubernetes eviction. The team considered exclusive pod placement on physical machines, but resource waste was a concern.
Solution Design
Multiple deployment options were evaluated:
KVM Deployment: No extra development cost but slow scaling due to manual approval.
Container Deployment – Expand‑Then‑Shrink: Fast rollout using existing pipelines but higher resource cost.
Container Deployment – Shrink‑Then‑Expand: Leverages HPA for elastic scaling, balancing cost and speed, though it requires additional development effort.
Argo CD Semi‑Automatic Deployment: Quick rollout but requires new operational knowledge and lacks one‑click rollback.
After discussion, the team adopted Argo CD as a temporary solution while implementing the shrink‑then‑expand mode in the container platform.
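The capacity difference between the two container modes can be made concrete with a toy simulation. The sketch below is illustrative only: `Cluster`, `scale_up`, and `scale_down` are hypothetical stand‑ins for the platform's real scheduling calls, and it tracks only one thing, the peak number of machines in use during a rolling release.

```python
from itertools import islice

def batched(seq, n):
    """Yield successive batches of n items (a release batch)."""
    it = iter(seq)
    while chunk := list(islice(it, n)):
        yield chunk

class Cluster:
    """Toy cluster that tracks how many machines are in use at once."""
    def __init__(self, pods):
        self.pods = set(pods)
        self.peak = len(self.pods)

    def scale_up(self, pod):
        self.pods.add(pod)
        self.peak = max(self.peak, len(self.pods))

    def scale_down(self, pod):
        self.pods.remove(pod)

def expand_then_shrink(cluster, old_pods, batch):
    """New pods come up before old ones go away: needs buffer machines."""
    for group in batched(old_pods, batch):
        for pod in group:
            cluster.scale_up(f"new-{pod}")   # extra capacity consumed here
        for pod in group:
            cluster.scale_down(pod)

def shrink_then_expand(cluster, old_pods, batch):
    """Old pods go away first; the freed machines host the new pods."""
    for group in batched(old_pods, batch):
        for pod in group:
            cluster.scale_down(pod)          # free the physical machine first
        for pod in group:
            cluster.scale_up(f"new-{pod}")   # reuse the freed capacity

pods = [f"pod-{i}" for i in range(8)]
a = Cluster(pods)
expand_then_shrink(a, pods, batch=2)
b = Cluster(pods)
shrink_then_expand(b, pods, batch=2)
print(a.peak, b.peak)  # → 10 8
```

With eight pods and a batch size of two, expand‑then‑shrink peaks at ten machines while shrink‑then‑expand never exceeds eight, which is exactly the buffer‑machine saving the team was after.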
Open‑Source Tool Research
The team compared Argo CD, Argo Rollout, Kruise Rollout, and KubeVela for supporting the shrink‑then‑expand pattern.
Argo Rollout: Provides progressive delivery (blue‑green, canary) and integrates well with Argo CD, but focuses on delivery rather than full‑stack platform features.
Kruise Rollout: Native Kubernetes solution with advanced rolling update strategies; community activity is modest.
KubeVela: Offers a high‑level application platform with OAM abstraction, supporting shrink‑then‑expand, but adds complexity for simple use‑cases.
The investigation concluded that KubeVela already implements shrink‑then‑expand, providing useful references, while the final implementation would use a dual‑deployment approach estimated to require 23 person‑days.
Key Design Points of the Shrink‑Then‑Expand Model
Critical checks were added at each stage (pre‑release, during release, post‑release): the number of available instances must never drop to zero, upstream lists are validated after scaling, and pod readiness is monitored via the pod_ready_total metric, with an alert triggered if availability falls below 50%.
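The two guard conditions above can be sketched as simple predicates. This is a minimal illustration, not Qunar's implementation: the real checks would query the monitoring system for pod_ready_total rather than take counts as arguments, and the 50% threshold is the alert level stated in the article.

```python
ALERT_THRESHOLD = 0.5  # alert when ready availability falls below 50%

def can_shrink(ready_instances, batch_size):
    """Gate a shrink step: refuse it if it could drop availability to zero."""
    return ready_instances - batch_size > 0

def availability_alert(total_instances, ready_instances):
    """Mirror of the pod_ready_total-based alert on the ready ratio."""
    return ready_instances / total_instances < ALERT_THRESHOLD

print(can_shrink(10, 4))            # → True (6 ready instances remain)
print(can_shrink(4, 4))             # → False (would leave zero available)
print(availability_alert(10, 4))    # → True (40% ready, below the 50% alert line)
```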
Practical Challenges and Solutions
Resource release delay after shrinking caused scheduling failures; the solution added a verification step with a one‑minute timeout before proceeding to expansion.
During peak periods, disabling HPA led to capacity shortages; the remedy involved splitting the core application into multiple environments, pre‑evaluating traffic, adjusting batch counts, and exploring dual‑deployment HPA activation.
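The resource‑release verification from the first challenge is essentially a poll‑until‑freed loop with a hard deadline. The sketch below assumes a hypothetical `resources_released` probe against the scheduler; the one‑minute default timeout is the value mentioned above.

```python
import time

def wait_for_release(resources_released, timeout=60.0, interval=2.0):
    """Poll until the shrunk pods' resources are freed, or give up.

    Returns True once the probe reports the capacity is back, False if
    the timeout elapses, in which case the release aborts the expand
    step instead of scheduling onto machines that are still busy.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if resources_released():
            return True
        time.sleep(interval)
    return False

# Demo: a probe that only reports "freed" on its third poll.
calls = 0
def probe():
    global calls
    calls += 1
    return calls >= 3

print(wait_for_release(probe, timeout=1.0, interval=0.01))  # → True
```

Proceeding to expansion only when this returns True is what closed the gap between "shrink completed" and "capacity actually schedulable".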
Results
The new approach yielded substantial improvements:
Release Efficiency: 70% faster release and rollback for 11 core applications.
Resource Cost: 32.5% CPU reduction.
System Pressure: Instance count reduced by an order of magnitude, stabilizing downstream dependencies.
Conclusion and Future Work
The shrink‑then‑expand strategy combined with physical‑machine container deployment effectively solved release‑time bottlenecks but highlighted capacity‑assessment risks. Planned enhancements include a release guardian for real‑time risk assessment and enabling HPA during releases to handle sudden traffic spikes.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers, sharing cutting‑edge technology trends and offering a free venue for mid‑to‑senior technical professionals to exchange ideas and learn.