Automated Capacity Planning and Auto‑Scaling for Hotel Services During Traffic Peaks
This document describes a capacity‑planning solution for hotel services that predicts the impact of traffic peaks, estimates the CPU resources required, creates timed scaling tasks, and evaluates the results against defined metrics. It improves operational efficiency and reduces manual effort during events such as exam‑ticket printing and holiday travel surges.
1. Background
Traffic spikes caused by events such as exam‑ticket printing or holiday travel can overwhelm hotel services, leading to performance degradation, throttling, or crashes. With the existing Horizontal Pod Autoscaler (HPA) setup, engineers must calculate the required number of machines by hand, which is inaccurate and inefficient.
Automatically estimating event impact and pre‑scaling services can protect stability and improve operational efficiency.
2. Overall Solution
The solution integrates a traffic‑calendar platform, an algorithm service, and Ops (operations) interfaces to predict CPU requirements and trigger automatic scaling.
2.1 System Architecture
(1) The traffic‑calendar platform aggregates business monitoring data and obtains CPU core counts from Ops.
(2) It determines the peak order/QPS value for the event and calls the algorithm service to predict the total CPU cores needed.
(3) Ops converts the predicted CPU cores into an estimated instance count and schedules automatic scaling tasks.
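Step (3) can be sketched as follows. This is a minimal illustration, not the Ops implementation: the function names, the `cores_per_instance` parameter, and the rounding rule (always round up) are assumptions that do not appear in the original description.

```python
import math


def cores_to_instances(predicted_cores: float, cores_per_instance: float) -> int:
    """Convert predicted CPU cores into an instance count.

    Assumption: every instance has the same core allocation, and the result
    is rounded up so the fleet never provides fewer cores than predicted.
    """
    if cores_per_instance <= 0:
        raise ValueError("cores_per_instance must be positive")
    if predicted_cores < 0:
        raise ValueError("predicted_cores must not be negative")
    return math.ceil(predicted_cores / cores_per_instance)


def additional_instances(predicted_cores: float,
                         cores_per_instance: float,
                         current_instances: int) -> int:
    """Instances to add on top of the running ones (never negative)."""
    target = cores_to_instances(predicted_cores, cores_per_instance)
    return max(0, target - current_instances)
```

For example, a prediction of 130 cores on 8‑core instances yields 17 instances; with 12 already running, the scaling task would add 5.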
2.2 Business Process
The event lifecycle includes nine stages: pre‑judgment → pending evaluation → evaluating → evaluation completed → task creation → scaling → scaling completed → review → closed.
Key steps:
Hotspot Event Entry: Events can be imported from Ctrip or created manually, entering the pre‑judgment state.
Event Pre‑judgment: Determines whether automatic scaling is needed; if not, the event ends.
Pending Evaluation: Estimates peak business volume using the formula Peak Business = Baseline × (1 + Growth Rate).
Evaluation: The traffic‑calendar calls the algorithm to predict CPU usage for the estimated peak.
Task Creation: Ops creates timed scaling tasks based on the predicted instance count, using a mix of on‑premise and cloud resources.
Review & Closure: After the peak, the system reviews accuracy and coverage metrics.
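The lifecycle and the peak formula above can be sketched as a small state machine. The stage names come from the text; the `Stage` enum, the `next_stage` and `estimate_peak` functions, and the choice to map "the event ends" to the closed stage are assumptions made for illustration.

```python
from enum import Enum


class Stage(Enum):
    PRE_JUDGMENT = "pre-judgment"
    PENDING_EVALUATION = "pending evaluation"
    EVALUATING = "evaluating"
    EVALUATION_COMPLETED = "evaluation completed"
    TASK_CREATION = "task creation"
    SCALING = "scaling"
    SCALING_COMPLETED = "scaling completed"
    REVIEW = "review"
    CLOSED = "closed"


# Enum members iterate in definition order, which matches the lifecycle.
ORDER = list(Stage)


def next_stage(current: Stage, needs_scaling: bool = True) -> Stage:
    """Advance an event by one stage.

    Assumption: an event judged not to need scaling goes straight to CLOSED.
    """
    if current is Stage.CLOSED:
        raise ValueError("event is already closed")
    if current is Stage.PRE_JUDGMENT and not needs_scaling:
        return Stage.CLOSED
    return ORDER[ORDER.index(current) + 1]


def estimate_peak(baseline: float, growth_rate: float) -> float:
    """Peak Business = Baseline x (1 + Growth Rate)."""
    return baseline * (1 + growth_rate)
```

With a baseline of 1,000 orders and a growth rate of 0.5 (50%), `estimate_peak` returns 1,500 orders.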
2.3 Metrics
Key indicators include:
Prediction accuracy (a, M, N, K codes)
Coverage rate
Mean absolute percentage error (MAPE)
Order‑CPU correlation coefficient
Average actual CPU usage
Mean absolute error of CPU prediction
Formulas such as Platform Estimated Cores = Algorithm Predicted Cores × (1 + Safety Threshold) are used to compute final scaling numbers.
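Several of these indicators have standard definitions, sketched below. The function names are illustrative, and the Pearson coefficient is an assumption about how the order‑CPU correlation is measured; the article does not define coverage rate or the accuracy codes, so they are left out.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, as a fraction (0.08 means 8%)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean absolute error, in the same unit as the inputs (e.g. cores)."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def platform_estimated_cores(algorithm_cores, safety_threshold):
    """Platform Estimated Cores = Algorithm Predicted Cores x (1 + Safety Threshold)."""
    return algorithm_cores * (1 + safety_threshold)
```

For instance, 100 predicted cores with a safety threshold of 0.2 gives a platform estimate of about 120 cores.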
2.4 Algorithm Details
The model is a neural network trained on the most recent two months of data from containerized applications with auto‑scaling enabled. Training data includes application, sub‑environment, timestamp, order volume, and CPU usage. The model treats order volume as the primary factor influencing CPU consumption.
Model validation uses metrics such as MAPE (0.08) and order‑CPU correlation (0.91). The model is retrained periodically, with offline validation before release and performance monitoring once online.
2.5 HPA Scaling Safety Strategy
Safety limits are applied to maximum and minimum replica counts. If predicted instances exceed the configured maximum, creation is blocked with a notification. If predicted instances fall below the minimum threshold (1‑a%), scaling is restricted to maintain stability.
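A minimal sketch of these two checks follows. The `ScalingDecision` type and `check_scaling` function are hypothetical. The text does not define the (1‑a%) threshold further, so the floor is taken here as a plain `min_replicas` parameter, and "restricted" is interpreted, as an assumption, as holding the replica count at that floor.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScalingDecision:
    allowed: bool             # False means task creation is blocked
    replicas: Optional[int]   # replica count to schedule, if allowed
    reason: str


def check_scaling(predicted: int, min_replicas: int, max_replicas: int) -> ScalingDecision:
    """Apply the maximum and minimum replica limits to a predicted count."""
    if min_replicas > max_replicas:
        raise ValueError("min_replicas must not exceed max_replicas")
    if predicted > max_replicas:
        # Above the configured maximum: block creation and notify.
        return ScalingDecision(False, None,
                               f"predicted {predicted} exceeds maximum {max_replicas}")
    if predicted < min_replicas:
        # Assumption: restricting means keeping the count at the floor.
        return ScalingDecision(True, min_replicas,
                               f"predicted {predicted} is below minimum {min_replicas}")
    return ScalingDecision(True, predicted, "within limits")
```

Returning a decision object instead of raising on the over‑limit case keeps the blocked state and its reason available for the notification step.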
3. Project Data & Value
• Over 150 hotel applications (≈90% of total cores) are integrated.
• Completed high‑peak event protection for exam‑related and holiday traffic.
• Average coverage: 96%; average accuracy: 89%.
• Each peak event saves ~3 person‑days of manual ops, totaling ~270 person‑days annually, and reduces resource prediction cost by ~20%.
4. Future Plans
Expand intelligent scaling to all application scenarios, including bare‑metal and KVM, and add DB/Redis resource checks.
Leverage AI to improve business‑volume forecasting and strengthen order‑CPU correlation.
Broaden adoption across all business lines for company‑wide resource orchestration.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.