How HuoLala Mastered Multi‑AZ Cloud Server Management and Solved ECS Bottlenecks
This article details HuoLala’s evolution from a single‑region cloud setup to a multi‑AZ, multi‑cloud architecture, covering network latency, ECS resource constraints, operational event handling, hot migration techniques, and LLC contention mitigation strategies.
1. Background
HuoLala was one of the early adopters of cloud resources. As cloud server usage scaled, several detailed issues required special attention, including availability zones, network latency and jitter, the latest generation ECS instance specifications, and multi‑AZ usage.
2. HuoLala Cloud Server Usage Practice
Initially, HuoLala used a single region with only two availability zones, and most services were self‑built on ECS. Over time, the cloud provider expanded its IaaS, PaaS, and SaaS offerings and increased the number of zones from 2 to 6. In 2018, rapid growth demanded large ECS capacity, which led HuoLala to adopt a multi‑cloud architecture. The following sections share practical experience using ECS resources.
2.1 Network Latency Between AZs
Multi‑AZ self‑built clusters must consider inter‑zone latency, typically under 3 ms, though occasional jitter can occur.
Latency‑sensitive workloads experience lower latency when servers reside in the same AZ.
Cloud network intelligence services can display latency between zones.
Monitoring tools such as SmokePing help track latency and packet loss.
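As a rough illustration of what such monitoring measures, the sketch below times TCP connects to a peer in another AZ and summarizes median latency, tail latency, jitter, and loss. The function names, the probe address in the comment, and the choice of TCP connect time as the probe are assumptions for illustration; this is not how SmokePing itself works. The 3 ms threshold comes from the typical inter-zone figure above.

```python
import math
import socket
import time
from typing import Optional, Sequence


def tcp_connect_ms(host: str, port: int, timeout_s: float = 1.0) -> Optional[float]:
    """Time one TCP connect in milliseconds; None means the probe failed."""
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return (time.perf_counter() - start) * 1000.0
    except OSError:
        return None


def summarize_latency(samples_ms: Sequence[Optional[float]],
                      threshold_ms: float = 3.0) -> dict:
    """Summarize probe results in time order; None marks a lost probe."""
    sent = len(samples_ms)
    if sent == 0:
        raise ValueError("no samples")
    in_time_order = [s for s in samples_ms if s is not None]
    received = sorted(in_time_order)
    summary = {
        "loss_pct": 100.0 * (sent - len(received)) / sent,
        "median_ms": None,
        "p99_ms": None,
        "jitter_ms": None,
        "over_threshold": 0,
    }
    if not received:
        return summary

    def percentile(p: float) -> float:
        # Nearest-rank percentile over the received samples.
        rank = max(1, math.ceil(p / 100.0 * len(received)))
        return received[rank - 1]

    # Jitter here is the mean absolute difference between consecutive probes.
    diffs = [abs(b - a) for a, b in zip(in_time_order, in_time_order[1:])]
    summary["median_ms"] = percentile(50)
    summary["p99_ms"] = percentile(99)
    summary["jitter_ms"] = sum(diffs) / len(diffs) if diffs else 0.0
    summary["over_threshold"] = sum(1 for s in received if s > threshold_ms)
    return summary


# Hypothetical usage against a peer in another AZ (address is made up):
# samples = [tcp_connect_ms("10.0.2.15", 22) for _ in range(20)]
# print(summarize_latency(samples))
```

Running the same probe from hosts in each AZ toward every other AZ gives a simple latency matrix that can be compared against the provider's own figures.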
2.2 New AZ Resource Issues
Newer ECS architectures and instance types are usually bound to the latest AZs. Older zones can only obtain new specifications after the provider upgrades the zone.
2.3 Insufficient ECS Resources
When large events require many ECS instances, reserve resources a month in advance.
Ask the cloud provider to lock specific AZs and instance types.
Prefer general‑purpose instances; niche types have limited cluster scale.
Distribute workloads across multiple AZs or clouds to avoid single‑zone shortages.
Consider ARM‑based servers where applicable.
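The idea of spreading a large request across zones can be sketched as a simple round-robin planner. The zone names and capacity numbers are hypothetical, and real capacity would have to come from the provider's reservation or inventory information; this is only one possible allocation policy.

```python
from typing import Dict


def plan_across_zones(needed: int, capacity: Dict[str, int]) -> Dict[str, int]:
    """Allocate instances one at a time across zones until the need is met.

    Round-robin keeps the plan as even as capacity allows, so a shortage
    in one zone shifts load to the others instead of failing the request.
    """
    plan = {zone: 0 for zone in capacity}
    remaining = needed
    while remaining > 0:
        progressed = False
        for zone in sorted(capacity):
            if remaining == 0:
                break
            if plan[zone] < capacity[zone]:
                plan[zone] += 1
                remaining -= 1
                progressed = True
        if not progressed:
            # Every zone is full: this is the case that needs an advance
            # reservation, another instance type, or another cloud.
            raise RuntimeError(f"short by {remaining} instances")
    return plan


# Hypothetical example: zone az-a is nearly sold out.
print(plan_across_zones(10, {"az-a": 2, "az-b": 10, "az-c": 10}))
```

In the example, az-a contributes its 2 available instances and the other two zones absorb 4 each.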
2.4 ECS Aggregation Issues
Data‑center capacity limits can cause many ECS instances of the same cluster to be placed on a single physical host, so that one host failure takes down a large share of the service at once. Mitigation includes:
Regular aggregation inspections and dispersal requests.
Deploy services across multiple AZs.
Use provider‑specific placement features (e.g., Alibaba Cloud deployment sets, Huawei Cloud ECS groups, Tencent Cloud placement groups) with high‑availability or low‑latency strategies.
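An aggregation inspection can be sketched as a grouping pass over placement data. This assumes the instance-to-host mapping is available at all: tenants usually cannot see physical host identifiers, so the host IDs here are hypothetical input, for example obtained from the provider on request. The function name and tuple layout are also illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_aggregation(placements: Iterable[Tuple[str, str, str]],
                     max_per_host: int = 1) -> Dict[Tuple[str, str], List[str]]:
    """Return (service, host) pairs holding more than max_per_host instances.

    placements: iterable of (service, instance_id, host_id) tuples.
    """
    per_host = defaultdict(list)
    for service, instance_id, host_id in placements:
        per_host[(service, host_id)].append(instance_id)
    return {
        key: sorted(ids)
        for key, ids in per_host.items()
        if len(ids) > max_per_host
    }


# Hypothetical inventory: two "order" instances share host h-1.
inventory = [
    ("order", "i-1", "h-1"),
    ("order", "i-2", "h-1"),
    ("order", "i-3", "h-2"),
    ("pay", "i-4", "h-1"),
]
print(find_aggregation(inventory))
```

Each flagged pair is a candidate for a dispersal request to the provider.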
3. ECS Operational Events
Operational events are inevitable. They are categorized as:
Planned maintenance (cloud platform notifies users of potential risks).
Unplanned incidents (hardware failures causing VM crashes).
Local‑disk events (require data backup and may need reboot or redeployment).
| Event Type | Advance Notice | Backup Needed | Reboot Required | Follow‑up Action | Example Scenario |
| --- | --- | --- | --- | --- | --- |
| Hot Migration | No | No | No | Alert if VM experiences jitter | Common ECS network jitter |
| Planned Maintenance | Yes | No | Yes | Check service health | All |
| Unplanned Incident | No | No | Yes | Check service health | All |
| Local Disk Event | Yes | Yes (backup required) | Partial/Full | Check service health | Big Data – CDH Hadoop |
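The event categories above can be encoded as data so that an alert handler produces a consistent checklist. This is a minimal sketch: the notice, backup, and reboot values mirror the table, while the key names and the wording of the steps (scheduling a window, draining traffic) are assumptions added for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EventPolicy:
    advance_notice: bool
    backup_needed: bool
    reboot: str       # "no", "yes", or "partial/full"
    follow_up: str


# Values taken from the event table; the dictionary keys are made up.
POLICIES: Dict[str, EventPolicy] = {
    "hot_migration": EventPolicy(False, False, "no", "alert if the VM experiences jitter"),
    "planned_maintenance": EventPolicy(True, False, "yes", "check service health"),
    "unplanned_incident": EventPolicy(False, False, "yes", "check service health"),
    "local_disk_event": EventPolicy(True, True, "partial/full", "check service health"),
}


def runbook(event_type: str) -> List[str]:
    """Build an ordered checklist for one operational event."""
    policy = POLICIES[event_type]
    steps = []
    if policy.advance_notice:
        steps.append("schedule a window before the provider deadline")
    if policy.backup_needed:
        steps.append("back up local-disk data")
    if policy.reboot != "no":
        steps.append("drain traffic before the reboot")
    steps.append(policy.follow_up)
    return steps


print(runbook("local_disk_event"))
```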
3.1 Hot Migration Technology
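Hot migration moves a running VM between physical hosts without downtime. Techniques include Pre‑Copy, Post‑Copy, and Hybrid Copy. High memory write rates can cause migration failures: with Pre‑Copy, for example, a VM that dirties memory pages faster than they can be transferred never reaches a point where the remaining pages fit into a brief final pause.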
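The effect of memory write rate on pre-copy can be seen in a toy model. The sketch assumes a constant dirty rate and constant bandwidth, which real hypervisors do not have, and the parameter names and the 0.3 s pause budget are illustrative, not any provider's values.

```python
from typing import Optional


def precopy_rounds(memory_mib: float,
                   bandwidth_mib_s: float,
                   dirty_mib_s: float,
                   max_downtime_s: float = 0.3,
                   max_rounds: int = 30) -> Optional[int]:
    """Return how many live copy rounds are needed, or None if it never converges.

    Each round sends the pages dirtied during the previous round. Migration
    can finish once the remaining pages fit into the allowed final pause.
    """
    to_send = float(memory_mib)
    stop_copy_limit = max_downtime_s * bandwidth_mib_s
    for rounds_done in range(max_rounds + 1):
        if to_send <= stop_copy_limit:
            return rounds_done
        copy_time_s = to_send / bandwidth_mib_s
        # Pages dirtied while copying; cannot exceed total memory.
        to_send = min(float(memory_mib), dirty_mib_s * copy_time_s)
    return None


# 8 GiB VM over a 1 GiB/s link.
print(precopy_rounds(8192, 1024, 128))    # light writer: converges
print(precopy_rounds(8192, 1024, 2048))   # dirties faster than the link: None
```

When the dirty rate is below the bandwidth, the amount left to send shrinks geometrically each round; when it is at or above the bandwidth, the remaining set never shrinks, which is the failure mode described above.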
3.2 Hot Migration Use Cases
Proactive maintenance when a host shows signs of impending failure.
Load balancing by moving VMs from overloaded hosts.
Any scenario requiring VM movement without reboot.
3.3 Hot Migration Limitations
Only between identical host types with matching firmware.
Not supported for VMs using local storage.
VMs with GPU/FPGA or passthrough devices cannot be hot‑migrated.
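The limitations above amount to a pre-flight check before requesting a hot migration. The sketch below encodes them; the `Host` and `Vm` types and their field names are hypothetical and do not correspond to any provider's API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Host:
    host_type: str
    firmware: str


@dataclass(frozen=True)
class Vm:
    uses_local_disk: bool = False
    passthrough_devices: Tuple[str, ...] = ()


def hot_migration_blockers(vm: Vm, source: Host, target: Host) -> List[str]:
    """List every reason this VM cannot be hot-migrated; empty means eligible."""
    blockers = []
    if source.host_type != target.host_type:
        blockers.append("host type differs")
    if source.firmware != target.firmware:
        blockers.append("firmware differs")
    if vm.uses_local_disk:
        blockers.append("VM uses local storage")
    if vm.passthrough_devices:
        blockers.append("VM has GPU/FPGA or passthrough devices")
    return blockers


src = Host("gen7", "fw-1.2")
dst = Host("gen7", "fw-1.2")
print(hot_migration_blockers(Vm(), src, dst))
print(hot_migration_blockers(Vm(uses_local_disk=True, passthrough_devices=("gpu0",)), src, dst))
```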
3.4 Cloud Provider Support
| Provider | Hot Migration (Cloud Disk VM) | Planned Maintenance | Unplanned Incident | Local Disk Swap | Local Disk Redeploy |
| --- | --- | --- | --- | --- | --- |
| Alibaba Cloud | No reboot (DingTalk request) | API/Console reboot → auto‑migrate | Crash → auto‑migrate | No reboot (data cleared) | Redeploy (data cleared) |
| Huawei Cloud | No reboot (WeChat request) | API/Console reboot → auto‑migrate | Crash → auto‑migrate | Partial/no reboot, hot‑plug supported | Redeploy (data cleared) |
| Tencent Cloud | No reboot (WeChat request) | API/Console reboot → auto‑migrate | Crash → auto‑migrate | No reboot (data cleared) | Redeploy (data cleared) |
| AWS | Not supported | API/Console → stop → start | Crash → auto‑migrate | Not applicable | Not applicable |
| Azure | Not supported | API/Console → stop → start | Crash → auto‑migrate | Not applicable | Not applicable |
4. LLC Contention and Its Impact
LLC (Last Level Cache, L3) is shared among cores. Contention can raise CPU usage on a node by 10‑20% and increase application response time (RT). Restarting services or the VM often does not resolve the issue; hot migration to a less‑loaded host restores normal performance.
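Because LLC contention shows up as one node running hotter and slower than its peers under the same traffic, a simple peer comparison can flag candidates for hot migration. This is a heuristic sketch, not a direct measurement of cache misses: the metric names and thresholds are assumptions, with the 10-point CPU gap loosely based on the 10-20% figure above.

```python
from statistics import median
from typing import Dict, List


def suspect_llc_contention(nodes: Dict[str, Dict[str, float]],
                           cpu_delta_pct: float = 10.0,
                           rt_ratio: float = 1.2) -> List[str]:
    """Flag nodes whose CPU and response time both stand out from peers.

    nodes maps a node name to {"cpu_pct": ..., "rt_ms": ...} for instances
    of the same service receiving comparable traffic.
    """
    cpu_median = median(n["cpu_pct"] for n in nodes.values())
    rt_median = median(n["rt_ms"] for n in nodes.values())
    return sorted(
        name
        for name, n in nodes.items()
        if n["cpu_pct"] - cpu_median >= cpu_delta_pct
        and n["rt_ms"] >= rt_ratio * rt_median
    )


# Hypothetical metrics for four nodes of one service.
metrics = {
    "node-a": {"cpu_pct": 40.0, "rt_ms": 20.0},
    "node-b": {"cpu_pct": 42.0, "rt_ms": 21.0},
    "node-c": {"cpu_pct": 41.0, "rt_ms": 20.0},
    "node-d": {"cpu_pct": 56.0, "rt_ms": 30.0},
}
print(suspect_llc_contention(metrics))
```

A node that stays flagged after a service restart or VM reboot matches the pattern described above and is a reasonable candidate for a hot-migration request.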
4.1 LLC Contention Is Independent of Instance Size
Both small (4C 8G) and large (32C 128G) instances placed on the same physical host can suffer from LLC contention; larger specs do not guarantee immunity.
4.2 CPU Architecture Relationship
Intel x86 CPUs share L3 cache across cores.
AMD EPYC CPUs use a CCD/CCX design in which L3 is shared only within each CCX (16 MiB per CCX on Zen 2; the size varies by generation).
ARM servers (e.g., Alibaba Yitian 710) provide dedicated physical cores without hyper‑threading.
5. Ongoing Collaboration with Cloud Vendors
HuoLala regularly communicates with cloud providers to set joint 1‑3 year goals and ensure continuous improvement of the cloud environment.
