How HuoLala Mastered Multi‑AZ Cloud Server Management and Solved ECS Bottlenecks

This article details HuoLala’s evolution from a single‑region cloud setup to a multi‑AZ, multi‑cloud architecture, covering network latency, ECS resource constraints, operational event handling, hot migration techniques, and LLC contention mitigation strategies.

Huolala Tech

Jan 11, 2024

1. Background

HuoLala was an early adopter of cloud resources. As its cloud server fleet grew, several operational details demanded closer attention: availability zone (AZ) selection, network latency and jitter, access to the latest‑generation ECS instance types, and multi‑AZ deployment.

2. HuoLala Cloud Server Usage Practice

Initially, HuoLala used a single region with only two availability zones, and most services were self‑built on ECS. Over time, the cloud provider expanded its IaaS, PaaS, and SaaS offerings and grew the number of zones from two to six. In 2018, rapid business growth demanded large amounts of ECS capacity, which led HuoLala to adopt a multi‑cloud architecture. The following sections share practical experience with ECS resources.

2.1 Network Latency Between AZs

Multi‑AZ self‑built clusters must consider inter‑zone latency, typically under 3 ms, though occasional jitter can occur.

Latency‑sensitive workloads experience lower latency when servers reside in the same AZ.

Cloud network intelligence services can display latency between zones.

Monitoring tools such as SmokePing help track latency and packet loss.
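The latency checks above can be approximated with a few lines of code when a dedicated tool is not yet in place. The following is a minimal sketch, not HuoLala's tooling: it times TCP handshakes to a peer in another AZ and summarizes the samples. The function names and the 3 ms default budget (taken from the typical inter‑zone figure above) are illustrative assumptions.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int, timeout: float = 1.0) -> float:
    """Time one TCP handshake to host:port, in milliseconds."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0


def summarize(samples_ms: list[float], budget_ms: float = 3.0) -> dict:
    """Summarize latency samples against an assumed inter-AZ budget."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    return {
        "median_ms": statistics.median(ordered),
        "max_ms": ordered[-1],
        # Population standard deviation as a simple jitter indicator.
        "jitter_ms": statistics.pstdev(ordered),
        # Number of samples that exceeded the budget.
        "over_budget": sum(1 for s in ordered if s > budget_ms),
    }
```

Collecting samples with `tcp_connect_ms` at a fixed interval from each AZ to the others and feeding them to `summarize` gives a rough view of latency and jitter. A tool such as SmokePing does this continuously and also tracks packet loss.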

2.2 New AZ Resource Issues

Newer ECS architectures and instance types are usually bound to the latest AZs. Older zones can only obtain new specifications after the provider upgrades the zone.

2.3 Insufficient ECS Resources

When large events require many ECS instances, reserve resources a month in advance.

Ask the cloud provider to lock specific AZs and instance types.

Prefer general‑purpose instances; niche types have limited cluster scale.

Distribute workloads across multiple AZs or clouds to avoid single‑zone shortages.

Consider ARM‑based servers where applicable.
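The advice to spread workloads across zones can be sketched as a simple allocation routine. This is an illustrative example under stated assumptions: the per‑zone capacity figures would come from the provider or from reservations, and `spread_across_zones` is a hypothetical helper, not a cloud API.

```python
def spread_across_zones(demand: int, available: dict[str, int]) -> dict[str, int]:
    """Split `demand` instances across zones as evenly as capacity allows."""
    plan = {zone: 0 for zone in available}
    remaining = demand
    # Hand out one instance at a time to the least-loaded zone with spare capacity.
    while remaining > 0:
        candidates = [z for z in available if plan[z] < available[z]]
        if not candidates:
            raise RuntimeError(f"short by {remaining} instances")
        target = min(candidates, key=lambda z: (plan[z], z))
        plan[target] += 1
        remaining -= 1
    return plan
```

When one zone runs out, the remainder flows to the others. An exception here is the signal to reserve capacity earlier or to add another zone or cloud.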

2.4 ECS Aggregation Issues

Data‑center capacity limits can cause many instances of the same ECS cluster to be placed on a single physical host, so one host failure takes down a large share of a service. Mitigations include:

Inspect instance placement regularly and ask the provider to disperse aggregated instances.

Deploy services across multiple AZs.

Use the provider's placement feature (e.g., Alibaba Cloud deployment sets, Huawei Cloud server groups, Tencent Cloud placement groups) with a high‑availability or low‑latency strategy.
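An aggregation inspection like the one described above could look roughly like the sketch below. It assumes the provider can supply a host identifier for each instance, which is not generally exposed for shared hosts. The tuple layout and the `find_aggregated` name are hypothetical.

```python
from collections import Counter, defaultdict


def find_aggregated(
    placements: list[tuple[str, str, str]], max_per_host: int = 1
) -> dict[str, dict[str, int]]:
    """Report hosts that carry too many instances of one service.

    placements: (service, instance_id, host_id) tuples.
    Returns service -> {host_id: instance_count} for hosts over the limit.
    """
    per_service = defaultdict(Counter)
    for service, _instance_id, host_id in placements:
        per_service[service][host_id] += 1
    report = {}
    for service, hosts in per_service.items():
        crowded = {h: n for h, n in hosts.items() if n > max_per_host}
        if crowded:
            report[service] = crowded
    return report
```

The resulting report is the input for a dispersal request to the provider, or for rebuilding the affected instances inside a deployment set.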

3. ECS Operational Events

Operational events are inevitable. They are categorized as:

Planned maintenance (cloud platform notifies users of potential risks).

Unplanned incidents (hardware failures causing VM crashes).

Local‑disk events (require data backup and may need reboot or redeployment).

| Event Type | Advance Notice | Backup Needed | Reboot Required | Follow‑up Action | Example Scenario |
| --- | --- | --- | --- | --- | --- |
| Hot Migration | | | | Alert if VM experiences jitter | Common ECS network jitter |
| Planned Maintenance | Yes | | | Check service health | All |
| Unplanned Incident | | | Yes | Check service health | All |
| Local Disk Event | Yes | Yes (backup required) | Partial/Full | Check service health | Big Data – CDH Hadoop |

3.1 Hot Migration Technology

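Hot migration (live migration) moves a running VM to another physical host without a reboot; the guest sees at most a brief pause during the final switchover. Common techniques are Pre‑Copy, Post‑Copy, and Hybrid Copy. With Pre‑Copy, memory is copied while the VM keeps running and pages modified in the meantime are re‑sent, so a VM that writes memory faster than it can be transferred may never converge and the migration fails.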
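Why a high memory write rate matters can be seen in a toy model of Pre‑Copy. The sketch below is a simplification under stated assumptions: a constant dirty rate, constant bandwidth, and a hypothetical 64 MiB threshold below which the hypervisor pauses the VM for the final copy. Real hypervisors use their own convergence criteria.

```python
def precopy_rounds(mem_mib: float, dirty_mib_s: float, bw_mib_s: float,
                   stop_copy_mib: float = 64.0, max_rounds: int = 30):
    """Return the copy rounds needed before the final stop-and-copy,
    or None if the dirty set never shrinks below the threshold."""
    to_send = mem_mib  # round 1 copies all guest memory
    for rounds in range(1, max_rounds + 1):
        seconds = to_send / bw_mib_s
        # Pages written while this round was in flight must be re-sent.
        to_send = min(mem_mib, dirty_mib_s * seconds)
        if to_send <= stop_copy_mib:
            return rounds
    return None
```

With 8 GiB of memory and 1000 MiB/s of bandwidth, a dirty rate of 100 MiB/s converges in a few rounds, while a dirty rate at or above the bandwidth never does. In this model the remaining data shrinks by the ratio of dirty rate to bandwidth each round, so that ratio must stay below 1.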

3.2 Hot Migration Use Cases

Proactive maintenance when a host shows signs of impending failure.

Load balancing by moving VMs from overloaded hosts.

Any scenario requiring VM movement without reboot.

3.3 Hot Migration Limitations

Only between identical host types with matching firmware.

Not supported for VMs using local storage.

VMs with GPU/FPGA or passthrough devices cannot be hot‑migrated.
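The limitations above translate naturally into a pre‑flight check. This is a minimal sketch: the `Vm` fields and the `hot_migration_blockers` helper are hypothetical and only encode the three rules listed in this section, not any provider's actual eligibility logic.

```python
from dataclasses import dataclass


@dataclass
class Vm:
    host_type: str
    host_firmware: str
    uses_local_disk: bool = False
    passthrough_devices: tuple[str, ...] = ()  # e.g. ("gpu",) or ("fpga",)


def hot_migration_blockers(vm: Vm, target_host_type: str,
                           target_firmware: str) -> list[str]:
    """Return the reasons a VM cannot be hot-migrated; empty means eligible."""
    blockers = []
    if vm.host_type != target_host_type:
        blockers.append("host type differs")
    if vm.host_firmware != target_firmware:
        blockers.append("firmware differs")
    if vm.uses_local_disk:
        blockers.append("uses local storage")
    if vm.passthrough_devices:
        blockers.append("passthrough devices: " + ", ".join(vm.passthrough_devices))
    return blockers
```

Returning every blocker rather than stopping at the first makes it easier to decide whether a VM needs a planned reboot or a redeployment instead.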

3.4 Cloud Provider Support

| Provider | Hot Migration (Cloud Disk VM) | Planned Maintenance | Unplanned Incident | Local Disk Swap | Local Disk Redeploy |
| --- | --- | --- | --- | --- | --- |
| Alibaba Cloud | No reboot (DingTalk request) | API/Console reboot → auto‑migrate | Crash → auto‑migrate | No reboot (data cleared) | Redeploy (data cleared) |
| Huawei Cloud | No reboot (WeChat request) | API/Console reboot → auto‑migrate | Crash → auto‑migrate | Partial/no reboot, hot‑plug supported | Redeploy (data cleared) |
| Tencent Cloud | No reboot (WeChat request) | API/Console reboot → auto‑migrate | Crash → auto‑migrate | No reboot (data cleared) | Redeploy (data cleared) |
| AWS | Not supported | API/Console → stop → start | Crash → auto‑migrate | Not applicable | |
| Azure | Not supported | API/Console → stop → start | Crash → auto‑migrate | Not applicable | |

4. LLC Contention and Its Impact

LLC (Last Level Cache, L3) is shared among cores. Contention can raise CPU usage on a node by 10‑20% and increase application response time (RT). Restarting services or the VM often does not resolve the issue; hot migration to a less‑loaded host restores normal performance.
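Because a tenant usually cannot observe cache behavior on the host directly, spotting likely contention tends to rely on comparing replicas of the same service. The sketch below is only a heuristic under stated assumptions: the replicas receive similar traffic, and the thresholds (10 percentage points of CPU, 20% higher RT) are illustrative values loosely based on the figures above. `llc_suspects` is a hypothetical helper.

```python
import statistics


def llc_suspects(nodes: dict[str, tuple[float, float]],
                 cpu_gap_pts: float = 10.0, rt_ratio: float = 1.2) -> list[str]:
    """Flag replicas whose CPU and RT both stand out from their peers.

    nodes: name -> (cpu_percent, rt_ms) for replicas under similar load.
    """
    cpu_median = statistics.median(cpu for cpu, _ in nodes.values())
    rt_median = statistics.median(rt for _, rt in nodes.values())
    return sorted(
        name for name, (cpu, rt) in nodes.items()
        if cpu - cpu_median >= cpu_gap_pts and rt >= rt_median * rt_ratio
    )
```

A flagged node is a candidate for hot migration rather than proof of LLC contention. If the anomaly disappears after the VM lands on a different host, a noisy neighbor on the original host is the likely cause.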

4.1 LLC Is Independent of Instance Size

Both small (4C 8G) and large (32C 128G) instances placed on the same physical host can suffer from LLC contention; larger specs do not guarantee immunity.

4.2 CPU Architecture Relationship

Intel x86 CPUs share L3 cache across cores.

AMD EPYC CPUs use a CCD/CCX design with shared 16 MiB L3 per CCX.

ARM servers (e.g., Alibaba Yitian 710) provide dedicated physical cores without hyper‑threading.

5. Ongoing Collaboration with Cloud Vendors

HuoLala regularly communicates with cloud providers to set joint 1‑3 year goals and ensure continuous improvement of the cloud environment.
