How Multi‑Tunnel Architecture Resolved Physical Cloud Traffic Overload
This article details how UCloud tackled severe traffic overload in its physical cloud gateway caused by hash polarization, introducing a multi‑tunnel solution, capacity management, isolation‑zone migration, and automated operations to achieve high availability and support hundreds of gigabits of traffic.
Background
UCloud's physical cloud hosts are dedicated servers offering high compute performance for core applications. Physical cloud gateways enable internal communication between physical and public cloud products, handling massive cross‑region, cross‑cluster traffic.
Problem: Hash Polarization and Overload
Monitoring revealed that gateway device e in cluster 2 was overloaded while other devices were underutilized, with most of the traffic originating from cluster 1. The root cause was hash polarization: all of this traffic was encapsulated in a single tunnel, so every packet presented the same tunnel header to the hash algorithm, which produced identical results and concentrated the load on a few devices.
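The effect can be reproduced with a minimal sketch. It assumes a load balancer that hashes only the outer tunnel addresses; the CRC32 hash, the IP addresses, and the device count are illustrative assumptions, not UCloud's actual algorithm or topology.

```python
import zlib

def pick_device(outer_src: str, outer_dst: str, n_devices: int) -> int:
    """Pick a device index by hashing only the outer (tunnel) addresses."""
    key = f"{outer_src}->{outer_dst}".encode()
    return zlib.crc32(key) % n_devices

def spread(flows, n_devices: int):
    """Count how many flows land on each device."""
    counts = [0] * n_devices
    for outer_src, outer_dst in flows:
        counts[pick_device(outer_src, outer_dst, n_devices)] += 1
    return counts

# 1000 inner flows wrapped in one tunnel: every packet has the same outer header.
# The addresses below are made up for illustration.
single_tunnel_flows = [("10.0.0.1", "10.0.1.1")] * 1000
polarized = spread(single_tunnel_flows, n_devices=8)
```

With a single tunnel, every flow hashes to the same device, so one entry of `polarized` holds all 1000 flows and the other seven stay at zero.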
Solution 1: Multi‑Tunnel Approach
To break the single‑tunnel limitation, each gateway now binds a range of tunnel IPs. Traffic is hashed based on inner packet information, and a tunnel SIP/DIP is selected from the pre‑allocated range, distributing flows across multiple tunnels and effectively scattering traffic.
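A minimal sketch of this selection step, assuming the inner 5-tuple is the hash input and the tunnel ranges are contiguous IPv4 blocks. The names `InnerFlow`, `tunnel_range`, and `select_tunnel`, the CRC32 hash, and the addresses are hypothetical and chosen for illustration only.

```python
import ipaddress
import zlib
from typing import List, NamedTuple, Tuple

class InnerFlow(NamedTuple):
    src_ip: str
    dst_ip: str
    proto: int
    src_port: int
    dst_port: int

def tunnel_range(first_ip: str, count: int) -> List[str]:
    """Build a contiguous range of tunnel IPs bound to one gateway."""
    base = ipaddress.ip_address(first_ip)
    return [str(base + i) for i in range(count)]

def select_tunnel(flow: InnerFlow, local_ips: List[str],
                  remote_ips: List[str]) -> Tuple[str, str]:
    """Hash the inner packet fields and pick a tunnel SIP/DIP from the ranges."""
    key = "|".join(map(str, flow)).encode()
    h = zlib.crc32(key)
    sip = local_ips[h % len(local_ips)]
    # Use a different part of the hash so SIP and DIP vary independently.
    dip = remote_ips[(h // len(local_ips)) % len(remote_ips)]
    return sip, dip

local_ips = tunnel_range("10.0.0.1", 16)   # assumed range size
remote_ips = tunnel_range("10.0.1.1", 16)  # assumed range size
flows = [InnerFlow("192.168.0.5", "192.168.1.9", 6, 40000 + i, 443)
         for i in range(1000)]
tunnels_used = {select_tunnel(f, local_ips, remote_ips) for f in flows}
```

Because the selection depends only on the inner flow, packets of one flow always take the same tunnel, while different flows spread across many SIP/DIP pairs and therefore hash differently downstream.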
Preventing "Elephant Flow"
When a single user generates massive traffic, even multiple tunnels may be insufficient. UCloud mitigates this by increasing gateway capacity and by using lossless migration to the isolation zone, which automatically redirects excess traffic there and verifies the migration result with strict checks.
Capacity Management and Isolation Zone
Gateways are provisioned with more bandwidth than the physical cloud hosts (for example, raising per-node capacity from 10 Gbps to 25 Gbps) so they can absorb sudden spikes. The isolation zone, which normally carries no traffic, takes the overflow when monitoring detects a risk of overload.
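The overflow decision might look like the sketch below. The article does not describe the actual policy, so the 80% threshold, the heaviest-first ordering, and the function name `plan_migration` are all assumptions made for illustration.

```python
def plan_migration(user_gbps: dict, capacity_gbps: float,
                   threshold: float = 0.8) -> list:
    """Return the users to move to the isolation zone.

    Users are moved heaviest first (an assumed policy) until the
    remaining load is within threshold * capacity.
    """
    limit = capacity_gbps * threshold
    load = sum(user_gbps.values())
    moved = []
    for user, rate in sorted(user_gbps.items(),
                             key=lambda kv: kv[1], reverse=True):
        if load <= limit:
            break
        moved.append(user)
        load -= rate
    return moved

# Hypothetical per-user rates on a 25 Gbps gateway node.
plan = plan_migration({"a": 18.0, "b": 3.0, "c": 2.0}, capacity_gbps=25.0)
```

Here the total load of 23 Gbps exceeds the assumed 20 Gbps limit, so the heaviest user is selected for migration and the remaining users stay on the regular gateway.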
High Availability Upgrade
During upgrades, a gray‑deployment strategy is used: a new cluster is deployed, traffic is gradually migrated, and if issues arise, services can be rolled back to the old cluster. This reduces impact scope and ensures continuity.
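The gradual migration with rollback can be sketched as a simple loop. The step percentages and the `healthy` check are hypothetical placeholders; the article does not specify how UCloud stages or validates each step.

```python
from typing import Callable, List, Tuple

def gray_migrate(steps: List[int],
                 healthy: Callable[[int], bool]) -> Tuple[int, List[int]]:
    """Shift traffic to the new cluster step by step.

    steps:   increasing percentages of traffic on the new cluster.
    healthy: check run after each step; False triggers a full rollback.
    Returns the final percentage on the new cluster and the step history.
    """
    history = []
    for pct in steps:
        history.append(pct)
        if not healthy(pct):
            history.append(0)  # roll all traffic back to the old cluster
            return 0, history
    return (steps[-1] if steps else 0), history

ok = gray_migrate([5, 25, 50, 100], lambda pct: True)
failed = gray_migrate([5, 25, 50, 100], lambda pct: pct < 50)
```

Keeping the old cluster intact until the last step succeeds is what makes the rollback path cheap: a failed check only has to move traffic back, not rebuild anything.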
Risk Analysis and Automation
A review of past faults identified three main risks:
Human‑driven deployments increase the probability of faults.
Programs handle exceptions insufficiently.
Isolation between clusters is inadequate.
To address these, UCloud introduced automated operations separating configuration storage and deployment, enhanced validation and alerting (e.g., whitelist filtering before loading configurations), and isolation mechanisms to limit the impact of a faulty manager.
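As one reading of the whitelist check, the sketch below keeps only approved configuration keys before loading and reports the rest for alerting. The key names and the function `filter_config` are hypothetical; the article does not say what UCloud's whitelist actually contains.

```python
# Hypothetical set of keys a gateway is allowed to load.
ALLOWED_KEYS = frozenset({"cluster_id", "tunnel_ips", "capacity_gbps"})

def filter_config(raw: dict, allowed=ALLOWED_KEYS):
    """Split a raw configuration into accepted entries and rejected keys.

    Only whitelisted keys are passed on for loading; rejected keys are
    returned so the caller can raise an alert instead of applying them.
    """
    accepted = {k: v for k, v in raw.items() if k in allowed}
    rejected = sorted(k for k in raw if k not in allowed)
    return accepted, rejected

accepted, rejected = filter_config({
    "cluster_id": 2,
    "capacity_gbps": 25,
    "debug_drop_all": True,  # unexpected key that should never be loaded
})
```

Rejecting unknown entries before they reach the gateway limits the damage a faulty manager or a bad manual change can cause.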
Conclusion
The experience shows that tackling traffic overload requires both architectural changes—such as multi‑tunnel designs—and operational improvements like capacity planning, lossless migration, and automation. Ultimately, all technical solutions serve the business goal of reliable, high‑performance cloud services.
UCloud Tech
UCloud is a leading neutral cloud provider in China, developing its own IaaS, PaaS, AI service platform, and big data exchange platform, and delivering comprehensive industry solutions for public, private, hybrid, and dedicated clouds.
