Mitigating Hash Polarization and Elephant Flow in UCloud Physical Cloud Gateway Clusters: Multi‑Tunnel and Capacity Management Solutions
This article presents a detailed case study of how UCloud resolved hash polarization and elephant‑flow overload in physical cloud gateway clusters by deploying a multi‑tunnel traffic‑splitting strategy, expanding gateway capacity, implementing lossless isolation‑zone migration, and enhancing automation and high‑availability mechanisms, enabling the clusters to handle hundreds of gigabits of traffic during peak events.
Physical cloud hosts are dedicated servers offered by UCloud, providing high compute performance for core applications, while physical cloud gateways enable internal communication between physical and public cloud products across regions, leading to significant cross‑cluster traffic pressure.
Hash polarization caused severe traffic overload on gateway device E in cluster 2: traffic arriving from cluster 1 was heavily concentrated onto that one device, saturating its bandwidth while the other devices in the cluster remained underutilized.
Hash polarization occurs when a single tunnel encapsulates traffic, hiding original IP/MAC information; the hash algorithm then yields identical results, preventing effective load distribution and causing certain devices to be overwhelmed.
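The mechanism above can be illustrated with a small sketch. This is not UCloud's actual hash algorithm; it is a generic ECMP-style hash over the header fields a downstream device can see, showing that once a single tunnel hides the inner headers, every flow hashes to the same link:

```python
# Illustrative sketch only: a generic 5-tuple ECMP hash, not UCloud's algorithm.
import hashlib

def ecmp_hash(src_ip: str, dst_ip: str, src_port: int, dst_port: int, n_links: int) -> int:
    """Pick an egress link by hashing the header fields visible to the device."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return int.from_bytes(hashlib.md5(key).digest()[:4], "big") % n_links

# Before encapsulation: distinct inner flows spread across the links.
inner_flows = [("10.0.0.1", "10.0.1.9", port, 443) for port in range(10000, 10016)]
spread = {ecmp_hash(*flow, 4) for flow in inner_flows}

# After single-tunnel encapsulation: every packet carries the same outer
# header, so the hash result is identical for all flows -- polarization.
TUNNEL_SRC, TUNNEL_DST = "172.16.0.1", "172.16.0.2"
polarized = {ecmp_hash(TUNNEL_SRC, TUNNEL_DST, 4789, 4789, 4) for _ in inner_flows}

print(len(spread), len(polarized))  # multiple links used vs. exactly one
```

The addresses and port numbers are arbitrary placeholders; the point is only that identical outer headers collapse the hash space to a single result.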
Two main mitigation directions were explored: (1) dispersing user traffic to avoid hash polarization after tunnel encapsulation, and (2) protecting the network from "elephant flow" when traffic cannot be dispersed.
Solution 1 – Multi‑Tunnel Approach: Instead of a single‑tunnel mode, each gateway binds a range of tunnel IPs. By hashing on inner‑packet information and selecting source and destination IPs from the allocated range, traffic is spread across multiple tunnels, effectively breaking hash polarization.
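A minimal sketch of this idea follows, assuming illustrative IP pools and an MD5-based selector rather than UCloud's production scheme. Outer tunnel endpoints are derived from a hash of the inner 4-tuple, so the outer headers (and hence downstream ECMP hashes) now vary per flow:

```python
# Hedged sketch of the multi-tunnel idea; pools and hash choice are assumptions.
import hashlib
import ipaddress

# Each gateway binds a range of tunnel IPs (illustrative /29-sized pools).
SRC_TUNNEL_POOL = [str(ipaddress.IPv4Address("172.16.0.0") + i) for i in range(8)]
DST_TUNNEL_POOL = [str(ipaddress.IPv4Address("172.16.1.0") + i) for i in range(8)]

def pick_tunnel(inner_src: str, inner_dst: str, inner_sport: int, inner_dport: int):
    """Derive outer tunnel endpoints from a hash of the inner 4-tuple."""
    key = f"{inner_src}|{inner_dst}|{inner_sport}|{inner_dport}".encode()
    digest = hashlib.md5(key).digest()
    src = SRC_TUNNEL_POOL[digest[0] % len(SRC_TUNNEL_POOL)]
    dst = DST_TUNNEL_POOL[digest[1] % len(DST_TUNNEL_POOL)]
    return src, dst

# Different inner flows now map to different outer tunnels, while any
# single flow always maps to the same tunnel (no packet reordering).
tunnels = {pick_tunnel("10.0.0.1", "10.0.1.9", port, 443) for port in range(10000, 10032)}
print(len(tunnels))  # more than one tunnel in use: polarization is broken
```

Hashing on the inner headers keeps per-flow stability (packets of one flow stay on one tunnel) while restoring entropy to the outer headers.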
Solution 2 – Preventing Elephant Flow: Even with multiple tunnels, extremely large flows can still overload the network. UCloud therefore employs (1) per‑gateway capacity management to ensure gateway bandwidth exceeds the aggregate bandwidth of hosted physical cloud hosts, and (2) isolation‑zone lossless migration that automatically redirects excess traffic to a zero‑traffic isolation zone and validates migration results.
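The two safeguards can be sketched as follows. The function names, thresholds, and greedy largest-flow-first policy are illustrative assumptions, not the actual migration logic:

```python
# Hedged sketch: capacity check plus isolation-zone migration with validation.
# Names, numbers, and the largest-flow-first policy are illustrative assumptions.

def needs_expansion(gateway_capacity_gbps: float, host_bandwidths_gbps: list) -> bool:
    """Capacity management: gateway bandwidth must exceed the hosts' aggregate."""
    return sum(host_bandwidths_gbps) >= gateway_capacity_gbps

def migrate_excess(flows: dict, capacity_gbps: float) -> tuple:
    """Redirect the largest ('elephant') flows to the isolation zone until the
    remaining traffic fits, then validate the migration result."""
    kept, isolated = dict(flows), {}
    for name, gbps in sorted(flows.items(), key=lambda kv: kv[1], reverse=True):
        if sum(kept.values()) <= capacity_gbps:
            break
        isolated[name] = kept.pop(name)
    assert sum(kept.values()) <= capacity_gbps  # validation step
    return kept, isolated

kept, isolated = migrate_excess({"flow-a": 60.0, "flow-b": 15.0, "flow-c": 10.0}, 40.0)
print(sorted(isolated))  # the elephant flow is redirected to the isolation zone
```

Here a 60 Gbps elephant flow is moved out so the remaining 25 Gbps fits under the 40 Gbps capacity; the final assertion stands in for the "validates migration results" step.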
After deploying the new multi‑tunnel solution, cluster capacity increased from tens of gigabits per second to over a hundred gigabits per second, successfully smoothing traffic spikes such as Dada's 60‑100 Gbps surge during the Double‑Eleven shopping festival.
High‑Availability Optimizations: The upgrade introduced a gray‑release (canary) process in which the new cluster is rolled out gradually and traffic can be migrated back to the old cluster if issues arise, minimizing impact. However, a misconfiguration caused the new cluster's manager to mistakenly take over the old cluster, leading to high‑availability anomalies.
Risk analysis identified three main causes: excessive manual intervention, insufficient exception protection in programs, and inadequate isolation between clusters.
Optimization Measures:
Automation of operations to replace manual steps, separating configuration entry from deployment.
Enhanced validation and alerting, including whitelist filtering before loading configurations.
Isolation of impact by removing common dependencies (e.g., assigning different managers to different clusters) and establishing isolation zones to limit fault propagation.
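The second measure, whitelist filtering before loading configurations, can be sketched as a simple gate. The cluster names and entry format are hypothetical; the point is that a manager rejects (and alerts on) configuration for clusters it does not own, preventing the mistaken-takeover incident described above:

```python
# Hedged sketch of whitelist filtering before config load; names are hypothetical.

MANAGED_CLUSTERS = {"cluster-new"}  # this manager's whitelist

def load_configs(entries: list) -> list:
    """Accept only entries targeting a whitelisted cluster; alert on the rest."""
    accepted, rejected = [], []
    for entry in entries:
        (accepted if entry["cluster"] in MANAGED_CLUSTERS else rejected).append(entry)
    for entry in rejected:
        print(f"ALERT: dropped config for foreign cluster {entry['cluster']!r}")
    return accepted

accepted = load_configs([
    {"cluster": "cluster-new", "vip": "192.0.2.10"},
    {"cluster": "cluster-old", "vip": "192.0.2.20"},  # would have caused takeover
])
print(len(accepted))  # only the whitelisted cluster's config is loaded
```

Combined with per-cluster managers (measure 3), a foreign-cluster entry is both rejected at load time and impossible to act on, giving two independent layers of isolation.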
In conclusion, as system architectures become more complex, technical solutions must continuously evolve to serve business needs; the experiences and practices shared here aim to provide valuable insights for engineers facing similar challenges.
Architects' Tech Alliance
Sharing project experiences, insights into cutting-edge architectures, focusing on cloud computing, microservices, big data, hyper-convergence, storage, data protection, artificial intelligence, industry practices and solutions.