
Design and Implementation of a Multi‑Layer Load‑Balancing Platform (VGW)

This article explains how to build a reliable, high‑performance load‑balancing platform by analyzing basic reliability requirements, introducing multi‑layer (DNS, L4, L7) balancing, comparing Direct‑Route, Tunnel and FULLNAT modes, and describing the VGW architecture, health‑check, fault isolation, redundancy and DPDK‑based performance optimizations.

Architecture Digest

In large‑scale business scenarios a single server cannot handle the traffic, so load balancing becomes essential; the article starts by defining the two core problems: distributing requests among servers and isolating faulty servers.

It then discusses why a dedicated load‑balancing platform is needed, reviewing DNS‑based, Nginx (L7) and LVS (L4) approaches, and highlighting the limitations of each when deployed as clusters.

A three‑tier load‑balancing chain is proposed, in traffic order: network‑layer devices → L4 (LVS) → L7 (Nginx) → business layer, illustrating how each tier can detect failures in, and provide redundancy for, the tier it forwards to.

The focus shifts to L4 load balancing, presenting three traffic‑forwarding modes (Direct‑Route (DR), Tunnel, and FULLNAT), detailing their mechanisms, advantages, and drawbacks, and concluding that FULLNAT best fits the described non‑virtualized environment.
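The defining trait of FULLNAT is that the balancer rewrites both ends of the connection: the destination becomes a chosen real server, and the source becomes a local address of the balancer, so return traffic naturally flows back through it. A minimal toy sketch of that session bookkeeping (addresses, port pool, and helper names here are illustrative, not from the article):

```python
# Toy FULLNAT session table: rewrite both source and destination of an
# inbound packet, and restore the original client on the return path.
import hashlib
import itertools

LOCAL_IP = "10.0.0.2"                         # balancer-local source address (illustrative)
REAL_SERVERS = ["192.168.1.10", "192.168.1.11"]

_local_ports = itertools.count(30000)          # pool of local ports for source NAT
_sessions = {}                                 # (local_ip, local_port) -> (client_ip, client_port)

def pick_server(client_ip: str, client_port: int) -> str:
    """Deterministically pick a real server from the client 2-tuple."""
    h = int(hashlib.md5(f"{client_ip}:{client_port}".encode()).hexdigest(), 16)
    return REAL_SERVERS[h % len(REAL_SERVERS)]

def fullnat_out(client_ip, client_port, vip, vport):
    """Inbound: packet leaves as LOCAL_IP:local_port -> real_server:vport."""
    local_port = next(_local_ports)
    server = pick_server(client_ip, client_port)
    _sessions[(LOCAL_IP, local_port)] = (client_ip, client_port)
    return (LOCAL_IP, local_port, server, vport)

def fullnat_back(local_ip, local_port, vip, vport):
    """Return path: look up the session and restore the original client."""
    client_ip, client_port = _sessions[(local_ip, local_port)]
    return (vip, vport, client_ip, client_port)
```

Because real servers only ever see the balancer's local addresses, they need no special routing or kernel configuration, which is one reason FULLNAT suits heterogeneous, non‑virtualized fleets.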

Health‑check mechanisms are described: simple TCP/UDP port probing to quickly discard unhealthy back‑ends without heavy protocol‑specific checks.
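A plain TCP port probe of this kind is a few lines of socket code; a minimal sketch (host, port, and timeout values are illustrative):

```python
# Simple TCP port probe: a backend is "healthy" if the handshake completes.
import socket

def tcp_probe(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def filter_healthy(backends, port=80):
    """Keep only backends whose service port currently accepts connections."""
    return [b for b in backends if tcp_probe(b, port)]
```

The trade-off is deliberate: a completed handshake proves the service port is listening without the balancer needing to speak every application protocol.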

Fault isolation relies on BGP‑advertised VIPs; removing a VIP from a server’s BGP announcement isolates that server, while withdrawing the VIP from the whole cluster redirects traffic to a standby cluster.
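The two isolation granularities can be sketched as a toy controller that tracks which servers announce a VIP; in production this logic would drive a real BGP speaker via its API, and the class and method names here are illustrative:

```python
# Toy model of BGP-based fault isolation: per-server withdrawal shifts
# traffic to peers; cluster-wide withdrawal fails over to a standby cluster
# (which announces the same VIP at lower routing preference).
class VipController:
    def __init__(self, vip: str):
        self.vip = vip
        self.announcing = set()       # servers currently advertising the VIP

    def announce(self, server: str):
        self.announcing.add(server)

    def isolate_server(self, server: str):
        """Withdraw one server's announcement; its peers keep serving."""
        self.announcing.discard(server)

    def isolate_cluster(self):
        """Withdraw the VIP from every server; upstream routers converge
        on the standby cluster's announcement instead."""
        self.announcing.clear()

    def reachable_here(self) -> bool:
        return bool(self.announcing)
```

The appeal of this scheme is that isolation is just routing: no client reconfiguration is needed at either granularity.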

The VGW (vivo Gateway) implementation is introduced, consisting of three core modules: load‑balancing forwarder, health‑check, and routing control. Logical and physical architectures are shown, including dual‑NIC “dual‑arm” mode for external VGW and single‑NIC “single‑arm” mode for internal VGW.

Redundancy strategies cover server‑level, link‑level, process‑level, and manual isolation, with monitoring nodes establishing VIP connections to assess overall health.

Performance challenges of handling millions of packets per second are addressed by offloading packet processing to DPDK (via a customized DPVS), eliminating kernel interrupts and reducing copy overhead, achieving over 1 M new connections per second (CPS) and over 12 M packets per second (PPS) on 100 GbE NICs.
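A back-of-envelope calculation shows why kernel interrupts and per-packet copies are untenable at these rates; the frame-size and overhead figures below are standard Ethernet constants, not numbers from the article:

```python
# Theoretical packet rate of a 100 GbE link, and the per-packet CPU budget
# at the quoted 12 Mpps figure.
LINK_BPS = 100e9
PREAMBLE_AND_IFG = 8 + 12            # per-frame wire overhead in bytes

def line_rate_pps(frame_bytes: int) -> float:
    """Maximum packets/s for a given frame size, including wire overhead."""
    bits_per_frame = (frame_bytes + PREAMBLE_AND_IFG) * 8
    return LINK_BPS / bits_per_frame

def ns_per_packet(pps: float, cores: int = 1) -> float:
    """CPU time budget per packet per core, in nanoseconds."""
    return 1e9 / (pps / cores)
```

For 64‑byte frames, `line_rate_pps(64)` is roughly 148.8 Mpps, and at 12 Mpps a single core has only about 83 ns per packet; a single interrupt or buffer copy can consume that entire budget, which is the case for DPDK's poll-mode, zero-interrupt design.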

The article concludes that the described multi‑layer, FULLNAT‑based VGW meets Vivo’s reliability and scalability needs, while acknowledging future work on new protocols and decentralized data‑center models.

Tags: network architecture, high availability, load balancing, DPDK, BGP, VGW, FULLNAT
Written by

Architecture Digest

Focusing on Java backend development, covering application architecture from top-tier internet companies (high availability, high performance, high stability), big data, machine learning, Java architecture, and other popular fields.
