High‑Availability Design and Implementation of the BIGO Backbone Network
This article explains how BIGO’s backbone network achieves high availability at three levels: control‑plane HA, built on etcd's Raft‑based leader election and an intermediate BGP Route‑Reflection layer; data‑plane HA, built on MPLS SR‑Policy with redundant paths and fast failover; and business‑level HA, which combines traffic, optimization, and fault scheduling to keep services running without interruption.
In early 2022, BIGO published the first part of its backbone network design. It introduced version 2.0, in which the control plane is centralized in an SDN controller and the data plane uses MPLS SR‑Policy for intelligent traffic steering.
Control‑Plane High Availability is achieved by running the controller as a cluster that uses etcd (and its Raft consensus) for leader election, so that a standby takes over automatically and path computation continues. An intermediate layer based on BGP Route Reflection, Graceful Restart, and Long‑Lived Graceful Restart decouples the controller from the data plane, so the network keeps forwarding on the paths it has already learned even if the controller fails.
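The leader‑election behaviour described above can be sketched with a lease that only one controller holds at a time. This is a minimal in‑memory simulation, not BIGO's implementation and not the etcd client API: the `LeaseStore` class, the controller names, and the 5‑second TTL are all hypothetical, chosen only to show how a standby takes over once the leader stops renewing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseStore:
    """In-memory stand-in for a consistent store that grants one leader lease.

    Hypothetical helper for illustration; a real deployment would rely on
    etcd's leases and elections instead of this class.
    """
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, node: str, now: float, ttl: float) -> bool:
        """Grant or renew the lease if it is free, expired, or already ours."""
        if self.holder is None or now >= self.expires_at or self.holder == node:
            self.holder = node
            self.expires_at = now + ttl
            return True
        return False

    def leader(self, now: float) -> Optional[str]:
        """Return the current leader, or None once the lease has expired."""
        return self.holder if now < self.expires_at else None


TTL = 5.0  # seconds; assumed lease duration
store = LeaseStore()

# t=0: both controllers campaign, the first writer wins.
a_elected = store.try_acquire("controller-a", now=0.0, ttl=TTL)
b_elected = store.try_acquire("controller-b", now=0.0, ttl=TTL)

# t=3: the leader renews; t=6: the standby is still refused.
a_renewed = store.try_acquire("controller-a", now=3.0, ttl=TTL)
b_retry = store.try_acquire("controller-b", now=6.0, ttl=TTL)

# controller-a crashes and stops renewing; at t=9 the lease (valid until t=8)
# has expired, so the standby becomes leader and resumes path computation.
b_takeover = store.try_acquire("controller-b", now=9.0, ttl=TTL)
leader_after_failover = store.leader(now=9.0)
```

The lease TTL bounds how long the cluster can be without a leader: a shorter TTL speeds up failover but makes the leader more sensitive to a delayed renewal.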
Data‑Plane High Availability relies on redundant devices and links, on classifying faults by where they occur (ingress, transit, or egress node), and on fast‑failover techniques: Path Protection, which switches to a pre‑computed disjoint backup path, and Facility Protection (Ti‑LFA), which locally bypasses a failed link or node within milliseconds. Each MPLS SR‑Policy carries multiple candidate paths (high, medium, and low priority) and uses sBFD for rapid failure detection.
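Selecting among prioritized candidate paths can be illustrated with a short sketch. It assumes the usual SR Policy convention that the live candidate path with the highest preference becomes active; the `CandidatePath` type, the preference values, and the label values are illustrative and not taken from BIGO's configuration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidatePath:
    name: str
    preference: int            # higher value wins
    segment_list: List[int]    # MPLS labels (made-up values)
    sbfd_up: bool = True       # liveness as reported by an sBFD session


def select_active(paths: List[CandidatePath]) -> Optional[CandidatePath]:
    """Return the live candidate path with the highest preference, if any."""
    live = [p for p in paths if p.sbfd_up]
    return max(live, key=lambda p: p.preference, default=None)


policy = [
    CandidatePath("high", 300, [16001, 16005]),
    CandidatePath("medium", 200, [16002, 16005]),
    CandidatePath("low", 100, [16003, 16004, 16005]),
]

active_before = select_active(policy).name

# sBFD declares the high-priority path down: traffic falls back to "medium".
policy[0].sbfd_up = False
active_after = select_active(policy).name
```

If every candidate path is down, `select_active` returns `None`, which corresponds to the policy becoming invalid and traffic following ordinary routing.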
Business‑Level High Availability focuses on keeping services running by guaranteeing sufficient network resources and optimal quality. BIGO’s controller performs three basic scheduling functions: traffic scheduling (reacting to link overload), optimization scheduling (periodically re‑optimizing SR‑Policy paths), and fault scheduling (immediate rerouting after failures). Telemetry collects flow, latency, loss, and jitter metrics, which are translated into MOS scores to drive quality‑aware path adjustments.
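How latency, jitter, and loss might be folded into a MOS score can be sketched as follows. The article does not give BIGO's formula, so this uses a commonly circulated simplification of the ITU‑T G.107 E‑model as an assumption: the latency and loss penalties are the approximate part, while the final R‑to‑MOS mapping follows G.107. The function names and the sample measurements are hypothetical.

```python
from typing import Dict, Tuple


def mos_score(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    """Estimate a MOS value (1.0 to 4.5) from one-way path measurements."""
    # Jitter is weighted double and 10 ms is added for processing delay
    # (assumed simplification, not an exact E-model computation).
    effective_latency = latency_ms + 2.0 * jitter_ms + 10.0
    if effective_latency < 160.0:
        r = 93.2 - effective_latency / 40.0
    else:
        r = 93.2 - (effective_latency - 120.0) / 10.0
    r -= 2.5 * loss_pct                 # each 1% of loss costs 2.5 R points
    r = max(0.0, min(100.0, r))
    # R-factor to MOS mapping from ITU-T G.107.
    mos = 1.0 + 0.035 * r + 7.0e-6 * r * (r - 60.0) * (100.0 - r)
    return max(1.0, min(4.5, mos))


def best_path(measurements: Dict[str, Tuple[float, float, float]]) -> str:
    """Pick the path name whose (latency, jitter, loss) yields the highest MOS."""
    return max(measurements, key=lambda name: mos_score(*measurements[name]))


# Hypothetical telemetry samples: (latency ms, jitter ms, loss %).
samples = {
    "path-direct": (200.0, 30.0, 3.0),
    "path-detour": (20.0, 2.0, 0.0),
}
preferred = best_path(samples)
```

A quality-aware scheduler could compare the score of the active path against the alternatives and move an SR‑Policy only when the difference exceeds some threshold, to avoid flapping between paths of similar quality.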
The article also presents a detailed fault‑handling workflow: devices detect a failure, Ti‑LFA immediately reroutes traffic, the controller learns the fault via BGP‑LS, computes new SR‑Policy paths, and gradually replaces old paths without packet loss, typically completing convergence within 20 seconds.
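The controller side of this workflow can be sketched as two steps: recompute a path that avoids the failed link, then replace the old path make‑before‑break. The topology, link costs, and the `policy` dictionary layout below are hypothetical, and plain Dijkstra stands in for whatever constrained path computation the real controller runs.

```python
import heapq
from typing import Dict, FrozenSet, List, Optional, Set

Graph = Dict[str, Dict[str, int]]


def shortest_path(graph: Graph, src: str, dst: str,
                  failed: Set[FrozenSet[str]]) -> Optional[List[str]]:
    """Dijkstra over the topology, skipping links reported as failed."""
    dist = {src: 0}
    prev: Dict[str, str] = {}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, cost in graph.get(node, {}).items():
            if frozenset((node, nbr)) in failed:
                continue
            nd = d + cost
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heapq.heappush(heap, (nd, nbr))
    if dst not in dist:
        return None
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def make_before_break(policy: dict, new_path: List[str]) -> List[str]:
    """Install the new path, switch traffic to it, then withdraw the old one."""
    old_path = policy["active"]
    policy["installed"].append(new_path)
    policy["active"] = new_path
    policy["installed"].remove(old_path)
    return ["install new", "switch traffic", "withdraw old"]


topology: Graph = {
    "A": {"B": 10, "C": 15},
    "B": {"A": 10, "D": 10},
    "C": {"A": 15, "D": 15},
    "D": {"B": 10, "C": 15},
}
primary = shortest_path(topology, "A", "D", failed=set())
policy = {"active": primary, "installed": [primary]}

# The controller learns that link B-D is down (for example via BGP-LS).
# Ti-LFA is already protecting traffic, so the old path stays in place
# until the replacement has been installed.
replacement = shortest_path(topology, "A", "D", failed={frozenset(("B", "D"))})
steps = make_before_break(policy, replacement)
```

Withdrawing the old path only after traffic has moved is what lets the replacement happen without packet loss.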
Overall, BIGO’s backbone network demonstrates that a tightly integrated SDN controller, MPLS SR‑Policy, and coordinated scheduling can provide millisecond‑level local failover, controller‑driven re‑convergence within tens of seconds, robust fault tolerance, and sustained service quality across a global infrastructure.
High Availability Architecture
Official account for High Availability Architecture.