High‑Availability Design and Implementation of the BIGO Backbone Network

This article explains how BIGO’s backbone network achieves high availability through a three‑layer design: control‑plane HA using ETCD‑based Raft leader election and an intermediate BGP Route‑Reflection layer, data‑plane HA with MPLS SR‑Policy and fast‑failover protection, and business‑level HA that combines traffic, optimization, and fault scheduling to keep services running through failures.

High Availability Architecture

Dec 2, 2022

In early 2022 BIGO published the first part of its backbone network design, introducing version 2.0, in which an SDN controller centralizes the control plane and the data plane uses MPLS SR‑Policy for intelligent traffic steering.

Control‑Plane High Availability is achieved by running the controller as a cluster that uses ETCD’s Raft consensus for leader election, so that a standby controller takes over automatically and path computation continues. An intermediate layer built on BGP Route Reflection, Graceful Restart, and Long‑Lived Graceful Restart decouples the controller from the data plane, so the network keeps forwarding traffic even if the controller fails.
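The leader‑election idea can be illustrated with a small sketch. It does not use a real ETCD client: `LeaseStore` is a hypothetical in‑memory stand‑in for a consistent key‑value store with expiring leases, and the key name, controller names, and TTL values are assumptions chosen for illustration. Time is passed in explicitly so the failover is easy to follow.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseStore:
    """Hypothetical in-memory stand-in for a consistent KV store with TTL leases."""

    def __init__(self) -> None:
        self._leases: Dict[str, Lease] = {}

    def campaign(self, key: str, candidate: str, ttl: float, now: float) -> bool:
        """Try to become (or remain) leader; True if `candidate` holds the key."""
        lease = self._leases.get(key)
        if lease is None or lease.expires_at <= now or lease.holder == candidate:
            # Key is free, expired, or already ours: write it with a fresh TTL.
            self._leases[key] = Lease(candidate, now + ttl)
            return True
        return False

    def leader(self, key: str, now: float) -> Optional[str]:
        """Return the current leader, or None if no unexpired lease exists."""
        lease = self._leases.get(key)
        if lease is not None and lease.expires_at > now:
            return lease.holder
        return None


store = LeaseStore()
KEY = "/controller/leader"  # hypothetical key name

# t=0: both controllers campaign; the first writer wins.
first = store.campaign(KEY, "ctrl-a", ttl=5.0, now=0.0)
second = store.campaign(KEY, "ctrl-b", ttl=5.0, now=0.0)

# t=3: ctrl-a renews its lease, so it stays leader until t=8.
store.campaign(KEY, "ctrl-a", ttl=5.0, now=3.0)

# ctrl-a crashes and stops renewing; at t=9 the lease has expired
# and the standby's next campaign succeeds.
took_over = store.campaign(KEY, "ctrl-b", ttl=5.0, now=9.0)
print(store.leader(KEY, now=9.0))
```

In a real deployment the store itself would be replicated through Raft, which is what makes the single "who holds the key" answer trustworthy across controller nodes.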

Data‑Plane High Availability relies on redundant devices and links, fault classification (ingress, transit, egress), and fast‑failover techniques: Path Protection (pre‑computed disjoint backup paths) and Facility Protection (Ti‑LFA), which steer traffic around a failure within milliseconds. MPLS SR‑Policy encodes multiple candidate paths (high, medium, low priority) and uses Seamless BFD (sBFD) for rapid failure detection.
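A minimal sketch of candidate‑path selection follows, assuming the head‑end simply activates the live path with the highest preference. The `CandidatePath` type, the node names, and the preference values are hypothetical, and the `up` flag stands in for whatever liveness signal sBFD would deliver.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CandidatePath:
    name: str
    preference: int        # higher value wins
    segments: List[str]    # hypothetical node names, head-end to tail-end
    up: bool = True        # liveness, as a detector such as sBFD would report it


def select_active(paths: List[CandidatePath]) -> Optional[CandidatePath]:
    """Return the live candidate path with the highest preference, if any."""
    live = [p for p in paths if p.up]
    return max(live, key=lambda p: p.preference, default=None)


def transit_disjoint(a: CandidatePath, b: CandidatePath) -> bool:
    """True if the two paths share no intermediate node (endpoints excluded)."""
    return not set(a.segments[1:-1]) & set(b.segments[1:-1])


policy = [
    CandidatePath("high", 300, ["PE1", "P1", "P2", "PE2"]),
    CandidatePath("medium", 200, ["PE1", "P3", "P4", "PE2"]),
    CandidatePath("low", 100, ["PE1", "P5", "PE2"]),
]

before = select_active(policy).name

# The detector declares the high-priority path down; the next one takes over.
policy[0].up = False
after = select_active(policy).name

print(before, "->", after)
```

The `transit_disjoint` check reflects why Path Protection pre‑computes disjoint backups: a backup that shares a transit node with the primary would fail along with it.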

Business‑Level High Availability focuses on keeping services running by guaranteeing sufficient network resources and optimal quality. BIGO’s controller performs three basic scheduling functions: traffic scheduling (reacting to link overload), optimization scheduling (periodically re‑optimizing SR‑Policy paths), and fault scheduling (immediate rerouting after failures). Telemetry collects flow, latency, loss, and jitter metrics, which are translated into MOS scores to drive quality‑aware path adjustments.
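The source does not say how BIGO converts telemetry into MOS, so the following is only an illustrative sketch. The R‑factor estimate is a widely used simplification (an assumption here, not BIGO's method), while the R‑to‑MOS mapping follows the ITU‑T G.107 E‑model. Path names and metric values are made up.

```python
from typing import Dict, Tuple


def r_factor(latency_ms: float, jitter_ms: float, loss_pct: float) -> float:
    """Simplified R-factor estimate (heuristic, assumed for illustration)."""
    effective = latency_ms + 2.0 * jitter_ms + 10.0
    if effective < 160.0:
        r = 93.2 - effective / 40.0
    else:
        r = 93.2 - (effective - 120.0) / 10.0
    r -= 2.5 * loss_pct  # each percent of loss costs 2.5 R points
    return max(0.0, min(100.0, r))


def mos_from_r(r: float) -> float:
    """Map an R-factor to MOS using the ITU-T G.107 formula, clamped to [1, 4.5]."""
    if r <= 0.0:
        return 1.0
    if r >= 100.0:
        return 4.5
    mos = 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)
    return max(1.0, min(4.5, mos))


# Hypothetical per-path telemetry: (latency ms, jitter ms, loss %).
telemetry: Dict[str, Tuple[float, float, float]] = {
    "via-A": (40.0, 2.0, 0.0),
    "via-B": (180.0, 15.0, 1.0),
}

scores = {name: mos_from_r(r_factor(*m)) for name, m in telemetry.items()}
best = max(scores, key=scores.get)
print(best, {k: round(v, 2) for k, v in scores.items()})
```

Collapsing three metrics into one score gives the scheduler a single value to compare across candidate paths, at the cost of hiding which metric caused a drop.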

The article also presents a detailed fault‑handling workflow: devices detect a failure, Ti‑LFA immediately reroutes traffic, the controller learns the fault via BGP‑LS, computes new SR‑Policy paths, and gradually replaces old paths without packet loss, typically completing convergence within 20 seconds.
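The controller side of this workflow can be sketched as a recomputation followed by a make‑before‑break swap. This is a simplified model under stated assumptions: the topology and metrics are invented, the `failed` set stands in for what the controller would learn via BGP‑LS, plain Dijkstra replaces whatever path computation BIGO actually uses, and Ti‑LFA's local repair is not modeled.

```python
import heapq
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Graph = Dict[str, Dict[str, int]]


def shortest_path(graph: Graph, src: str, dst: str,
                  failed: Optional[Set[FrozenSet[str]]] = None) -> Optional[List[str]]:
    """Dijkstra over links not listed in `failed`; returns a node list or None."""
    failed = failed or set()
    heap: List[Tuple[int, str, List[str]]] = [(0, src, [src])]
    visited: Set[str] = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nbr, metric in graph[node].items():
            if nbr not in visited and frozenset((node, nbr)) not in failed:
                heapq.heappush(heap, (cost + metric, nbr, path + [nbr]))
    return None


class HeadEnd:
    """Tracks installed paths and which one currently carries traffic."""

    def __init__(self, path: List[str]) -> None:
        self.installed = [path]
        self.active = path

    def replace(self, new_path: List[str]) -> None:
        self.installed.append(new_path)            # 1. install new path beside the old
        old, self.active = self.active, new_path   # 2. move traffic to the new path
        self.installed.remove(old)                 # 3. only then withdraw the old path


# Hypothetical four-node topology with symmetric link metrics.
graph: Graph = {
    "A": {"B": 10, "C": 15},
    "B": {"A": 10, "D": 10},
    "C": {"A": 15, "D": 15},
    "D": {"B": 10, "C": 15},
}

old_path = shortest_path(graph, "A", "D")
head_end = HeadEnd(old_path)

# The controller learns that link B-D failed and recomputes around it.
failed_links = {frozenset(("B", "D"))}
new_path = shortest_path(graph, "A", "D", failed_links)
head_end.replace(new_path)
print(old_path, "->", head_end.active)
```

Installing the new path before withdrawing the old one is what lets the replacement happen without dropping packets: at no point is the head‑end left without a usable path.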

Overall, BIGO’s backbone network demonstrates that a tightly integrated SDN controller, MPLS SR‑Policy, and coordinated scheduling can provide millisecond‑level local failover, robust fault tolerance, and sustained service quality across a global infrastructure.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
