Bilibili Data Center Network Design and Evolution (DCN V1.0 to V3.0)
Bilibili's network team designed and evolved its data‑center fabric from a stacked Layer‑2 V1.0 architecture through an M‑LAG EBGP‑based V2.0 design to a uniform box‑type V3.0 deployment, achieving greater stability, scalability, cost efficiency, and operational simplicity via extensive ARP, LACP, DHCP, hash, and BGP optimizations.
Author : Bilibili System Department Network Team
The team is responsible for planning, designing, building, operating, and optimizing Bilibili's data‑center network, covering intra‑DC networks, backbone, load balancing, transport, virtualized, and international networks.
1. Network Design Background
1.1 Stability and Scalability – Stability and scalability are the foundation of the network: packet loss, congestion, and the ease of capacity expansion set the practical ceiling on network quality. The design must start from traffic models and business requirements, then consider device selection, topology, protocols, and ecosystem evolution.
1.2 Bandwidth and Traffic Patterns – Modern data‑center workloads generate massive east‑west traffic (AI, big‑data clusters, multi‑active deployments). Traditional tree topologies, optimized for north‑south traffic, cannot efficiently handle such loads without costly upgrades.
1.3 Cost Optimization – Network CAPEX is a large share of IT costs. The team reduces cost by standardizing hardware, bulk purchasing, and introducing multiple vendors to drive competition, as well as defining uniform network parameters to enable white‑box solutions.
1.4 Efficient Operation – With growing device counts, fault domains increase. A full‑routing design with a single routing protocol (e.g., BGP) reduces fault scope and operational complexity.
2. Bilibili DCN Access‑Layer Evolution
2.1 DCN V1.0 Architecture
Both access and core layers used a stacked topology with the gateway on Spine devices, forming a large Layer‑2 broadcast domain.
Stacking provided redundancy but introduced two major risks:
Risk 1 – Software Risk – Bugs in vendor firmware require patches; upgrade mechanisms like ISSU have limited applicability and high complexity.
Risk 2 – Split Risk – Failure of stack cables can cause split‑brain situations, leading to duplicate configurations and traffic disruption.
Consequently, the stacked approach was abandoned.
2.2 DCN V2.0 Architecture
In early 2019 the team replaced stacking with M‑LAG (Multi‑Chassis Link Aggregation) for the access layer and removed stacking from the Spine layer. EBGP runs between Spine and Leaf, eliminating Layer‑2 broadcast storms.
While V2.0 solved the software‑upgrade risk and some scalability issues, it still faced challenges: the convergence (oversubscription) ratio was hard to monitor, chassis constraints limited Spine expansion, cabinets required high power capacity, the design still depended on heartbeat links, and OPEX/CAPEX remained elevated.
3. DCN V3.0 Architecture
By late 2020 the team evaluated next‑generation box‑type DCN solutions and completed validation in mid‑2021. V3.0 was deployed in a new data center in Q3 2021 and is now fully rolled out.
Key features of V3.0:
ASW (Access Switch), PSW (Pod/Aggregation Switch), and DSW (Distribution Switch) all use the same high‑performance switching chip (e.g., Tomahawk 3) and are fixed‑configuration box‑type devices.
Each Server‑Pod contains 4 PSW and 64 ASW for 10 GbE access, supporting up to 1,200 servers; 25 GbE pods use 8 PSW and 64 ASW with similar capacity.
No heartbeat cables; ASW pairs operate independently, providing true dual‑active ARP.
Box‑type design eliminates the need for rack power/air‑conditioning upgrades.
CAPEX is reduced while maintaining a 1:1 uplink/downlink convergence ratio.
Scalability is achieved via Cluster concepts and SuperSpine layers.
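The 1:1 convergence ratio above simply means downstream server bandwidth equals upstream fabric bandwidth at each tier. A minimal Python sketch of the arithmetic (the port counts below are illustrative assumptions, not Bilibili's actual ASW port layout):

```python
def oversubscription(down_ports, down_gbps, up_ports, up_gbps):
    """Downstream:upstream bandwidth ratio for one switch tier."""
    return (down_ports * down_gbps) / (up_ports * up_gbps)

# Hypothetical ASW configurations (port counts are illustrative only):
print(oversubscription(48, 25, 8, 100))  # 1.5 -> oversubscribed
print(oversubscription(32, 25, 8, 100))  # 1.0 -> the 1:1 target
```

With a 1:1 ratio the fabric can carry full east‑west load from every server simultaneously; ratios above 1 mean statistical oversubscription.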
3.1 Network Parameter Optimizations
3.1.1 ARP Optimization
Enable ARP proxy on ASW so the switch answers ARP requests between servers beneath it, keeping server‑to‑server traffic routed instead of bridged.
Configure identical MAC addresses on ASW VLAN interfaces for consistent forwarding.
Standardize ARP timeout values across devices.
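Standardized aging matters because mismatched timeouts make entries expire at different times on paired devices. As a toy illustration only (the timeout value and eviction behavior are simplified assumptions, not the switches' actual implementation), an ARP cache with a uniform timeout behaves like this:

```python
import time

class ArpCache:
    """Toy ARP cache with a single standardized aging timeout."""
    def __init__(self, timeout_s=300):  # timeout value is illustrative
        self.timeout_s = timeout_s
        self.entries = {}  # ip -> (mac, learned_at)

    def learn(self, ip, mac, now=None):
        self.entries[ip] = (mac, now if now is not None else time.time())

    def lookup(self, ip, now=None):
        now = now if now is not None else time.time()
        entry = self.entries.get(ip)
        if entry is None:
            return None
        mac, learned_at = entry
        if now - learned_at > self.timeout_s:
            del self.entries[ip]  # aged out; would trigger a fresh ARP
            return None
        return mac

cache = ArpCache(timeout_s=300)
cache.learn("10.0.1.5", "aa:bb:cc:dd:ee:ff", now=0)
print(cache.lookup("10.0.1.5", now=100))  # still fresh -> the MAC
print(cache.lookup("10.0.1.5", now=400))  # aged out -> None
```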
3.1.2 Storm Suppression
Aggregate server links on ASW to block Layer‑2 broadcast/multicast/unknown unicast, enforcing full‑routing.
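With full routing enforced, every flow is placed onto one of several equal‑cost paths by a hash of its headers. A minimal Python sketch of seeded five‑tuple ECMP selection (the hash function and per‑switch seed scheme are illustrative; real switches compute this in the forwarding ASIC):

```python
import hashlib

def pick_next_hop(five_tuple, next_hops, seed=0):
    """Hash the 5-tuple plus a per-switch seed and pick an ECMP next hop."""
    key = f"{seed}|{five_tuple}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]

flow = ("10.0.1.5", "10.0.2.9", 40123, 443, "tcp")
spine_links = ["psw1", "psw2", "psw3", "psw4"]
print(pick_next_hop(flow, spine_links, seed=1))
```

Keying the hash on the full five‑tuple keeps one flow on one link (no packet reordering), while a different seed per tier avoids hash polarization, where every switch along the path would otherwise make the same choice and leave some links idle.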
3.1.3 LACP Optimization
Use a consistent LACP System ID on ASW‑to‑server links.
Assign distinct LACP Device IDs to differentiate member ports.
Support both Bond Mode 4 (the default) and Bond Mode 0 for special workloads.
3.1.4 DHCP Optimization
Enable DHCP Option 82 on ASW and set the relay source IP to the loopback address.
3.1.5 Hash Optimization
Configure five‑tuple (srcIP, dstIP, srcPort, dstPort, protocol) load balancing to avoid hash polarization.
3.1.6 BGP Optimization
Deploy EBGP across the fabric with appropriate route‑type priorities.
Apply community attributes per business class.
Redistribute ARP entries as IPv4 host routes and enable BGP multipath (ECMP).
3.1.7 Server Network Configuration
Servers default to Bond Mode 4; some scenarios use Bond Mode 0.
Recompile the ARP kernel module so ARP is broadcast over both NICs.
On ASW link up/down events, trigger bond re‑aggregation and broadcast ARP to keep forwarding tables synchronized.
3.1.8 Failover Optimization
When all ASW uplinks are down, monitor‑link shuts down the downstream ports so traffic is not black‑holed; when the uplinks recover, the downstream ports are brought back up after a delay to allow routing to converge.
4. Future Outlook
The team will continue to track emerging networking technologies, aiming for higher stability, efficiency, and scalability to meet Bilibili's fast‑evolving business demands.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.