
Network Operations Incident Report: BGP Routing Failure and Resolution

This report details a network operations incident where a BGP routing change caused an EBGP neighbor to go idle, outlines the step‑by‑step troubleshooting, analysis of the root cause, and the implemented solution involving a new L3 node and redundant EBGP peers.

Zhuanzhuan Tech

1 Problem Background

Historically, inter‑environment connectivity relied on serial connections with shared core links and forwarding nodes, which proved unreliable. The network was therefore migrated to a full BGP architecture with EBGP peering across a hybrid‑cloud environment.

During a high‑availability test, a fiber cut caused the EBGP‑5 neighbor to enter the idle state, breaking connectivity between the office environment and the hosted IDC.

2 Exploration

The investigation focused on high‑availability aspects.

2.1 Neighbor State Confirmation

All EBGP neighbors were checked to ensure they were in the established state.
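A check like this can be sketched in a few lines. The sample table below is illustrative (the field layout and addresses are assumptions, not the real device output); it mimics a `show bgp summary`-style listing and flags any session that is not Established:

```python
# Illustrative check of BGP session states from a summary-style table.
# The sample text, neighbor addresses, and ASNs are assumptions for this sketch.
SAMPLE_SUMMARY = """\
Neighbor        AS      State
10.0.5.1        65005   Established
10.0.5.5        65006   Idle
"""

def non_established(summary: str) -> list[str]:
    """Return neighbor IPs whose BGP session is not in the Established state."""
    rows = summary.strip().splitlines()[1:]  # skip the header row
    bad = []
    for row in rows:
        neighbor, _asn, state = row.split()
        if state != "Established":
            bad.append(neighbor)
    return bad

print(non_established(SAMPLE_SUMMARY))  # ['10.0.5.5']
```

In the incident described here, such a check would surface the EBGP‑5 session sitting in Idle while the remaining peers stayed Established.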

2.2 Route Validity Check

Routing announcements on both the office side and the IDC side were verified to ensure successful BGP propagation.

2.3 Route Filtering

The prefix lists applied in the export direction toward internal peers were examined to confirm that no filtering rules were unintentionally dropping routes.
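The semantics being verified can be modeled simply: entries are evaluated in order, with an implicit deny at the end, which mirrors common router prefix‑list behavior. The prefixes below are hypothetical:

```python
import ipaddress

# Hypothetical export prefix list: (prefix, action) pairs evaluated in order,
# with an implicit deny at the end, as on most router platforms.
PREFIX_LIST = [
    (ipaddress.ip_network("10.10.0.0/16"), "permit"),
    (ipaddress.ip_network("192.168.0.0/16"), "deny"),
]

def export_allowed(route: str) -> bool:
    """Return True if the route would pass the export prefix list."""
    net = ipaddress.ip_network(route)
    for prefix, action in PREFIX_LIST:
        if net.subnet_of(prefix):
            return action == "permit"
    return False  # implicit deny

print(export_allowed("10.10.20.0/24"))   # True
print(export_allowed("192.168.1.0/24"))  # False
```

In this incident the lists were clean; the routes were being dropped elsewhere, as the cause analysis below shows.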

2.4 Manually Adding a New Subnet to Trigger Route Update

A new loopback interface and subnet were configured on the office edge device, but the EBGP neighbor still did not receive the updated routes.

3 Cause Analysis

Both ends of the link have correct device configurations and normal route announcements.

The cloud L3 node receives direct neighbor routes, indicating it gets complete updates.

The next‑hop device beyond the cloud L3 node lacks the expected routes.

Debugging on the IDC side shows that the cloud EBGP neighbor does not forward the office subnet routes.

The issue stemmed from a design assumption that adding sub‑interfaces and EBGP peers in different ASes would bypass BGP's loop‑avoidance mechanisms. This assumption was incorrect: the cloud L3 node applied AS‑path loop prevention and filtered the office routes, causing the loss of connectivity.
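The loop‑prevention behavior at the heart of the root cause can be sketched as follows: an EBGP speaker rejects any update whose AS_PATH already contains its own ASN, no matter which sub‑interface or peer the update arrives on. The ASNs here are illustrative, not the real deployment values:

```python
# Minimal sketch of EBGP AS-path loop prevention. A router drops any update
# whose AS_PATH already contains its own ASN, regardless of the sub-interface
# or peer it arrives on. ASNs are illustrative.

def accept_update(local_asn: int, as_path: list[int]) -> bool:
    """Return True if the update passes the AS-path loop check."""
    return local_asn not in as_path

CLOUD_ASN = 65000  # hypothetical ASN for the cloud L3 node

# Office routes arriving back at the cloud L3 node via a second sub-interface
# still carry the cloud ASN in their AS_PATH, so they are filtered:
print(accept_update(CLOUD_ASN, [65010, 65000]))  # False — loop detected
print(accept_update(CLOUD_ASN, [65010]))         # True
```

This is why adding sub‑interfaces alone could not help: the AS_PATH check is keyed to the ASN, not the interface, so every path through the same cloud AS was filtered.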

4 Solution

After detailed verification with the cloud provider, the following solution was implemented:

Create a dedicated L3 node in the cloud.

Connect this new L3 node to both the office environment and the IDC via separate dedicated line‑access points.

On each side, add a new sub‑interface to link with the new line‑access point and establish new EBGP peers (EBGP‑6 and EBGP‑7) with the cloud L3 node.

Publish the local routes from both the office and IDC through the new sub‑interfaces to the cloud L3 node.

The new EBGP‑6/EBGP‑7 peers provide redundant paths for the original EBGP‑5, restoring route propagation.
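The redundancy this buys can be sketched as a routing table that keeps per‑peer paths and falls back to whichever sessions remain Established. Peer names match the report; the prefix and table layout are illustrative:

```python
# Sketch of redundant EBGP peers restoring reachability: paths learned from
# EBGP-6/EBGP-7 stay usable when the EBGP-5 session goes Idle.
# The prefix and table layout are illustrative.

rib = {
    "10.10.0.0/16": ["EBGP-5", "EBGP-6", "EBGP-7"],  # learned from all three peers
}
session_state = {"EBGP-5": "Idle", "EBGP-6": "Established", "EBGP-7": "Established"}

def usable_paths(prefix: str) -> list[str]:
    """Return the peers that still provide a usable path for the prefix."""
    return [p for p in rib.get(prefix, []) if session_state[p] == "Established"]

print(usable_paths("10.10.0.0/16"))  # ['EBGP-6', 'EBGP-7']
```

With the dedicated L3 node terminating the new peers on separate line‑access points, a single fiber cut no longer isolates the office environment from the IDC.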

5 Summary

The root cause was a misconception that adding sub‑interfaces and EBGP peers across different ASes would automatically bypass AS‑path and split‑horizon loop‑prevention, which led to unnecessary troubleshooting time.

While BGP‑4 remains the backbone of the Internet, cloud‑based networking introduces new concepts and products that require updated understanding; mastering these nuances is essential for building stable, resilient services.

About the author

Wan Jingrui, Head of Infrastructure Operations at Zhuanzhuan.

Routing · Incident Response · Cloud Networking · BGP · Network Operations
Written by

Zhuanzhuan Tech

A platform for Zhuanzhuan R&D and industry peers to learn and exchange technology, regularly sharing frontline experience and cutting‑edge topics. We welcome practical discussions and sharing; contact waterystone with any questions.
