
How BigBrother Revolutionizes Large‑Scale Virtual Network Connectivity Checks

BigBrother is a TCP‑based, full‑link, large‑scale network connectivity detection system that uses packet coloring and GRE mirroring to automatically locate virtual network faults across public, hybrid, and physical clouds, dramatically reducing troubleshooting time and supporting high‑concurrency tasks.

Efficient Ops

Problem with Traditional Network Troubleshooting

Troubleshooting virtual networks is difficult. Tools such as traceroute offer limited visibility, so most cases require packet captures on hosts or hybrid cloud gateways. This is time‑consuming and labor‑intensive, especially when the transmission path is long and crosses domains.

BigBrother Overview

BigBrother is an internal detection system that supports full‑link, large‑scale network connectivity checks. It uses TCP packet coloring to separate detection traffic from user traffic, works across physical clouds and cross‑region scenarios, and provides a framework that helps operations teams pinpoint faults or quickly verify virtual network health.

Since its launch, BigBrother has been used for connectivity verification before and after cloud host migrations, detecting nearly ten anomalies during the migration of over 2,000 hosts.

Limitations of First‑Generation Tool

The previous tool used SSH to jump to each host, injected packets with OVS packet‑out commands, and then ran tcpdump on the remote host to capture them. It was inefficient, supported only a limited set of scenarios, and could not handle DPDK or P4 gateway products.

BigBrother Architecture

BigBrother consists of several components: mafia provides a console for task creation and result display, minitrue translates user parameters into packet injection ranges, and telescreen constructs and sends packets. The system monitors the entire network by injecting GRE‑encapsulated probe packets at entry points and mirroring them at endpoints.
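To make the GRE encapsulation mentioned above concrete, here is a minimal sketch that prepends a base GRE header (RFC 2784, with no checksum, key, or sequence fields) to an inner IPv4 packet. How telescreen actually builds its packets is not described in this article; the function names are hypothetical, and the outer IP header that would carry the GRE packet is omitted.

```python
import struct

GRE_PROTO_IPV4 = 0x0800  # EtherType announcing an IPv4 payload


def gre_encapsulate(inner_packet: bytes) -> bytes:
    """Prepend a 4-byte base GRE header (assumption: no optional fields)."""
    flags_and_version = 0x0000  # all flag bits clear, version 0
    header = struct.pack("!HH", flags_and_version, GRE_PROTO_IPV4)
    return header + inner_packet


def gre_decapsulate(gre_packet: bytes) -> bytes:
    """Strip the base GRE header and return the inner packet."""
    flags_and_version, proto = struct.unpack("!HH", gre_packet[:4])
    if flags_and_version != 0 or proto != GRE_PROTO_IPV4:
        raise ValueError("unsupported GRE header in this sketch")
    return gre_packet[4:]
```

A real injector would also need the outer IP header addressed to the entry point, which this sketch leaves out.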

Key Concepts

Entrypoint: the inbound/outbound interface where probe packets are sent and received.

Endpoint: the network element closest to an instance, used for sampling and mirroring probe packets.

Different cloud scenarios map these concepts to specific devices (e.g., OVS in public cloud, vpcgw/hybridgw in physical cloud, sdngw in cross‑domain).

Detection Flow

BigBrother injects a probe that simulates traffic from source to destination. The entry point forwards the probe to the endpoint, which mirrors it to BigBrother before passing it on to the instance. The instance replies, and the response is mirrored back to BigBrother on the reverse path. Connectivity is confirmed when BigBrother receives all of the expected mirrored packets.
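The "all expected mirrored packets" rule can be sketched as a small check over collected mirror events. This is an illustration only: the `MirrorEvent` type, the two direction labels, and the assumption that exactly one request mirror and one reply mirror are expected per pair are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MirrorEvent:
    pair_id: int    # which source/destination pair the mirrored packet belongs to
    direction: str  # "request" (forward path) or "reply" (reverse path)


# Assumption: a pair counts as connected once both directions were mirrored.
EXPECTED_DIRECTIONS = frozenset({"request", "reply"})


def connected_pairs(events, pair_ids):
    """Return {pair_id: True/False} based on which mirrors were observed."""
    seen = {pid: set() for pid in pair_ids}
    for ev in events:
        if ev.pair_id in seen:
            seen[ev.pair_id].add(ev.direction)
    return {pid: dirs >= EXPECTED_DIRECTIONS for pid, dirs in seen.items()}
```

A pair whose request was mirrored but whose reply never arrived is reported as not connected, which points to a fault at the instance or on the reverse path.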

Probe Packet Design

Two candidate designs were evaluated:

ICMP + TOS: uses ICMP packets with TOS coloring, but requires complex flow rules and cannot learn reverse flows in hybrid clouds.

TCP: uses TCP packets colored with a fixed source and destination port (port 11), which simplifies the flow rules and works in both public and hybrid clouds.

Implementation example:

cookie=0x20008,table=1,priority=40000,tcp,metadata=0x1,tp_src=11,tp_dst=11 actions=Send_BB(),Back_0()
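For illustration, the port‑based coloring that this rule matches on can be reproduced with a hand‑built TCP header. This is a sketch under assumptions: the SYN flag, the window size, and the zero checksum are placeholders chosen here, not details taken from BigBrother.

```python
import struct

PROBE_PORT = 11  # coloring port from the flow rule (tp_src=11, tp_dst=11)

# src port, dst port, seq, ack, offset+flags, window, checksum, urgent pointer
TCP_HEADER = struct.Struct("!HHIIHHHH")


def build_probe_tcp_header(seq: int, flags: int = 0x002) -> bytes:
    """Build a 20-byte TCP header colored with the probe port.

    Assumption: flags default to SYN; the checksum is left at 0 and
    would have to be computed by a real sender.
    """
    offset_and_flags = (5 << 12) | flags  # data offset = 5 words, no options
    return TCP_HEADER.pack(PROBE_PORT, PROBE_PORT, seq, 0,
                           offset_and_flags, 65535, 0, 0)


def is_probe(tcp_header: bytes) -> bool:
    """Mimic the flow-rule match: both ports must equal the probe port."""
    src, dst = struct.unpack("!HH", tcp_header[:4])
    return src == PROBE_PORT and dst == PROBE_PORT
```

Because the match needs only the two port fields, user traffic on any other port pair passes through untouched.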

Concurrency Enhancement

BigBrother encodes a Task ID (5 bits) and a Pair ID (the remaining 27 bits) in the 32‑bit TCP sequence number. This allows up to 32 concurrent tasks, each covering up to 2^27 (about 134 million) pairs, which is enough for a full‑mesh check of a VPC with roughly 10,000 hosts (at most about 100 million source/destination pairs).
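The bit split can be sketched as a pair of pack/unpack helpers. The article gives the field widths but not their order, so placing the Task ID in the high bits is an assumption, as are the function names.

```python
TASK_BITS = 5
PAIR_BITS = 32 - TASK_BITS   # 27 bits left for the Pair ID
MAX_TASKS = 1 << TASK_BITS   # 32 concurrent tasks
MAX_PAIRS = 1 << PAIR_BITS   # 134,217,728 pairs per task


def encode_seq(task_id: int, pair_id: int) -> int:
    """Pack Task ID (assumed high bits) and Pair ID into a 32-bit sequence number."""
    if not 0 <= task_id < MAX_TASKS:
        raise ValueError("task_id out of range")
    if not 0 <= pair_id < MAX_PAIRS:
        raise ValueError("pair_id out of range")
    return (task_id << PAIR_BITS) | pair_id


def decode_seq(seq: int) -> tuple:
    """Recover (task_id, pair_id) from a sequence number."""
    return seq >> PAIR_BITS, seq & (MAX_PAIRS - 1)


# Capacity check: ordered pairs in a full mesh of 10,000 hosts.
FULL_MESH_10K = 10_000 * 9_999  # 99,990,000, which fits within MAX_PAIRS
```

With this layout, a captured mirror packet can be attributed to its task and pair from the sequence number alone, without any per‑packet state in the analyzer.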

Task Execution Pipeline

When an operator creates a BigBrother task via the mafia console, the workflow is:

mafia sends a request to minitrue, which determines the probe range.

minitrue passes source/destination lists to telescreen.

telescreen builds GRE packets, injects them, and captures mirrored packets.

minitrue periodically analyzes the captured packets.

The final report is displayed in mafia, showing total pairs, successes, failures, and a bitmap indicating per‑pair connectivity.
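The report fields described above (total pairs, successes, failures, and a per‑pair bitmap) could be assembled along these lines. The dictionary keys and the bit ordering (pair 0 in the least significant bit of byte 0) are assumptions made for this sketch.

```python
def build_report(total_pairs: int, successful_pair_ids) -> dict:
    """Summarize results and set one bit per connected pair."""
    bitmap = bytearray((total_pairs + 7) // 8)
    successes = 0
    for pid in set(successful_pair_ids):  # ignore duplicate mirror reports
        if 0 <= pid < total_pairs:
            bitmap[pid // 8] |= 1 << (pid % 8)
            successes += 1
    return {
        "total": total_pairs,
        "success": successes,
        "failure": total_pairs - successes,
        "bitmap": bytes(bitmap),
    }


def pair_ok(bitmap: bytes, pid: int) -> bool:
    """Look up the connectivity bit for a single pair."""
    return bool(bitmap[pid // 8] & (1 << (pid % 8)))
```

At one bit per pair, even a task that uses the full 2^27 pair range needs only 16 MiB for its bitmap.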

Active‑Flow Based Checks

For scenarios where full mesh checks are unnecessary, BigBrother can integrate with the River service to obtain active flow lists, reducing load and focusing verification on hot paths.
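The difference between a full‑mesh probe range and an active‑flow range can be sketched as follows. The River service's interface is not described in this article, so the active flow list is modeled here as plain (source, destination) tuples, and both function names are hypothetical.

```python
from itertools import permutations


def full_mesh_pairs(hosts):
    """Every ordered (src, dst) pair among the hosts."""
    return list(permutations(hosts, 2))


def active_pairs(hosts, active_flows):
    """Keep only flows between known hosts, dropping duplicates and self-pairs."""
    host_set = set(hosts)
    seen = set()
    pairs = []
    for src, dst in active_flows:
        key = (src, dst)
        if src in host_set and dst in host_set and src != dst and key not in seen:
            seen.add(key)
            pairs.append(key)
    return pairs
```

For a VPC with n hosts the full mesh grows as n(n-1), while the active list grows only with the number of flows actually observed.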

Future Plans

Extend detection to include latency, maximum latency, and packet loss metrics.

Build continuous internal network monitoring for specific customers based on BigBrother.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
