
How Octopux Achieves 99.9% Bandwidth Monitoring Accuracy at Scale

Octopux is an open‑source bandwidth monitoring platform designed by Baishan Cloud to deliver 99.9% data integrity, cross‑operator and cross‑country coverage, minute‑level granularity, and horizontal scalability for tens of thousands of devices, addressing the limitations of traditional tools like Cacti.

Efficient Ops

Introduction

Bandwidth monitoring is essential for carrier settlement and network quality monitoring, and it is a must‑have system for any internet company with self‑built resources.

Baishan Cloud operates thousands of devices across dozens of countries and many carriers. In such a complex environment, ensuring precise bandwidth data collection, flexible integration for different scenarios, and scalability to tens of thousands of devices poses significant technical challenges.

In March, Baishan Cloud open-sourced Octopux, its self-developed bandwidth monitoring system, to share its solution with the community.

Conventional Systems Fall Short

Like many companies, we initially used Cacti, but as the network grew to about 800 devices, Cacti showed serious problems:

Insufficient poller concurrency

With 800 devices, the poller could only sustain 5-minute granularity, while the business needed 1-minute granularity.

Frequent cross‑carrier monitoring failures

Even with the Cacti server hosted in a tri-line (three-carrier) BGP data center, data was still lost when polling across carriers.

Server I/O bottlenecks

Updating 8,000 RRD files every 5 minutes caused severe disk I/O issues.

Low data extraction efficiency

Bandwidth data stored in binary RRD files could only be extracted with rrdtool, which made flexible data aggregation impractical.

Octopux Breaks Technical Bottlenecks

We abandoned further Cacti modifications and, after studying many open‑source projects, built a new bandwidth monitoring system.

The core design goals are:

99.9% data completeness

Perfect cross‑carrier and cross‑country monitoring

Horizontal scalability to support tens of thousands of servers

Minute-level granularity

Simple and efficient data query interface
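The completeness goal can be made measurable in a few lines. The sketch below is an illustration, not Octopux code: the function name `completeness` and the use of epoch-second timestamps are assumptions. It counts how many expected 1-minute slots in a window actually contain a data point.

```python
def completeness(timestamps, start, end, step=60):
    """Return the fraction of step-sized slots in [start, end) holding at least one point."""
    if end <= start:
        raise ValueError("end must be after start")
    total_slots = -(-(end - start) // step)  # ceiling division
    # Map every in-range timestamp to its slot index and count distinct slots.
    seen = {(ts - start) // step for ts in timestamps if start <= ts < end}
    return len(seen) / total_slots

# One day of 1-minute points with two minutes missing (hypothetical data).
day = 24 * 60 * 60
points = [t for t in range(0, day, 60) if t not in (600, 7200)]
ratio = completeness(points, 0, day)
print(f"{ratio:.4%}")
```

A day has 1,440 one-minute slots, so a 99.9% target leaves room for at most one missing point per series per day; the two gaps in this example already fall short of it.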

Architecture diagram:

1. swcollector Data Collection Module

swcollector runs as a background process on every server, collecting data and sending it to the swtfr component of the data collection center.

In cross‑network or cross‑country environments, if swcollector cannot reach swtfr directly, it retries via multiple gateways to ensure delivery.
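The retry behavior described above can be sketched as ordered failover. This is a minimal illustration under stated assumptions: the `deliver` function, the `send` callback, and the endpoint names are hypothetical, and the real swcollector may choose and order gateways differently.

```python
def deliver(payload, send, direct, gateways):
    """Try the direct endpoint first, then each gateway in order.

    `send(endpoint, payload)` is a caller-supplied function that raises
    OSError on failure. Returns the endpoint that accepted the payload.
    """
    errors = []
    for endpoint in [direct, *gateways]:
        try:
            send(endpoint, payload)
            return endpoint
        except OSError as exc:
            errors.append((endpoint, exc))  # remember why this hop failed
    raise ConnectionError(f"all endpoints failed: {errors}")

# Simulated network in which only the second gateway is reachable.
def fake_send(endpoint, payload):
    if endpoint != "gateway-b":
        raise OSError(f"{endpoint} unreachable")

used = deliver(b"metrics", fake_send, "swtfr", ["gateway-a", "gateway-b"])
print(used)
```

Trying the direct path first keeps the common case cheap; the gateways only add latency when the direct route is down.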

2. Data Writing and Querying

Three sets of swtfr + influxdb + flow-api components are deployed globally. Each swtfr replicates incoming data to three InfluxDB instances.
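Replicating each batch to three backends amounts to a fan-out write. The sketch below uses in-memory lists as stand-ins for InfluxDB instances; the `replicate` function and the writer callables are assumptions for illustration, and it does not use any real InfluxDB client API.

```python
from concurrent.futures import ThreadPoolExecutor

def replicate(points, writers):
    """Write the same batch to every backend; return the names that succeeded."""
    def attempt(item):
        name, write = item
        try:
            write(points)
            return name, True
        except Exception:
            return name, False  # one failed replica must not block the others

    with ThreadPoolExecutor(max_workers=len(writers)) as pool:
        results = list(pool.map(attempt, writers.items()))
    return [name for name, ok in results if ok]

# In-memory stand-ins for three InfluxDB instances; the second one is "down".
stores = {"influx-1": [], "influx-2": [], "influx-3": []}

def make_writer(name):
    def write(points):
        if name == "influx-2":
            raise IOError("simulated outage")
        stores[name].extend(points)
    return write

writers = {name: make_writer(name) for name in stores}
ok = replicate([("eth0.in", 1700000000, 125.0)], writers)
print(ok)
```

Because every replica receives the full batch, a single failed write leaves two complete copies for the query side to draw on.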

flow‑api handles queries and aggregation: it splits a query into minimal‑granularity events, queries the three InfluxDB nodes in parallel, then aggregates the results.

flow‑api supports common aggregations such as max, min, average, and group‑by.
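One plausible reading of this split, parallel query, and aggregate flow is sketched below. The replicas are modeled as plain dictionaries keyed by slot start time, and `make_node`, `query_merged`, and `aggregate` are hypothetical names; how flow-api actually reconciles replicas is not described in the article.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean

def make_node(data):
    """Stand-in for one InfluxDB replica: returns {slot_start: value} for a range."""
    def fetch(start, end):
        return {t: v for t, v in data.items() if start <= t < end}
    return fetch

def query_merged(nodes, start, end, step=60):
    """Query every replica in parallel, then fill each slot from the first replica that has it."""
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        replies = list(pool.map(lambda fetch: fetch(start, end), nodes))
    merged = {}
    for slot in range(start, end, step):
        for reply in replies:
            if slot in reply:
                merged[slot] = reply[slot]
                break
    return merged

def aggregate(series, how):
    """Apply one of the aggregations named in the article to a merged series."""
    funcs = {"max": max, "min": min, "average": mean}
    return funcs[how](series.values())

# Each replica is missing different minutes; together they cover the range.
nodes = [
    make_node({0: 10.0, 60: 20.0}),
    make_node({60: 20.0, 120: 30.0}),
    make_node({0: 10.0, 120: 30.0, 180: 40.0}),
]
series = query_merged(nodes, 0, 240)
print(series)
print(aggregate(series, "max"), aggregate(series, "average"))
```

Merging slot by slot means a gap on one replica is invisible to the caller as long as any other replica holds that minute.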

3. Monitoring Efficiency

The system can monitor up to 150,000 data points per minute, with over 90% of data written within 3 seconds and query latency under 3 seconds, fully meeting business requirements.
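To put the throughput figure in per-second terms, here is a short calculation. It assumes the 150,000 figure counts points before replication, that load is spread evenly over the minute, and that every point is written to three InfluxDB instances; none of these is stated explicitly for this number.

```python
POINTS_PER_MINUTE = 150_000   # figure quoted in the article
REPLICAS = 3                  # assumption: every point is written to three instances

points_per_second = POINTS_PER_MINUTE / 60
replicated_writes_per_second = points_per_second * REPLICAS

print(points_per_second)             # 2500.0
print(replicated_writes_per_second)  # 7500.0
```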

When the monitoring scale grows, horizontal expansion of the swtfr + influxdb + flow-api stack further boosts performance; InfluxDB itself also scales horizontally for larger storage and higher read/write throughput.

4. Service Capabilities

Monitoring targets are discovered automatically. The system supports per-NIC inbound/outbound monitoring, separation of a server's internal and external bandwidth, and per-port inbound/outbound monitoring.
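The article does not say how inbound and outbound bandwidth is derived. A common approach, assumed here, is to sample a cumulative interface byte counter (for example the 64-bit ifHCInOctets counter from IF-MIB) and convert the difference between two samples into bits per second. The function name `rate_bps` is hypothetical.

```python
def rate_bps(prev, curr, counter_bits=64):
    """Bits per second between two (timestamp_seconds, byte_counter) samples."""
    (t0, c0), (t1, c1) = prev, curr
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    delta = c1 - c0
    if delta < 0:
        # Treated as a counter wrap here; a device reboot also resets the
        # counter and would need separate handling.
        delta += 2 ** counter_bits
    return delta * 8 / (t1 - t0)

# 75,000,000 bytes in 60 seconds is 10 Mbit/s.
normal = rate_bps((0, 1_000_000), (60, 76_000_000))
# A 32-bit counter that wrapped between samples: 1,500 bytes in 60 seconds.
wrapped = rate_bps((0, 2**32 - 1000), (60, 500), counter_bits=32)
print(normal, wrapped)
```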

Switch bandwidth data is collected by multiple swcollector instances, aggregated by flow‑api, and output with 1‑minute granularity for higher precision.

Integration with CMDB enables automatic hierarchical display, merging by topology, usage, or billing comparison, with query response times in seconds.

Example data visualization:

Giving Back to the Community Through Open Source

The bandwidth monitoring system, while seemingly simple, becomes complex at large scale. We have open‑sourced Octopux to provide inspiration and reference.

octopux-swcollector: https://github.com/baishancloud/octopux-swcollector

octopux-swtfr: https://github.com/baishancloud/octopux-swtfr

octopus-gateway: https://github.com/baishancloud/octopus-gateway

Thanks to Xiaomi’s open‑source open‑falcon project for foundational components.

Postscript

The InfluxDB ecosystem is maturing: it is lightweight, has no external dependencies, and offers rich visualization components and built-in aggregation. Building on it, we are exploring a third-generation monitoring system with better aggregation analysis and more complex alerting.

Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.
