
Optimization and Redesign of Open-Falcon Monitoring System for the 37 Monitoring Platform

The project redesigns the Open‑Falcon monitoring system for the 37 platform by integrating it with the existing CMDB, adding distributed‑lock high‑availability for judge and alarm modules, optimizing cross‑region agent data transmission, fixing timezone inconsistencies, and enabling redundant query/graph services, thereby unifying disparate monitoring tools into a scalable, reliable solution.

37 Interactive Technology Team

Background

The monitoring system is a critical component of any business infrastructure, acting like eyes that continuously observe data centers, networks, servers, and applications, and respond promptly when issues arise.

History of 37 Monitoring

Initially, 37 Monitoring relied on Zabbix and DBA TianTu. As business lines expanded, the number of monitoring items grew, exposing limitations of these systems and making it impossible to support rapid business development.

Zabbix Core Issues

1. Performance: The single‑point architecture cannot scale much beyond ten thousand machines, and storage relies on MySQL, which becomes a bottleneck at that scale.

2. Functionality: Deep customization is difficult and tailored features are poorly supported, while the goal is full integration with the operations platform (OPS) for unified operations.

3. Usability: Complex configuration, a steep learning curve, and insufficient external APIs hinder integration with other systems.

Consequently, a replacement was sought, and after evaluation, the open‑source Open‑Falcon system (originally released by Xiaomi) was selected.

Problem Analysis of Open‑Falcon

1. Open‑Falcon maintains its own CMDB, which is not synchronized with the existing operations platform CMDB, leading to inconsistent machine information and reduced monitoring accuracy.

2. Cross‑region agents experience network latency, causing frequent false alarms.

3. Core strategy modules (Transfer → Judge → Alarm → Sender) are single points; failure of any module invalidates alarm strategies.

4. Timezone inconsistencies between domestic and overseas monitoring data.

Architecture Comparison (Before & After)

Official Open‑Falcon Architecture:

37‑Falcon Architecture:

Main Architectural Changes

1. Distributed‑lock high‑availability redesign for alarm‑strategy modules (Judge, Alarm)
2. Overseas data transmission: failover strategy and dedicated line to improve QoS and reliability
3. Web reconstruction: Agent auto‑associates with CMDB, unifying the operations platform
4. ...

The following sections detail the specific improvements.

Secondary Redesign – Agent Auto‑Association with CMDB

Background

1. The unique identifier of machines in the current CMDB is SN. To align monitoring with the CMDB, SN must be propagated; otherwise, discrepancies in machine, domain, and network resources lead to inaccurate monitoring data and high operational costs.

2. This issue is common across monitoring systems and requires deep integration with the operations CMDB.

Solution

1. During Agent startup, automatically retrieve SN from the CMDB API (typically performed during machine initialization after automatic Agent installation).

2. Periodically synchronize CMDB metadata with the monitoring CMDB to reduce development effort.

Implementation

Macro diagram:

Core workflow diagram:

PS: A cache file for SN is used to avoid network failures that would otherwise cause data collection failures.
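The API-with-cache-fallback behavior above can be sketched as follows. This is an illustrative sketch in Python (the real agent is written in Go); the cache path and the `api_call` callback are assumptions, not the actual Open‑Falcon implementation.

```python
import os
import tempfile

# Hypothetical cache location; the real agent would use a path under its own directory.
CACHE_PATH = os.path.join(tempfile.gettempdir(), "agent_sn.cache")

def fetch_sn_from_cmdb(api_call):
    """Try the CMDB API first; fall back to the local cache file.

    `api_call` is any zero-argument function that returns the SN string
    or raises on network failure.
    """
    try:
        sn = api_call()
        with open(CACHE_PATH, "w") as f:
            f.write(sn)  # refresh the cache on every successful lookup
        return sn
    except Exception:
        if os.path.exists(CACHE_PATH):
            with open(CACHE_PATH) as f:
                return f.read().strip()  # network down: use the last known SN
        raise  # no cache yet: collection genuinely cannot proceed
```

The point of the cache is that a transient CMDB outage no longer stops metric collection: the agent keeps reporting under the last SN it successfully resolved.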

Effect 1 – Self‑Associating CMDB

Effect 2 – Monitoring Automation Coverage

Agent Data Transmission Optimization

The Agent holds a list of Transfer/Gateway machines; it sorts the list by ICMP latency and port connectivity, so data is always sent to the nearest reachable node, reducing latency and improving throughput (especially overseas).
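The ranking step can be sketched like this, in Python for brevity. The `probe` callback stands in for the agent's ICMP/port check and is an assumption of this sketch, not the actual Open‑Falcon code.

```python
def rank_endpoints(endpoints, probe):
    """Order Transfer/Gateway endpoints by measured latency.

    `probe(endpoint)` returns a latency in milliseconds, or None when
    the port is unreachable; unreachable nodes are dropped entirely.
    """
    reachable = []
    for ep in endpoints:
        latency = probe(ep)
        if latency is not None:
            reachable.append((latency, ep))
    reachable.sort()  # lowest latency first
    return [ep for _, ep in reachable]
```

With this ordering, the nearest node is always tried first, and nodes whose port is down never appear in the send list at all.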

Background

Overseas agents send data through international links, often experiencing packet loss that leads to false alarms.

Analysis

1. Intelligent DNS selects low‑latency Transfer nodes based on region (with failover).

2. Dedicated overseas lines eliminate packet loss.

3. RPC timeout tuning for Agent‑Transfer connections prevents congestion under high concurrency (see official optimization at https://github.com/open-falcon/falcon-plus/issues/290).
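Combining points 1 and 3, the agent's send path becomes "try the best node with a bounded RPC timeout, fail over to the next." A minimal sketch, in Python for illustration; `rpc_send` and the default timeout value are assumptions, not the actual agent API.

```python
def send_with_failover(payload, endpoints, rpc_send, timeout=3.0):
    """Try each endpoint in ranked order; the first success wins.

    `rpc_send(endpoint, payload, timeout)` stands in for the agent's
    RPC call; it raises on timeout or connection error.
    """
    last_err = None
    for ep in endpoints:
        try:
            return rpc_send(ep, payload, timeout)
        except Exception as err:
            last_err = err  # fail over to the next node
    raise RuntimeError("all transfer nodes unreachable") from last_err
```

Bounding the per-call timeout is what prevents a slow international link from backing up the agent's send queue under high concurrency, which is the congestion issue discussed in the linked upstream issue.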

Implementation

Overall optimization diagram:

Agent → Gateway intelligent DNS with failover:

RPC timeout tuning:

Dedicated line architecture:

Advantages of the dedicated line:

1. International packet loss reduced from ~70% to 0%.

2. Monitoring bandwidth requirements are low; 4 Mbps with dual‑link redundancy is sufficient.

Judge & Alarm Modules Single‑Point Redesign

Background: The Judge module is a single point; its failure disables alarm strategies. Although hash sharding is possible, each shard remains a single point.

Architecture:

Design principle: Use Redis SETNX to create a distributed lock (Redis is single‑threaded, allowing simple lock implementation).

1. Multiple Judges compete for the lock; the winner inserts strategy data into Redis.

2. Alarm retrieves the strategy from Redis and acquires the lock to provide high‑availability service.

3. Transfer performs dual‑write HA; the default version does not use Redis Cluster mode.

Note: Currently only one Judge instance (DB01) is online; additional instances can be added for horizontal scaling.
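The leader-election pattern behind steps 1 and 2 is the classic Redis SETNX-with-expiry race. The sketch below, in Python, uses an in-memory stand-in for Redis rather than a real connection, and the key name, TTL, and helper names are assumptions for illustration; the real implementation would issue `SET key value NX EX ttl` against Redis.

```python
import time

class FakeRedis:
    """In-memory stand-in for Redis, just enough for SETNX + expiry."""
    def __init__(self):
        self.data = {}

    def set_nx_ex(self, key, value, ttl):
        """Mimics `SET key value NX EX ttl`: set only if absent or expired."""
        now = time.time()
        cur = self.data.get(key)
        if cur is None or cur[1] < now:
            self.data[key] = (value, now + ttl)
            return True
        return False

def try_become_leader(store, instance_id, ttl=10):
    """Each Judge instance calls this periodically; only the winner of
    the SETNX race processes strategies and pushes events to Redis."""
    return store.set_nx_ex("judge:leader", instance_id, ttl)
```

Because the lock carries a TTL, a crashed leader simply stops renewing it; once the key expires, a standby Judge wins the next race and takes over, which is what removes the single point.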

Transfer Redesign – Cross‑Timezone Data Time Optimization

Background: Agents worldwide use local time for metrics, causing inconsistent timestamps when data reaches Transfer, Judge, and Alarm, leading to confusing alarm times and mismatched graph displays.

Approach: Adjust Transfer’s time‑handling logic; if an incoming timestamp deviates too far from the server’s clock to be reasonable, replace it with the current time.
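That sanity check can be sketched as below, in Python for illustration. The one-hour skew threshold is an assumed value, not one stated by the project.

```python
import time

# Assumed threshold: timestamps further than this from "now" are treated as bogus.
MAX_SKEW = 3600  # seconds

def normalize_timestamp(ts, now=None):
    """Replace timestamps too far from server time with the current time."""
    now = time.time() if now is None else now
    if abs(ts - now) > MAX_SKEW:
        return int(now)
    return int(ts)
```

Applying this once, centrally in Transfer, means Judge, Alarm, and the graphs all see a single consistent timeline regardless of which timezone the originating agent was configured with.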

Query Redesign – Graph High‑Availability

Problem: Query accesses Graph via hash sharding, but each shard supports only one address. If a machine fails, Query cannot retrieve data, causing missed alarms and broken front‑end trend charts.

Optimization: Enable each shard to support multiple IPs for automatic redundancy (Graph already stores duplicate data across shards).
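The multi-IP shard lookup amounts to trying each replica in turn. A minimal sketch, in Python; the `fetch` callback stands in for the Query-to-Graph RPC and is an assumption of this sketch.

```python
def query_shard(shard_addrs, key, fetch):
    """A shard is now a list of Graph addresses rather than one.

    `fetch(addr, key)` raises when the node is down; because Graph
    replicas within a shard hold the same series, any reachable
    replica can answer the query.
    """
    last_err = None
    for addr in shard_addrs:
        try:
            return fetch(addr, key)
        except Exception as err:
            last_err = err  # this replica is down; try the next one
    raise RuntimeError("no graph replica reachable for shard") from last_err
```

A single Graph machine failure now degrades nothing: Query silently falls through to a surviving replica, so alarms and front-end trend charts keep working.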

Summary

1. The main difficulty of this project lies not in technical challenges but in unifying disparate monitoring systems into a single platform.

2. Leveraging an open‑source system requires deep code analysis and thorough understanding, which ultimately reduces operational risk and better serves business needs.

Tags: monitoring, architecture, high availability, ops, CMDB, Open-Falcon