Why Didi’s Nightingale Is Redefining Cloud‑Native Monitoring
Nightingale, Didi’s open‑source enterprise monitoring platform, builds on Open‑Falcon. It adds a hierarchical object tree, in‑memory indexing, Gorilla‑compressed time‑series storage, a hybrid push‑pull alert engine, built‑in log monitoring, and a unified monapi module, and it targets scalable, cloud‑native observability for both container and bare‑metal workloads.
Overview
Nightingale is an open‑source, enterprise‑grade monitoring solution jointly developed by Didi’s Base Platform and Didi Cloud. It is designed for cloud‑native environments, supporting both containerized and bare‑metal deployments, and can scale from a few machines to hundreds of thousands.
Key improvements over Open‑Falcon
Alert engine redesign : combines push‑based real‑time evaluation with pull‑based complex condition support, enabling multi‑condition and no‑data alerts.
Navigation object tree : replaces the flat HostGroup with a hierarchical object tree for flexible grouping and policy binding.
Index module upgrade : moves metric indexing from MySQL to an in‑memory index, improving scalability and query performance.
Time‑series storage optimization : integrates Facebook’s Gorilla compression for recent data in memory while retaining rrdtool format for long‑term storage.
High‑availability alert engine : the judge module uses a heartbeat mechanism for automatic failover; the index module adopts a similar strategy.
Built‑in log monitoring : the collector can match log patterns and extract metrics directly.
Operational simplification : merges several components (portal, uic, dashboard, hbs, alarm) into a single monapi module, reducing deployment complexity.
Centralized configuration : extracts common settings into mysql.yml and address.yml, providing defaults and clearer maintenance.
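To illustrate why the alert engine redesign matters, here is a minimal sketch of a multi‑condition rule and a no‑data check. It is not Nightingale's judge code: the names Condition, evaluate_all, and is_nodata, the metric names, and the thresholds are all hypothetical. The underlying idea is that an evaluator triggered only by arriving points never runs when a series goes silent, so a no‑data alert needs a periodic pull‑style check.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Point = Tuple[int, float]  # (unix timestamp, value)


@dataclass
class Condition:
    """One clause of a rule (hypothetical structure, for illustration only)."""
    metric: str
    predicate: Callable[[float], bool]


def latest(points: List[Point]) -> Optional[Point]:
    """Return the most recent point, or None if the series is empty."""
    return max(points, key=lambda p: p[0]) if points else None


def evaluate_all(conditions: List[Condition], series: Dict[str, List[Point]]) -> bool:
    """Multi-condition rule: fires only if every condition holds on its latest point."""
    for cond in conditions:
        point = latest(series.get(cond.metric, []))
        if point is None or not cond.predicate(point[1]):
            return False
    return True


def is_nodata(points: List[Point], now: int, max_silence_s: int) -> bool:
    """No-data check: run on a timer, because silence produces no push event."""
    point = latest(points)
    return point is None or now - point[0] > max_silence_s


series = {
    "cpu.util": [(100, 85.0), (110, 95.0)],
    "mem.used.percent": [(110, 92.0)],
}
rule = [
    Condition("cpu.util", lambda v: v > 90),
    Condition("mem.used.percent", lambda v: v > 80),
]
fired = evaluate_all(rule, series)
silent = is_nodata(series["cpu.util"], now=200, max_silence_s=60)
```

In this sketch both conditions hold on the latest points, and the CPU series counts as silent once more than 60 seconds pass without a new point.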
Architecture
The system consists of the following core modules:
Collector (agent) : gathers host metrics, supports native log monitoring, a plugin mechanism, and custom data reporting.
Transfer : RPC interface that receives collector data and forwards it via consistent hashing to multiple TSDB and judge instances.
TSDB : stores historical time‑series data (based on Open‑Falcon’s graph component) and forwards a copy to the index module for indexing.
Index : in‑memory index replacing MySQL, enabling fast and flexible queries.
Judge : alert engine that synchronizes policies from monapi, evaluates incoming data, and pushes alert events to Redis.
Monapi (alarm) : consumes alerts from Redis, enriches them, and forwards them to various sender components (mail, SMS, etc.).
Database : MySQL stores user, team, tree node, policy, dashboard, and heartbeat information.
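The transfer module is described as forwarding data via consistent hashing. The sketch below shows the general technique, not Nightingale's implementation: the HashRing class, the replica count, the MD5‑based hash, and the instance names are assumptions chosen for illustration.

```python
import bisect
import hashlib
from typing import List, Tuple


class HashRing:
    """Consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes: List[str], replicas: int = 100) -> None:
        # Each node is placed on the ring `replicas` times to spread load.
        ring: List[Tuple[int, str]] = []
        for node in nodes:
            for i in range(replicas):
                ring.append((self._hash(f"{node}#{i}"), node))
        ring.sort()
        self._ring = ring
        self._keys = [h for h, _ in ring]

    @staticmethod
    def _hash(key: str) -> int:
        # First 8 bytes of MD5 as an integer; any stable hash would do here.
        return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

    def node_for(self, series_key: str) -> str:
        """Pick the first ring position clockwise from the key's hash."""
        idx = bisect.bisect_right(self._keys, self._hash(series_key))
        return self._ring[idx % len(self._ring)][1]


ring = HashRing(["tsdb-1", "tsdb-2", "tsdb-3"])
target = ring.node_for("host-1/cpu.util")
```

The property that makes this attractive for a forwarding tier is that removing one instance only remaps the series that lived on it; series assigned to the remaining instances stay where they were.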
Differences from Open‑Falcon
Alert engine : switches from a pure push model to a push‑pull hybrid, supporting multi‑condition and no‑data alerts.
Object tree : replaces flat HostGroup with a hierarchical navigation object tree, simplifying policy binding and object management.
Indexing : replaces the MySQL‑based metric index with an in‑memory index, removing the MySQL bottleneck as metric counts grow into the billions.
Time‑series storage : adds Gorilla compression for recent data in memory while keeping rrdtool for long‑term data.
High availability : judge and index modules include heartbeat‑based automatic failover.
Log monitoring : native log pattern matching and metric extraction are built into the collector.
Module consolidation : portal, uic, dashboard, hbs, and alarm are merged into the monapi module, reducing inter‑process calls.
Configuration centralization : common settings are moved to dedicated YAML files with sensible defaults.
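The difference between a flat HostGroup and a hierarchical object tree can be sketched as follows. This is an illustration only: the Node class and its methods are hypothetical, and the assumption that a policy bound to a node also applies to its descendants is mine rather than something the source states.

```python
from typing import Dict, List, Optional


class Node:
    """A node in a hypothetical object tree (not Nightingale's schema)."""

    def __init__(self, name: str, parent: Optional["Node"] = None) -> None:
        self.name = name
        self.parent = parent
        self.children: Dict[str, "Node"] = {}
        self.policies: List[str] = []   # policies bound directly to this node
        self.endpoints: List[str] = []  # hosts or other objects mounted here

    def child(self, name: str) -> "Node":
        """Return the named child, creating it on first use."""
        if name not in self.children:
            self.children[name] = Node(name, parent=self)
        return self.children[name]

    def path(self) -> str:
        """Dotted path from the root down to this node."""
        parts: List[str] = []
        node: Optional["Node"] = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return ".".join(reversed(parts))

    def effective_policies(self) -> List[str]:
        """ASSUMPTION: policies bound to ancestors also apply to this node."""
        inherited = self.parent.effective_policies() if self.parent else []
        return inherited + self.policies


root = Node("corp")
root.policies.append("host-down")
web = root.child("platform").child("web")
web.policies.append("http-5xx-rate")
web.endpoints.append("host-1")
```

Under that inheritance assumption, a policy bound once near the root covers every subtree, which a flat group list would need one binding per group to achieve.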
Similarities to Open‑Falcon
The data model (metric, endpoint, tags) remains unchanged, allowing reuse of existing agents and plugins. Nightingale’s collector combines the original Open‑Falcon agent and falcon‑log‑agent functionality.
The overall data flow and processing logic are similar: a push model feeds data into storage and alert evaluation pipelines.
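As a rough picture of the shared data model, the sketch below builds a series key from metric, endpoint, and tags and keeps an in‑memory inverted index over them. The key format, the MemoryIndex class, and the sample metric names are assumptions for illustration; neither Open‑Falcon's nor Nightingale's actual index layout is reproduced here.

```python
from collections import defaultdict
from typing import Dict, Set


def series_key(endpoint: str, metric: str, tags: Dict[str, str]) -> str:
    """Build a stable identifier; tags are sorted so ordering does not matter."""
    tag_part = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{endpoint}/{metric}/{tag_part}"


class MemoryIndex:
    """Inverted index: 'name=value' label -> set of series keys (sketch)."""

    def __init__(self) -> None:
        self._by_label: Dict[str, Set[str]] = defaultdict(set)

    def add(self, endpoint: str, metric: str, tags: Dict[str, str]) -> str:
        key = series_key(endpoint, metric, tags)
        # Simplification: a tag literally named "endpoint" or "metric"
        # would overwrite the built-in label of the same name.
        labels = {"endpoint": endpoint, "metric": metric, **tags}
        for name, value in labels.items():
            self._by_label[f"{name}={value}"].add(key)
        return key

    def query(self, **labels: str) -> Set[str]:
        """Return series matching every given label (set intersection)."""
        sets = [self._by_label.get(f"{n}={v}", set()) for n, v in labels.items()]
        return set.intersection(*sets) if sets else set()


index = MemoryIndex()
index.add("host-1", "disk.io.util", {"device": "sda"})
index.add("host-1", "disk.io.util", {"device": "sdb"})
index.add("host-2", "disk.io.util", {"device": "sda"})
matches = index.query(metric="disk.io.util", device="sda")
```

Lookups become set intersections in memory, which is the kind of query an index held in MySQL would otherwise answer with joins or scans.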
Ongoing work
Developing a metric aggregation component for cluster‑level monitoring (e.g., sum, average across nodes).
Seamless integration with Kubernetes.
Expanding and maintaining community plugins, including porting existing Open‑Falcon plugins and creating new ones.
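The planned aggregation component is only described in outline, so the following is a sketch of the general idea of cluster‑level sum and average, not of how Nightingale will implement it. The aggregate function, the 60‑second step, and the sample values are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Sample = Tuple[str, int, float]  # (endpoint, unix timestamp, value)


def aggregate(samples: List[Sample], step: int = 60) -> Dict[int, Dict[str, float]]:
    """Align samples from all nodes to `step`-second buckets, then sum and average."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for _endpoint, ts, value in samples:
        buckets[ts - ts % step].append(value)  # floor timestamp to bucket start
    return {
        ts: {"sum": sum(vals), "avg": sum(vals) / len(vals)}
        for ts, vals in sorted(buckets.items())
    }


samples = [
    ("host-1", 60, 10.0),
    ("host-2", 65, 30.0),
    ("host-1", 120, 20.0),
]
cluster = aggregate(samples)
```

Bucketing by time first matters because nodes rarely report at exactly the same second; without alignment, values from different nodes would never land in the same group.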
Resources
GitHub repository: https://github.com/didi/nightingale
