
What Can Aircraft Monitoring Teach Us About Building Effective IT Operations Monitoring?

This article explores how aviation-grade monitoring concepts, such as multi-level alarm classification, diverse alert delivery methods, and comprehensive sensor coverage, can inspire centralized, data-driven IT operations monitoring architectures that reduce missed alerts and false positives while improving response times.


Monitoring, literally “watch‑and‑control,” means having the ability to perceive, decide, and respond to the operational state of the digital world, forming the foundation of business continuity. Real‑time data collection is essential, and the collected performance, capacity, and operational metrics become the data assets for intelligent operations.

Although many mature monitoring systems exist, each focuses on a different layer, and no single tool covers every capability. Over time, the value of a monitoring system often lies more in its accumulated configuration than in the product itself, so this article discusses monitoring from an operations-organization perspective rather than from the standpoint of any single product.

1. Learning Operations Monitoring from Aircraft Monitoring

If operations work is "walking on thin ice," aviation operations is a matter of life and death. Missing or delaying an aircraft alarm can cause disaster, so monitoring systems must be reliable and alarms accurate. Aircraft monitoring also covers many dimensions—equipment, crew actions, environment, fuel—requiring extensive coverage and strict alarm handling.

The Boeing 777‑200LR uses over 3,000 sensors to monitor internal devices, crew actions, external environment, and fuel. Alerts are graded, and each level has specific handling procedures, providing useful references for IT monitoring.

1) Alarm Grading

Memo: Normal state the crew must be aware of; displayed in white, with no sound or a single tone.

Advisory: Abnormal but not immediately threatening; yellow, with no sound or a single tone.

Warning: Clear fault threatening safety; yellow, with a continuous tone.

Alarm: Serious fault threatening safety; red, with a persistent high-volume tone that cannot be cleared until the fault is resolved.

Urgent Alarm: Critical fault deteriorating rapidly; red, with a nonstop high-volume tone that cannot be cleared.
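The grading above maps naturally onto a severity model in an IT alerting system. The following is a minimal sketch (all names are hypothetical, not from any specific monitoring product) of how each level could carry its own display and handling policy:

```python
from dataclasses import dataclass
from enum import IntEnum

# Hypothetical severity scale modeled on the aircraft alarm grading above.
class Severity(IntEnum):
    MEMO = 0        # normal state the operator must be aware of
    ADVISORY = 1    # abnormal, not immediately threatening
    WARNING = 2     # clear fault threatening service
    ALARM = 3       # serious fault; cannot be cleared until resolved
    URGENT = 4      # critical fault, rapidly deteriorating

@dataclass(frozen=True)
class AlertPolicy:
    color: str
    audible: bool
    clearable: bool  # may the operator acknowledge-and-clear it?

# One policy per level, mirroring the display/sound rules above.
POLICIES = {
    Severity.MEMO:     AlertPolicy("white",  False, True),
    Severity.ADVISORY: AlertPolicy("yellow", False, True),
    Severity.WARNING:  AlertPolicy("yellow", True,  True),
    Severity.ALARM:    AlertPolicy("red",    True,  False),
    Severity.URGENT:   AlertPolicy("red",    True,  False),
}

def policy_for(level):
    return POLICIES[level]
```

Using `IntEnum` makes severities comparable, so routing rules like "page a human for anything at or above `ALARM`" become a single comparison.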

2) Alert Delivery Methods

PFD display: Shown on the primary flight display.

ND display: Shown on the navigation display.

EICAS display: Shown on the Engine Indication and Crew Alerting System display.

Other panel displays: Shown on various cockpit panels.

Master red alarm: Red master alarm light.

Master yellow alarm: Yellow master alarm light.

Dedicated alarm light: A specific colored light for the alarm.

Audio alarm: Various sound alerts.

Voice alarm: Spoken alert.

Other alerts: Control-stick vibration, etc.
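The lesson for IT monitoring is that higher severities should fan out to more delivery channels, just as the aircraft pairs displays, master lights, and audio. A minimal sketch (channel names are hypothetical, and actual send calls are stubbed):

```python
# Hypothetical channel fan-out: higher severity lights up more channels.
CHANNELS_BY_LEVEL = {
    "notification": ["dashboard"],
    "warning":      ["dashboard", "im_push"],
    "alarm":        ["dashboard", "im_push", "sms", "phone_call"],
}

def dispatch(level, message):
    """Return the delivery actions for one alert.

    In a real system each entry would trigger an actual send; here we
    just record which channel would carry which message.
    """
    return [f"{channel}:{message}" for channel in CHANNELS_BY_LEVEL[level]]
```

For example, `dispatch("alarm", "db down")` would fan the same message out to the dashboard, instant messaging, SMS, and an automated phone call.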

3) Monitoring Coverage Types

Bleed-air system: High-pressure air for pressurization, anti-icing, pneumatic pumps, air conditioning, and engine start.

Autopilot system: Handles more than 95% of flight time.

Communication system: Digital data link between aircraft and ground.

Electrical system: Detailed status of power distribution.

Engine system: The most critical and expensive component.

Fire detection: Cabin fire alarms and smoke detection.

Flight control: Control surfaces and flight computers.

Flight management & navigation: Advanced autopilot and navigation functions.

Additional categories include fuel, hydraulics, landing gear, flight‑protection, terrain, attitude, wind‑shear, etc., illustrating a multi‑dimensional monitoring approach.

4) Sample Alert Information

Alert name: Crew oxygen pressure low
Level: Advisory
Delivery: EICAS displays a yellow "CREW OXYGEN LOW" message
Trigger: Low pressure in the backup oxygen cylinder
Notes: Detailed status is available in the maintenance view; backup oxygen is used only during pressure loss or cabin smoke.

Alert name: Autopilot failure
Level: Alarm (escalates to urgent alarm during auto-landing)
Delivery: EICAS displays a red "AUTOPILOT DISC" message, with an audible tone and the red master alarm light
Trigger: The autopilot cannot stay in the commanded state, or the flight computer relinquishes control
Notes: Pulling the control stick and pressing the autopilot button switches to manual control (the PFD shows F/D mode).

2. Overall Approach to Centralized Monitoring

Enterprise production systems rely on the stable operation of the data-center environment, network, servers, software, databases, middleware, applications, and transaction layers. Many organizations suffer from information silos and duplicated monitoring tools; common problems include a lack of continuous optimization, tool redundancy, and insufficient data aggregation for performance and incident analysis.

Basic monitoring goal: "no missed alerts, few false alerts, high response".

Source-side tools focus on "no missed alerts, few false alerts"; the centralized platform focuses on "few false alerts, high response".

Source‑side tools should be layered to define coverage requirements.

Centralized platform aggregates performance metrics and alerts from source tools to provide common capabilities.

Data‑driven quantification of the three goals enables continuous optimization.

Combine monitoring metrics with logs, configuration, and workflow data, applying algorithms to further improve the goals.

Based on these principles, a layered monitoring architecture is proposed, ensuring coverage, enriching tool capabilities, and continuously enhancing monitoring through intelligence.

3. Layered View of Source‑Side Monitoring Tools

To manage many tools, organizations should classify tools by layer and define the capabilities each layer must provide. Existing custom metrics often reside in different tools; a structured integration plan helps replace or retain tools while maintaining coverage.

3.1 Monitoring Layer Architecture

1) Infrastructure Layer

Status monitoring: power, HVAC, network device health.

Performance monitoring: CPU, memory, session count, and port traffic.

Network monitoring: packet loss, latency, throughput.

Capacity monitoring: load utilization, bandwidth usage, outbound traffic distribution.

2) Server Layer

Storage: disk read/write errors, timeouts, disconnections.

Hardware: memory, NIC speed, power voltage, fan speed, RAID status.

Virtual machines: vCenter health.

3) Platform Service Layer

Operating system: CPU, memory, disk I/O, network I/O, connections, processes, file handles.

Database: connection count, slow SQL, missing indexes, parallel sessions, cache hit rate, replication lag, lock status.

Containers: cluster resource load, component health, node performance, TPS/QPS, and circuit-breaker, rate-limit, and timeout counts.

4) Application Service Layer

Service availability: service/port liveness, deadlock detection.

Application performance: transaction volume, success/failure rate, response time, GC metrics, thread count, deadlocks.

Call tracing: request volume, latency, timeout, rejection, URL health, slow SQL, exception counts.

Business transaction: order flow, logs, error logs.

Business status: whether the application meets operational requirements.
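Application-performance metrics such as response time only become actionable once raw samples are reduced to a single score. One common reduction (not claimed to be what any specific tool in this layer uses) is an Apdex-style calculation, sketched here with an assumed 500 ms target:

```python
def apdex(samples_ms, target_ms=500.0):
    """Apdex-style score from raw response-time samples.

    Samples at or below the target T count as satisfied, those up to 4T
    count half as tolerating, and slower ones as frustrated; the score
    is (satisfied + tolerating/2) / total, in [0, 1].
    """
    if not samples_ms:
        return None  # no traffic: no score, rather than a misleading 1.0
    satisfied = sum(1 for s in samples_ms if s <= target_ms)
    tolerating = sum(1 for s in samples_ms if target_ms < s <= 4 * target_ms)
    return (satisfied + tolerating / 2) / len(samples_ms)
```

Returning `None` for an empty window keeps "no traffic" distinguishable from "perfect performance", which matters when the score feeds alert rules.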

5) Customer Experience Layer

Synthetic user monitoring: simulate user visits, verify response data, assess availability, performance, and functional correctness.
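A synthetic check has to judge availability, functional correctness, and performance in a fixed order, since a fast 500 response is still an outage. A minimal sketch of that classification step (the `ProbeResult` type and category names are hypothetical; the HTTP fetch itself is omitted):

```python
from dataclasses import dataclass

@dataclass
class ProbeResult:
    status_code: int
    latency_ms: float
    body_ok: bool  # did the response body match the expected content?

def evaluate(result, latency_budget_ms=500.0):
    """Classify one synthetic check: availability first, then
    correctness, then speed."""
    if result.status_code >= 500:
        return "unavailable"
    if not result.body_ok:
        return "functional_error"
    if result.latency_ms > latency_budget_ms:
        return "degraded"
    return "healthy"
```

Checking the body and not just the status code catches the case where the service answers 200 but returns wrong or empty data.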

4. Unified Event/Alert Management

Effective monitoring should present only the information that requires human attention; automated self‑healing should be handled without cluttering the view. Event integration includes aggregation, convergence, grading, and analysis.

Event aggregation: combine events from different layers and domains.

Event convergence: collapse repetitive alerts from the same fault.

Event grading: define standardized levels such as Notification, Warning, Alarm.

Event analysis: build correlation graphs across infrastructure, applications, and business transactions.
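Of these four steps, convergence is the most mechanical: repeats of the same fault fingerprint inside a time window are suppressed so only the first occurrence reaches a human. A minimal sketch, assuming events arrive sorted by time and that a `(source, check)` pair is a good-enough fingerprint:

```python
def converge(events, window_s=300):
    """Collapse repeats of the same (source, check) fingerprint.

    `events` is an iterable of (timestamp_s, source, check) tuples,
    assumed sorted by time; returns the surviving events and how many
    duplicates were suppressed within the window.
    """
    last_seen = {}
    survivors, suppressed = [], 0
    for ts, source, check in events:
        key = (source, check)
        if key in last_seen and ts - last_seen[key] < window_s:
            suppressed += 1          # duplicate inside the window
        else:
            survivors.append((ts, source, check))
        last_seen[key] = ts
    return survivors, suppressed
```

Real platforms typically also attach the suppressed count to the surviving event, so the operator can see how noisy the underlying fault is.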

4.1 Event Enrichment

Beyond detection, enriched events provide detailed descriptions, topology context, and knowledge‑base references, reducing the time needed for fault isolation and remediation.

4.2 Event Grading and Handling

Alarm : Business‑impacting, requires immediate action.

Warning : Abnormal but not yet impacting business; needs attention and may upgrade if unattended.

Notification : Informational, e.g., daily login counts.

Handling strategies include push notifications (WeChat, SMS), automated phone calls for urgent alarms, visual dashboards with color‑coded severity, and publicizing delayed handling to improve accountability.
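The rule that an unattended warning may upgrade can be implemented as a simple time-based escalation check. A minimal sketch, with the 15-minute deadline chosen arbitrarily for illustration:

```python
def escalate_unacked(alert_level, age_s, acked, upgrade_after_s=900):
    """Promote a warning left unacknowledged past the deadline to an alarm.

    `age_s` is how long the alert has been open; `acked` is whether any
    operator has acknowledged it. All other levels pass through unchanged.
    """
    if alert_level == "warning" and not acked and age_s >= upgrade_after_s:
        return "alarm"
    return alert_level
```

Running a sweep like this on a schedule makes accountability automatic: warnings nobody claims eventually become alarms that page someone.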

5. Unified Performance Metric Data

Integrating raw performance data from multiple sources into a single view enables comprehensive analysis across infrastructure, applications, and business layers, supporting both offline reporting and real‑time alert escalation.

6. Monitoring Data Operations

Continuous improvement should focus on the three core goals: no missed alerts, few false alerts, high response. Quantitative metrics such as MTBF, MTTI, MTTK, MTTF, and MTTR guide optimization, while automation and intelligent wearables can further accelerate response.
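Two of these metrics can be computed directly from incident records. A minimal sketch, assuming incidents are `(start, end)` outage intervals in seconds, sorted by start, and measuring MTBF start-to-start (one of several common conventions):

```python
def mttr_mtbf(incidents):
    """Mean time to repair and mean time between failures, in seconds.

    `incidents` is a list of (start_s, end_s) outage intervals sorted by
    start time. MTBF needs at least two incidents; with fewer it is None.
    """
    if not incidents:
        return None, None
    # MTTR: average outage duration.
    mttr = sum(end - start for start, end in incidents) / len(incidents)
    if len(incidents) < 2:
        return mttr, None
    # MTBF: average gap between consecutive incident starts.
    gaps = [b[0] - a[0] for a, b in zip(incidents, incidents[1:])]
    return mttr, sum(gaps) / len(gaps)
```

Tracking these week over week turns "continuous improvement" from a slogan into a trend line the team can act on.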

7. God‑View Monitoring

Future monitoring will adopt digital‑twin and AIOps concepts to provide a global, online, observable, penetrable, and predictive "god‑view" of the system, improving precision, reducing manual threshold configuration, supporting cloud‑native environments, and expanding coverage to business and customer‑experience layers.

Source: "Operations Road" public account.
Tags: monitoring, operations, alert management, AIOps, digital twin, centralized monitoring
Written by Efficient Ops

This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.