How End-State‑Oriented Monitoring Transforms Operations and AIOps
This article explains the concept of end‑state‑oriented monitoring, its significance for modern operations, the shortcomings of existing solutions, and a layered design approach that leverages real‑time data, service catalogs, and AI to achieve secure, stable, efficient, and low‑cost operations.
1. What Is End‑State Orientation
End‑state orientation is a design method in the operations field. It defines four core capability domains—security, stability, efficiency, and low cost—and spans four stages of operational maturity: manual ops, automated ops, DevOps, and AIOps. The focus is on the final state of systems, users, business, and scenarios.
For a system, end‑state describes its ultimate form: users specify *what* they need, and the system automatically adjusts based on real‑time conditions and an operational knowledge base to achieve security and stability. For example, during a gray‑release process, the system dynamically executes traffic‑shaping strategies according to the release requirements.
For a user, end‑state describes the desired outcome without requiring detailed implementation logic. Users provide declarative API requests, and the system fulfills them, exposing only the final usage interface.
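The declarative pattern described above can be sketched as a reconciliation loop: the user declares the desired end state, and the system repeatedly compares it with the observed state and acts on the difference. This is an illustrative sketch only; the names (`reconcile`, `apply_actions`, the state keys) are hypothetical and not tied to any specific platform.

```python
# Minimal sketch of a declarative reconciliation loop: the user declares
# *what* they want; the loop works out *how* to get there.
def reconcile(desired: dict, observed: dict) -> list:
    """Return the actions needed to move observed state toward desired state."""
    actions = []
    for key, want in desired.items():
        have = observed.get(key)
        if have != want:
            actions.append((key, have, want))  # e.g. ("replicas", 2, 5)
    return actions

def apply_actions(observed: dict, actions: list) -> dict:
    """Apply each action, producing the new observed state."""
    state = dict(observed)
    for key, _have, want in actions:
        state[key] = want
    return state

# Hypothetical gray-release example: declare the end state, let the loop converge.
desired = {"replicas": 5, "canary_traffic_pct": 10}
observed = {"replicas": 2, "canary_traffic_pct": 0}
observed = apply_actions(observed, reconcile(desired, observed))
assert reconcile(desired, observed) == []  # converged to the declared end state
```

In a real system the "apply" step would call infrastructure APIs and the loop would run continuously, but the user-facing contract stays the same: state the outcome, not the steps.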
2. Why a Monitoring Platform Matters
As online services grow, non‑functional requirements such as availability, reliability, and business continuity become critical, demanding comprehensive monitoring. Monitoring also supports data‑storage cost control, value extraction, and data‑derived capabilities, aligning with the end‑state philosophy.
In operations, the business‑assurance domain is the core of a monitoring platform, offering full‑coverage, business‑centric visibility, fault detection, diagnosis, recovery, and prediction. Fault prediction leverages accumulated operational experience to identify patterns in data. Monitoring data become valuable assets, serving both business and technical needs, and supporting capacity and procurement planning.
Key characteristics of an end‑state‑oriented monitoring platform include:

- Real‑time business‑link monitoring
- Application‑level anomaly detection
- User‑centric data aggregation
- Real‑time data exchange
- Unified alert convergence
From the user perspective, the platform must cover the entire fault‑handling lifecycle: pre‑incident monitoring, real‑time alerting, and post‑incident analysis. It should provide service‑catalog interpretation, automated coverage, intelligent alert thresholds, and a knowledge base for fault handling.
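One common way to realize "intelligent alert thresholds" is to derive the bound from recent history rather than fixing it by hand. The sketch below uses a rolling median plus MAD (median absolute deviation); this is one illustrative technique under assumed data, not the article's prescribed algorithm.

```python
import statistics

def dynamic_threshold(history, k=3.0):
    """Upper alert threshold = median + k * MAD of recent samples."""
    med = statistics.median(history)
    mad = statistics.median(abs(x - med) for x in history)
    return med + k * mad

# Hypothetical recent p99 latencies in ms; new samples are judged
# against the learned bound instead of a hand-set constant.
latencies_ms = [102, 98, 101, 99, 103, 100, 97, 104]
threshold = dynamic_threshold(latencies_ms)

new_sample = 250.0
assert new_sample > threshold  # this sample would fire an alert
```

Because the threshold follows the data, it adapts per metric and per user group, which is what makes alerting feel "intelligent" rather than noisy.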
Data lifecycle management should enable operational monitoring, security auditing, business analysis, and data export, while acting as a hub for third‑party data integration.
Monitoring must collect data across layers—from network, servers, storage, systems, applications to business—and correlate end‑to‑end data across users, traffic, clusters, and product operations.
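Correlating end‑to‑end data across layers typically relies on a shared tag (a request or trace identifier) attached to every event. The sketch below shows the idea; the field names (`layer`, `request_id`) are illustrative, not a specific platform's schema.

```python
from collections import defaultdict

# Events emitted at different layers, all carrying the same request_id tag.
events = [
    {"layer": "network",     "request_id": "r1", "latency_ms": 3},
    {"layer": "application", "request_id": "r1", "latency_ms": 40},
    {"layer": "business",    "request_id": "r1", "order_ok": True},
    {"layer": "application", "request_id": "r2", "latency_ms": 900},
]

def correlate(events):
    """Group events across layers by request_id for an end-to-end view."""
    by_request = defaultdict(list)
    for e in events:
        by_request[e["request_id"]].append(e)
    return dict(by_request)

linked = correlate(events)
assert len(linked["r1"]) == 3  # one request observed at three layers
```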
3. Problems with Existing Monitoring Solutions
- Limited global visibility: solutions focus on specific domains or capabilities.
- Inability to aggregate link and data relationships across sub‑domains, leading to long troubleshooting times.
- Excessive, noisy alerts produced by multiple independent monitoring tools.
- No automatic data collection at the business‑link, interface, and method levels, so coverage is incomplete and hard to measure.
- Insufficient data aggregation and secondary computation, which reduces the value of data assets.
- No large‑scale time‑series analysis or root‑cause detection capabilities.
4. End‑State‑Oriented Monitoring Design
The design emphasizes user‑friendly interaction and service‑catalog integration for third‑party data:

- Provide visualized service content to lower the cost of use and broaden user coverage, prioritizing business‑first monitoring.
- Integrate third‑party data through a service mediator that offers plug‑and‑play interfaces and automatically handles the addition or removal of APIs.
- Assign services via the mediator, tag the data, and deliver processed results to specific user groups via real‑time, near‑real‑time, or offline processing.
- Let users choose which services they connect to.
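The plug‑and‑play service mediator described above could be sketched as a registry that lets data sources be attached and detached at runtime. This is a minimal illustration; the class and method names are hypothetical.

```python
class ServiceMediator:
    """Registry letting third-party data sources plug in and out at runtime."""

    def __init__(self):
        self._sources = {}

    def register(self, name, fetch_fn):
        """Plug a source in: fetch_fn returns one batch of data."""
        self._sources[name] = fetch_fn

    def deregister(self, name):
        """Plug a source out; unknown names are ignored."""
        self._sources.pop(name, None)

    def collect(self):
        """Pull one batch from every registered source, keyed by source name."""
        return {name: fetch() for name, fetch in self._sources.items()}

# Hypothetical sources: a CMDB inventory feed and an APM latency feed.
mediator = ServiceMediator()
mediator.register("cmdb", lambda: {"hosts": 120})
mediator.register("apm", lambda: {"p99_ms": 85})
assert set(mediator.collect()) == {"cmdb", "apm"}

mediator.deregister("apm")  # removing an API needs no code change elsewhere
assert set(mediator.collect()) == {"cmdb"}
```

Keeping the mediator as the single integration point is what lets users customize their connected services without touching the processing pipeline.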
Data processing focuses on cleansing and metric‑layer calculations, supporting diverse sources such as Kafka, ClickHouse, or VictoriaMetrics. Real‑time ingestion, file ingestion, and real‑time aggregation are required. Metric computation must handle high throughput, standard SQL, time windows, and deduplication, adapting dynamically to tag and user characteristics.
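Time‑window computation with deduplication, as required above, can be sketched as: drop samples whose event id was already seen, bucket the rest into fixed windows, then aggregate per window. A real pipeline would run on a stream engine against sources like Kafka; this in‑memory sketch with hypothetical sample data only illustrates the semantics.

```python
from collections import defaultdict

def windowed_avg(samples, window_s=60):
    """Average value per fixed time window, deduplicated by event id.

    samples: iterable of (event_id, timestamp_s, value) tuples.
    """
    seen = set()
    buckets = defaultdict(list)
    for event_id, ts, value in samples:
        if event_id in seen:        # deduplicate replayed/at-least-once events
            continue
        seen.add(event_id)
        buckets[ts // window_s].append(value)
    return {window: sum(vals) / len(vals) for window, vals in buckets.items()}

samples = [
    ("e1", 10, 100.0),
    ("e2", 30, 200.0),
    ("e1", 31, 100.0),   # duplicate delivery, ignored
    ("e3", 70, 300.0),
]
assert windowed_avg(samples) == {0: 150.0, 1: 300.0}
```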
For feedback, the platform continuously re‑examines system state to inform end‑state decisions, forming a model pipeline for anomaly detection.
Alert analysis should include temporal trend analysis, metric comparison to improve recall, and labeling of known alerts to refine models. Root‑cause analysis uses data drilling and convergence to validate algorithm‑identified primary causes and adjust thresholds.
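The "data drilling" step of root‑cause analysis often amounts to breaking an anomalous metric down by a dimension and ranking which value contributes most to the spike. The sketch below shows that ranking step in isolation; the metric and dimension names are invented for illustration.

```python
def drill_down(current, baseline):
    """Rank dimension values by their contribution to a metric spike.

    current / baseline: {dimension_value: error_count} for the anomalous
    window and a normal reference window, respectively.
    """
    deltas = {k: current.get(k, 0) - baseline.get(k, 0)
              for k in set(current) | set(baseline)}
    return sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical error counts per API during a normal and an anomalous window.
baseline = {"api-a": 5, "api-b": 4, "api-c": 6}
current  = {"api-a": 6, "api-b": 52, "api-c": 5}

ranked = drill_down(current, baseline)
assert ranked[0] == ("api-b", 48)  # top candidate root cause, to be validated
```

As the article notes, the algorithm's top candidate is a starting point: it still gets validated against the link data and fed back to adjust thresholds and models.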
The platform should be layered as follows: User Experience Layer, Service Capability Layer, Data Analysis Layer, Data Processing Layer, and Backend Management Layer.
5. Closing Thoughts
End‑state orientation inherits AIOps principles: declarative APIs deliver a low‑friction user experience, and machine learning amplifies the value of data output. This design decouples the underlying technology from the user‑facing interface, embeds data into the user experience, tightens feedback loops, expands the monitoring user base, and promotes secure, stable, efficient, low‑cost operations while solving many data‑operation challenges.
This article, the first in a series, introduces the end‑state design method; subsequent posts will detail the implementation of an end‑state‑oriented monitoring platform.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.