How ZEUS Turns Monitoring Data into Automated Decisions for Enterprise Systems
ZEUS, Suning’s decision analysis platform, integrates monitoring data from tools like Baymax and HIRO, applies CEP aggregation and Drools rule evaluation, and leverages big‑data storage and machine‑learning models to automatically identify root causes, provide real‑time alerts, and enable self‑healing in large‑scale distributed systems.
1. Background
Enterprises increasingly adopt distributed system design and micro‑services, which leads to complex dependencies between internal systems and weak integration among them. Existing log monitoring tools lack full‑link (end‑to‑end) monitoring, so problem localization and root‑cause analysis are time‑consuming, and there is no automated decision‑control (self‑healing) mechanism.
2. About MuJia
MuJia is the middleware R&D center’s end‑to‑end system monitoring product line at Suning Data Cloud. Four products have been launched: Baymax, HIRO, the intelligent alarm platform APOLLO, and ZEUS.
3. ZEUS Overview
What is Decision Analysis
Decision analysis uses massive historical and real‑time data, applying big‑data analytics, rule matching, and self‑learning to transform data into user‑readable decision information, reducing low‑level information processing and improving decision quality and efficiency.
ZEUS Introduction
ZEUS ingests raw monitoring data, aggregates it with complex event processing (CEP), evaluates rules with Drools, and provides root‑cause hints to shorten problem localization time. Key features include:
Rule engine library built from Suning system and IDC environment models, supporting user‑defined rules.
Precise, second‑level response and automated decision suggestions.
System portrait, resource control, and promotional monitoring views.
Extensible analysis capabilities with open platform interfaces.
Integration with Suning IaaS/PaaS and operation management platforms for auto‑scaling and fault handling.
Support for SaaS public cloud and private deployment.
4. ZEUS Architecture
ZEUS consists of six layers from top to bottom: source data layer, data access layer, decision analysis layer, data storage layer, portal layer, and external output layer.
Source Data Layer
Collects data from Baymax, HIRO, hosts, release events, CDN, and mobile apps. Future plans include integrating user profiles, backend systems, IDC, and business holographic data.
Data Access Layer
Provides real‑time and offline data ingestion methods for different scenarios.
Decision Analysis Layer
Core layer handling real‑time and offline analysis. Real‑time analysis aggregates streaming data for immediate decisions; offline analysis processes long‑term data, matching user‑defined, expert, and machine‑learning rules. Machine learning uses supervised learning from expert systems and user feedback.
Example rule files: epl.xml (EPL statements for CEP aggregation) and rule.drl (Drools rules).
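The following is not the content of epl.xml or rule.drl. It is only a minimal Python sketch of the kind of windowed aggregation a CEP step performs on streaming data before rules are evaluated. The `Event` fields, the `tumbling_avg` function, and the 10-second window are all hypothetical names and values chosen for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # All fields are hypothetical; the article does not describe the event schema.
    ts: float      # event time in seconds
    app: str       # application that emitted the sample
    metric: str    # metric name, e.g. "rt_ms"
    value: float


def tumbling_avg(events, window_s=10):
    """Average each (app, metric) over fixed, non-overlapping time windows."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, sample count]
    for e in events:
        window_start = int(e.ts // window_s) * window_s
        bucket = sums[(e.app, e.metric, window_start)]
        bucket[0] += e.value
        bucket[1] += 1
    return {key: total / n for key, (total, n) in sums.items()}


events = [
    Event(1.0, "cart", "rt_ms", 100.0),
    Event(5.0, "cart", "rt_ms", 300.0),
    Event(12.0, "cart", "rt_ms", 50.0),
]
averages = tumbling_avg(events)
```

Each aggregated value, keyed by application, metric, and window start, would then be handed to the rule engine as a single fact instead of many raw samples.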
Data Storage Layer
Raw data stored in HDFS; processed data stored in HBase, Hive, Druid, and MySQL.
Portal Layer
Provides data visualization and interaction, allowing rule configuration, decision event viewing, and system health monitoring.
External Output Layer
Sends decision events to alarm platforms, to the unified monitoring view used during promotions, and to the CD, PCP, and ITSM platforms.
5. Core Process
The platform receives monitoring data from sources such as Baymax, HIRO, and Zabbix, then aggregates and analyzes it via CEP.
Aggregated events enter the rule engine, generating EPL and DRL files for rule evaluation.
Results are stored in HDFS.
Aggregated results are further processed and displayed to users or sent to APOLLO for alerts.
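The four steps above can be sketched end to end in a few lines of Python. This is an illustration under stated assumptions, not the ZEUS implementation: a `Counter` stands in for CEP aggregation, a simple count limit stands in for Drools rule evaluation, and plain lists stand in for the storage layer and the alert platform. Every function name and dictionary key here is hypothetical.

```python
from collections import Counter


def aggregate(raw_events):
    """Step 1: count events per (system, kind); a stand-in for CEP aggregation."""
    return Counter((e["system"], e["kind"]) for e in raw_events)


def evaluate(aggregates, limits):
    """Step 2: keep aggregates whose count exceeds the limit for that kind."""
    return [
        {"system": system, "kind": kind, "count": count}
        for (system, kind), count in sorted(aggregates.items())
        if count > limits.get(kind, float("inf"))
    ]


def run_pipeline(raw_events, limits, store, alert_sink):
    decisions = evaluate(aggregate(raw_events), limits)
    store.extend(decisions)  # Step 3: stand-in for persisting results
    for d in decisions:      # Step 4: stand-in for forwarding to an alert platform
        alert_sink.append(f"{d['system']}: {d['kind']} x{d['count']}")
    return decisions


raw = [{"system": "order", "kind": "timeout"}] * 3 + [{"system": "cart", "kind": "timeout"}]
store, alerts = [], []
decisions = run_pipeline(raw, {"timeout": 2}, store, alerts)
```

Keeping each stage a separate function mirrors the layered design described earlier: any one stage can be swapped out without changing the others.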
6. Technical Challenges
Data Association Storage : Integrating diverse data types for unified analysis and efficient storage.
Rule Complexity : Managing multi‑condition rules without excessive implementation difficulty.
Rule Extraction : Deriving valuable decision rules from historical data using statistical methods and machine learning.
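For the rule extraction challenge, the article names statistical methods and machine learning without giving specifics. One simple statistical approach, shown here purely as an assumed example, is to suggest an alert threshold from a metric's history as the mean plus k standard deviations. The function name `suggest_threshold` and the default `k=3.0` are hypothetical.

```python
import statistics


def suggest_threshold(history, k=3.0):
    """Suggest a threshold as mean + k * population standard deviation.

    This is one common heuristic, not a method the article attributes to ZEUS.
    """
    if not history:
        raise ValueError("history must contain at least one sample")
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history)
    return mean + k * stdev


# Hypothetical historical samples of some metric.
history = [2, 4, 4, 4, 5, 5, 7, 9]
threshold = suggest_threshold(history)
```

A threshold derived this way is only a starting point; an expert or user feedback would still need to confirm it before it becomes a rule.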
7. Rule Definition Example
A flexible rule set must define the problem scenario, collect real‑time feature data from application, service, and host layers, and extract indicators with thresholds.
Rule definition logic includes factors such as metric set, metric, statistical function (avg, sum, count, max, min, tp, variance), relational operators, thresholds, and logical operators.
Example composite rule for a promotion event:
(response timeout count > n) && (network traffic > m) && (packet loss rate > p)
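The composite rule above can be modelled as a list of conditions, each combining a metric, a statistical function, a relational operator, and a threshold, joined by a logical AND. The sketch below is a plain Python stand-in, not Drools syntax. The `Condition` class, the metric names, and the threshold values substituted for n, m, and p are hypothetical, and the `tp` (percentile) function is left out for brevity.

```python
import operator
import statistics
from dataclasses import dataclass

# Statistical functions from the rule definition (tp omitted).
STATS = {
    "avg": statistics.fmean,
    "sum": sum,
    "count": len,
    "max": max,
    "min": min,
    "variance": statistics.pvariance,
}
# Relational operators.
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


@dataclass(frozen=True)
class Condition:
    metric: str
    stat: str
    op: str
    threshold: float

    def holds(self, samples):
        values = samples.get(self.metric, [])
        if not values:
            return False  # no data for the metric: the condition cannot hold
        return OPS[self.op](STATS[self.stat](values), self.threshold)


def all_hold(conditions, samples):
    """Logical AND over all conditions, matching the && in the example rule."""
    return all(c.holds(samples) for c in conditions)


# Placeholder thresholds standing in for n, m, and p.
promotion_rule = [
    Condition("response_timeout", "count", ">", 2),
    Condition("network_traffic_mbps", "avg", ">", 800),
    Condition("packet_loss_rate", "max", ">", 0.01),
]
samples = {
    "response_timeout": [1, 1, 1],
    "network_traffic_mbps": [900, 950],
    "packet_loss_rate": [0.005, 0.02],
}
fired = all_hold(promotion_rule, samples)
```

Representing conditions as data rather than hard-coded logic is what allows rules to be user-defined: a new rule is a new list of conditions, not new code.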
8. Future Plans
Enrich product functions with user feedback, problem replay, and multi‑dimensional root‑cause analysis.
Apply machine learning, deep learning, and NLP more extensively to achieve intelligent decision making and reduce reliance on expert systems.
Increase platform openness, sharing raw data and decision results for collaborative data mining.
Suning Technology
Official Suning Technology account. Explains cutting-edge retail technology and shares Suning's tech practices.
