
How ZEUS Turns Monitoring Data into Automated Decisions for Enterprise Systems

ZEUS, Suning’s decision analysis platform, integrates monitoring data from tools like Baymax and HIRO, applies CEP aggregation and Drools rule evaluation, and leverages big‑data storage and machine‑learning models to automatically identify root causes, provide real‑time alerts, and enable self‑healing in large‑scale distributed systems.

Suning Technology

1. Background

Enterprises today widely adopt distributed system design and microservices, which leaves them with complex internal relationships between systems and a low degree of integration. Existing log monitoring tools lack full‑link monitoring, so locating a problem and analyzing its root cause is time‑consuming, and there is no automated decision‑control (self‑healing) mechanism.

2. About MuJia

MuJia is the end‑to‑end system monitoring product line built by the middleware R&D center at Suning Data Cloud. It has launched four products so far: Baymax, HIRO, the intelligent alarm platform APOLLO, and ZEUS.

3. ZEUS Overview

What is Decision Analysis

Decision analysis draws on massive historical and real‑time data and applies big‑data analytics, rule matching, and self‑learning to turn that data into decision information users can read directly. This reduces low‑level information processing and improves both the quality and the efficiency of decisions.

ZEUS Introduction

ZEUS collects raw monitoring data, aggregates it with CEP (complex event processing), evaluates rules with Drools, and surfaces root‑cause hints that shorten problem localization time. Key features include:

Rule engine library built from Suning system and IDC environment models, supporting user‑defined rules.

Precise responses within seconds, with automated decision suggestions.

System profiles (portraits), resource control, and monitoring views for promotion events.

Extensible analysis capabilities with open platform interfaces.

Integration with Suning IaaS/PaaS and operation management platforms for auto‑scaling and fault handling.

Support for SaaS public cloud and private deployment.

4. ZEUS Architecture

ZEUS consists of six layers from top to bottom: source data layer, data access layer, decision analysis layer, data storage layer, portal layer, and external output layer.

Source Data Layer

Collects data from Baymax, HIRO, hosts, release events, CDN, and mobile apps. Future plans include integrating user profiles, backend systems, IDC, and business holographic data.

Data Access Layer

Provides real‑time and offline data ingestion methods for different scenarios.

Decision Analysis Layer

Core layer handling real‑time and offline analysis. Real‑time analysis aggregates streaming data for immediate decisions; offline analysis processes long‑term data, matching user‑defined, expert, and machine‑learning rules. Machine learning uses supervised learning from expert systems and user feedback.

Example rule files: epl.xml and rule.drl (file contents not preserved in this copy).
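The real‑time path of the decision analysis layer aggregates streaming data before any rule is evaluated. CEP engines usually express this as a time window over an event stream; the sketch below approximates that idea in plain Python. It is not the contents of epl.xml, and the names `MetricEvent` and `SlidingWindow` are hypothetical. It also assumes events arrive in timestamp order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class MetricEvent:
    timestamp: float  # seconds; assumed to arrive in non-decreasing order
    value: float


class SlidingWindow:
    """Keeps the events from the last `span` seconds and aggregates them."""

    def __init__(self, span: float) -> None:
        self.span = span
        self.events: deque = deque()

    def add(self, event: MetricEvent) -> None:
        self.events.append(event)
        # Evict events that have fallen out of the window.
        cutoff = event.timestamp - self.span
        while self.events and self.events[0].timestamp <= cutoff:
            self.events.popleft()

    def count(self) -> int:
        return len(self.events)

    def avg(self) -> float:
        if not self.events:
            return 0.0
        return sum(e.value for e in self.events) / len(self.events)


window = SlidingWindow(span=10.0)
for t, v in [(0.0, 100.0), (5.0, 200.0), (12.0, 300.0)]:
    window.add(MetricEvent(t, v))
# The event at t=0 is older than 10 seconds when t=12 arrives, so it is evicted.
```

A rule engine would then read `window.avg()` or `window.count()` instead of the raw events, which keeps rule conditions small.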

Data Storage Layer

Raw data stored in HDFS; processed data stored in HBase, Hive, Druid, and MySQL.

Portal Layer

Provides data visualization and interaction, allowing rule configuration, decision event viewing, and system health monitoring.

External Output Layer

Sends decision events to alarm platforms, to the unified monitoring view used during promotions, and to the CD, PCP, and ITSM platforms.

5. Core Process

The platform receives monitoring data from sources such as Baymax, HIRO, and Zabbix, then aggregates and analyzes it with CEP.

Aggregated events enter the rule engine, which generates the EPL and DRL files used for rule evaluation.

Results are stored in HDFS.

Aggregated results are further processed and displayed to users or sent to APOLLO for alerts.
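The four steps above can be sketched as a small pipeline. This is an illustration under stated assumptions, not ZEUS code: the sample format, the `DecisionEvent` type, and the function names are hypothetical. A plain list stands in for the storage layer, and a callback stands in for forwarding to an alarm platform such as APOLLO.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class DecisionEvent:
    rule_name: str
    system: str
    detail: Dict[str, float]


def aggregate(samples: List[dict]) -> Dict[str, Dict[str, float]]:
    """Step 1: group raw samples by system and average each metric."""
    grouped: Dict[str, Dict[str, List[float]]] = {}
    for s in samples:
        grouped.setdefault(s["system"], {}).setdefault(s["metric"], []).append(s["value"])
    return {
        system: {metric: mean(values) for metric, values in metrics.items()}
        for system, metrics in grouped.items()
    }


def evaluate(aggregated: Dict[str, Dict[str, float]],
             rules: Dict[str, Callable[[Dict[str, float]], bool]]) -> List[DecisionEvent]:
    """Step 2: run every rule against each system's aggregated metrics."""
    events = []
    for system, metrics in aggregated.items():
        for name, predicate in rules.items():
            if predicate(metrics):
                events.append(DecisionEvent(name, system, metrics))
    return events


def run_pipeline(samples, rules, store: list,
                 alert: Callable[[DecisionEvent], None]) -> List[DecisionEvent]:
    """Steps 3 and 4: persist the results, then forward them for alerting."""
    events = evaluate(aggregate(samples), rules)
    store.extend(events)   # stand-in for the storage layer
    for event in events:
        alert(event)       # stand-in for sending to an alarm platform
    return events


samples = [
    {"system": "cart", "metric": "rt_ms", "value": 900.0},
    {"system": "cart", "metric": "rt_ms", "value": 1100.0},
    {"system": "search", "metric": "rt_ms", "value": 50.0},
]
# Hypothetical rule: average response time above 500 ms.
rules = {"slow_response": lambda m: m.get("rt_ms", 0.0) > 500.0}
store, alerts = [], []
events = run_pipeline(samples, rules, store, alerts.append)
```

Keeping aggregation, evaluation, and output as separate functions mirrors the layered architecture: each stage can be swapped out without touching the others.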

6. Technical Challenges

Data Association Storage: Integrating diverse data types for unified analysis and efficient storage.

Rule Complexity: Managing multi‑condition rules without excessive implementation difficulty.

Rule Extraction: Deriving valuable decision rules from historical data using statistical methods and machine learning.
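As one illustration of the rule extraction challenge, a threshold can be proposed from historical samples with a simple statistical rule of thumb: the mean plus k standard deviations. The article does not say which statistical methods ZEUS uses, so this is only an assumed example, and `suggest_threshold` is a hypothetical helper.

```python
from statistics import mean, pstdev
from typing import Sequence


def suggest_threshold(history: Sequence[float], k: float = 3.0) -> float:
    """Propose an alert threshold as mean + k * population standard deviation.

    Assumption: the metric is roughly stable over `history`, so values far
    above the mean are worth flagging. Seasonal metrics need more care.
    """
    if len(history) < 2:
        raise ValueError("need at least two historical samples")
    return mean(history) + k * pstdev(history)


# Hypothetical response times in milliseconds.
threshold = suggest_threshold([10.0, 12.0, 8.0, 10.0])
```

A threshold produced this way is a starting point that an expert or a feedback loop would still need to confirm before it becomes a rule.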

7. Rule Definition Example

A flexible rule set must define the problem scenario, collect real‑time feature data from application, service, and host layers, and extract indicators with thresholds.

A rule definition combines a metric set, a metric, a statistical function (avg, sum, count, max, min, tp (top percentile), variance), a relational operator, a threshold, and logical operators that join individual conditions.
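The factors listed above map naturally onto a small data structure. The sketch below is a minimal assumption of how one condition might be modeled and evaluated; `Condition`, `STAT_FUNCS`, and `OPERATORS` are hypothetical names. The article names tp without defining it, so it is implemented here as a nearest‑rank percentile, and variance as the population variance.

```python
import math
import operator
from dataclasses import dataclass
from statistics import mean, pvariance
from typing import Callable, Dict, List


def tp(values: List[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=90 for TP90 (assumed definition)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


STAT_FUNCS: Dict[str, Callable[[List[float]], float]] = {
    "avg": mean,
    "sum": sum,
    "count": len,
    "max": max,
    "min": min,
    "tp90": lambda v: tp(v, 90),
    "variance": pvariance,
}

OPERATORS: Dict[str, Callable[[float, float], bool]] = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
}


@dataclass
class Condition:
    metric_set: str   # e.g. "host"
    metric: str       # e.g. "cpu"
    func: str         # key into STAT_FUNCS
    op: str           # key into OPERATORS
    threshold: float

    def holds(self, data: Dict[str, Dict[str, List[float]]]) -> bool:
        values = data[self.metric_set][self.metric]
        return OPERATORS[self.op](STAT_FUNCS[self.func](values), self.threshold)


# Hypothetical CPU utilization samples for one host.
data = {"host": {"cpu": [50.0, 90.0, 100.0]}}
cpu_hot = Condition("host", "cpu", "avg", ">", 75.0).holds(data)
```

Because functions and operators are looked up by name, a rule can be stored as plain configuration and new statistical functions can be added without changing `Condition`.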

Example composite rule for a promotion event:

(response timeout count > n) && (network traffic > m) && (packet loss rate > p)
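The composite rule can be written as a small expression tree in which logical operators combine threshold checks. This is a sketch only. The class names, the metric keys, and the concrete values standing in for n, m, and p are hypothetical, since the article leaves those thresholds unspecified.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Gt:
    """Leaf node: true when the named metric exceeds the threshold."""
    metric: str
    threshold: float

    def evaluate(self, metrics: Dict[str, float]) -> bool:
        return metrics[self.metric] > self.threshold


@dataclass(frozen=True)
class And:
    """True only when every child node is true."""
    terms: Tuple

    def evaluate(self, metrics: Dict[str, float]) -> bool:
        return all(t.evaluate(metrics) for t in self.terms)


@dataclass(frozen=True)
class Or:
    """True when at least one child node is true."""
    terms: Tuple

    def evaluate(self, metrics: Dict[str, float]) -> bool:
        return any(t.evaluate(metrics) for t in self.terms)


# Placeholder thresholds for n, m, and p; the real values are not given.
N, M, P = 100, 800.0, 0.01

promotion_rule = And((
    Gt("response_timeout_count", N),
    Gt("network_traffic_mbps", M),
    Gt("packet_loss_rate", P),
))

fired = promotion_rule.evaluate({
    "response_timeout_count": 250,
    "network_traffic_mbps": 950.0,
    "packet_loss_rate": 0.03,
})
```

Nesting `And` and `Or` nodes lets more complex rules be built from the same leaves, which is one way to keep multi‑condition rules manageable.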

8. Future Plans

Enrich product functions with user feedback, problem replay, and multi‑dimensional root‑cause analysis.

Deeply apply machine learning, deep learning, and NLP to achieve intelligent decision making and reduce reliance on expert systems.

Increase platform openness, sharing raw data and decision results for collaborative data mining.
