How SAE’s Cloud‑Native Event Center Tackles Data Explosion and Real‑Time Alerts
This article explains the design and implementation of the Serverless Application Engine (SAE) Event Center. It covers the cloud‑native architecture, how the Event Center differs from traditional monitoring, the data‑explosion and full‑GC challenges encountered, and the distributed‑cache solution that enables efficient real‑time event aggregation and notification, along with plans for AI‑driven diagnostics.
Background
The SAE Event Center is built to provide a higher‑level entry point for users to view and manage abnormal events across the platform, offering notification and alert capabilities beyond the low‑level Kubernetes native events that most users find difficult to interpret.
Overall Architecture
The system consists of two main parts: resource services and the event center.
Resource Services
K8s: collects native Kubernetes events such as pod, workload, and network records.
Fast System: a hundred‑millisecond‑level service for Web scenarios, storing instance information and version‑switch events.
Event Center
Event Consumption: real‑time consumption of raw logs.
Event Diagnosis: cleans and caches massive raw data.
Event Generation: writes cleaned events to the event store based on a fixed event model.
Event Message Rule Subscription: delivers generated events via DingTalk, SMS, email, etc., according to user‑defined rules.
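As a rough illustration of the last two stages, the sketch below models a generated event and matches it against user-defined subscription rules. The source does not publish the actual event model or rule format, so every name here (SaeEvent, SubscriptionRule, RuleMatcher, the channel strings) is a hypothetical placeholder. It assumes Java 16 or later for records.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical event model; the field names are illustrative, not SAE's real schema.
record SaeEvent(String appId, String versionId, String eventType,
                int count, List<String> instanceIds) {}

// Hypothetical user-defined rule: which event types of an app go to which channel.
record SubscriptionRule(String appId, Set<String> eventTypes, String channel) {
    boolean matches(SaeEvent event) {
        return appId.equals(event.appId()) && eventTypes.contains(event.eventType());
    }
}

final class RuleMatcher {
    // Returns the channels (for example "dingtalk", "sms", "email") that should
    // receive the event, in rule order and without duplicates.
    static List<String> channelsFor(SaeEvent event, List<SubscriptionRule> rules) {
        return rules.stream()
                .filter(rule -> rule.matches(event))
                .map(SubscriptionRule::channel)
                .distinct()
                .collect(Collectors.toList());
    }
}
```

A delivery component would then call the sender for each returned channel; that part is omitted because the source gives no detail on the delivery interfaces.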
Technical Challenges
Monitoring vs. Event Center – Monitoring focuses on abnormal metric alerts, while the Event Center abstracts diagnostics for urgent, user‑facing events, enabling one‑click subscription and proactive alerts.
Data Explosion – Elastic instances and frequent version‑switch failures on the Web side generate far more event data than the K8s‑based microservice side, risking storage overload.
Full GC – The original Java implementation loaded all cached events into a single HashMap, causing massive memory usage and frequent full garbage collections under high load.
Solution
A distributed‑cache based aggregation pipeline is introduced to mitigate data explosion:
Define a unique key as appId_versionId_eventType (e.g., app1_version1_scaleUpSuccess).
Event Consumption & Cache Initialization: If the key is absent, initialize the cache entry with count=1 and instanceIds=id1 (or omit the instance ID for failures).
Event Consumption & Cache Update: If the key exists, increment count and aggregate instanceIds (e.g., count=2, instanceIds=id1,id2).
Event Generation: After a time threshold (e.g., 60 seconds), dequeue the cache entry, remove it, and write a standardized SAE event to the event store.
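The steps above can be sketched in Java as follows. This is a minimal single-process illustration, not the SAE implementation: a ConcurrentHashMap stands in for the distributed cache, the flushed event is rendered as a plain string instead of being written to an event store, and the names EventAggregator, consume, and flush are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

final class EventAggregator {
    // One cache entry per appId_versionId_eventType key.
    static final class Entry {
        final long firstSeenMillis;
        int count;
        final Set<String> instanceIds = new LinkedHashSet<>();
        Entry(long firstSeenMillis) { this.firstSeenMillis = firstSeenMillis; }
    }

    // Assumption: a local map stands in for the distributed cache.
    private final Map<String, Entry> cache = new ConcurrentHashMap<>();
    private final long thresholdMillis;

    EventAggregator(long thresholdMillis) { this.thresholdMillis = thresholdMillis; }

    static String key(String appId, String versionId, String eventType) {
        return appId + "_" + versionId + "_" + eventType;
    }

    // Initializes the entry on first sight, otherwise increments the count.
    // instanceId may be null when the raw event carries none.
    void consume(String appId, String versionId, String eventType,
                 String instanceId, long nowMillis) {
        cache.compute(key(appId, versionId, eventType), (k, existing) -> {
            Entry entry = (existing == null) ? new Entry(nowMillis) : existing;
            entry.count++;
            if (instanceId != null) entry.instanceIds.add(instanceId);
            return entry;
        });
    }

    // Removes every entry whose age has reached the threshold and returns
    // one aggregated line per removed entry.
    List<String> flush(long nowMillis) {
        List<String> generated = new ArrayList<>();
        for (String cacheKey : cache.keySet()) {
            cache.computeIfPresent(cacheKey, (name, entry) -> {
                if (nowMillis - entry.firstSeenMillis < thresholdMillis) return entry;
                generated.add(name + " count=" + entry.count
                        + " instanceIds=" + String.join(",", entry.instanceIds));
                return null; // returning null removes the entry
            });
        }
        return generated;
    }
}
```

With a 60-second threshold, consuming two scaleUpSuccess events for app1/version1 from instances id1 and id2 and then calling flush once the threshold has passed yields a single aggregated line instead of two raw events.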
To avoid duplicate writes from multiple instances, a distributed lock is applied to the queue, ensuring only one instance generates the final event.
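The single-writer guarantee can be illustrated with an atomic set-if-absent operation. The source does not say which lock mechanism SAE uses, so the sketch below is an assumption: FlushLock and its methods are hypothetical, and an in-process ConcurrentHashMap stands in for a shared store that every instance would reach over the network.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class FlushLock {
    // Assumption: this map stands in for a shared lock store outside the JVM.
    private final ConcurrentMap<String, String> store = new ConcurrentHashMap<>();

    // Succeeds only for the first caller; putIfAbsent is atomic.
    boolean tryAcquire(String lockName, String ownerId) {
        return store.putIfAbsent(lockName, ownerId) == null;
    }

    // Releases the lock only if it is still held by the given owner.
    void release(String lockName, String ownerId) {
        store.remove(lockName, ownerId);
    }
}
```

Each instance would call tryAcquire before dequeuing and skip the flush when it returns false. A production lock would also need an expiry so that a crashed holder cannot block event generation indefinitely; that is left out here for brevity.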
Optimizations
All event types now follow the enqueue‑dequeue model, reducing backlog.
Cache data is sharded across multiple nodes, shrinking the in‑memory footprint on each node from hundreds of megabytes to a fraction of that and eliminating full GC spikes.
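One simple way to spread aggregation keys across cache nodes is hash-modulo routing, sketched below. The source does not describe how SAE assigns keys to nodes, so this is only an assumed scheme, and ShardRouter is a hypothetical name.

```java
final class ShardRouter {
    private final int shardCount;

    ShardRouter(int shardCount) {
        if (shardCount <= 0) throw new IllegalArgumentException("shardCount must be positive");
        this.shardCount = shardCount;
    }

    // Maps a key such as app1_version1_scaleUpSuccess to a shard index in
    // [0, shardCount). floorMod keeps the result non-negative even when
    // hashCode() is negative.
    int shardFor(String key) {
        return Math.floorMod(key.hashCode(), shardCount);
    }
}
```

The same key always lands on the same shard, so count and instanceIds for one key are never split across nodes. Changing the shard count remaps most keys under this scheme, which is why consistent hashing is a common alternative when nodes are added or removed often.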
Future Outlook
The Event Center will integrate richer diagnostics and AI‑driven analysis, allowing users to pinpoint issues automatically and achieve true one‑click fault localization and simplified operations.
Alibaba Cloud Native
We publish cloud-native tech news, curate in-depth content, host regular events and live streams, and share Alibaba product and user case studies. Join us to explore and share the cloud-native insights you need.