
Stability Monitoring Practices for Double 11 2017

The 2017 Double 11 stability monitoring project introduced a four-layer monitoring architecture (customer & sentiment, business, system water-level, and infrastructure monitoring), together with data archiving and reliability measures for the monitoring system itself, so that issues could be detected, handled, and mitigated far faster than through the earlier manual process.

Alibaba Cloud Infrastructure

Dec 21, 2017

Overview

In daily operations and stability assurance, monitoring is one of the most important ways to discover problems, detect business anomalies, and find gaps in user experience. At the unprecedented scale of Double 11 2017, any business issue would be amplified, so it was crucial to detect, respond to, and mitigate problems quickly and comprehensively before they affected users at scale.

Historically, issues surfaced through a long chain: user feedback went to customer service, passed through several levels of sorting, and only then was reported, resolved, and followed up. This added handling overhead and prolonged the impact on users. For Double 11 2017 we therefore launched a dedicated stability monitoring project covering monitoring deployment, data storage and archiving, and the reliability of the monitoring system itself.

Double 11 Monitoring Deployment

In routine operations, problems come to light through user reports, public sentiment, and monitoring alerts. To respond and contain issues quickly during Double 11, we needed to both detect problems sooner and act on them faster.

Four‑Layer Monitoring Architecture

Customer & Sentiment Monitoring: Enables rapid alerting of user‑reported issues and intelligent sorting of public sentiment, delivering problems to the appropriate handlers to form a closed loop.

Business Monitoring: Detects business anomalies from dimensions such as activity flow, loss‑prevention, and routine operations, and monitors cash flow for transparent global control and decision support.

System Water‑Level Monitoring: Abstracts baseline monitoring standards at the system and dependency layers to provide early warning and pre‑emptive handling before business‑side symptoms appear.

Infrastructure Monitoring: With a cloud‑native architecture, it proactively detects infrastructure anomalies to reduce jitter and impact on business services.

For response, alerts are surfaced through dashboards, large screens, and subscriptions (group, screen, and personal), and are routed by severity to shorten the incident response chain.

For handling, risk-based pre-plans (prepared contingency plans) are linked to alerts, so that mitigation actions are recommended as soon as an incident reaches a responder and can be executed quickly.
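Severity-based routing of this kind can be sketched as a simple lookup. This is a minimal illustration only: the severity levels, the routing table, and the names `Severity`, `Alert`, and `route` are assumptions made for the example, not the project's actual configuration. Only the channel names (dashboard, group, screen, personal) come from the text above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity scale; the article does not define its levels."""
    INFO = 1
    WARNING = 2
    CRITICAL = 3


@dataclass(frozen=True)
class Alert:
    name: str
    severity: Severity


# Assumed routing table: higher severities reach more channels.
ROUTES = {
    Severity.INFO: ["dashboard"],
    Severity.WARNING: ["dashboard", "group"],
    Severity.CRITICAL: ["dashboard", "group", "screen", "personal"],
}


def route(alert: Alert) -> list[str]:
    """Return the channels that should receive this alert."""
    return list(ROUTES[alert.severity])
```

Keeping the mapping in one table means escalation rules can be reviewed and changed without touching the code that emits alerts.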
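One way to picture the link between alerts and pre-plans is a table keyed by risk type. The sketch below is hypothetical: the risk names, the plan steps, and the fallback are invented for illustration and do not come from the article.

```python
# Hypothetical mapping from a risk type to its prepared mitigation steps.
PREPLANS = {
    "db_slow": ["switch reads to standby", "enable rate limiting"],
    "cache_miss_spike": ["warm the cache", "degrade non-core features"],
}

# Assumed fallback when no pre-plan has been registered for a risk.
FALLBACK = ["escalate to on-call for manual triage"]


def recommend(risk: str) -> list[str]:
    """Return the mitigation steps linked to a risk, or the fallback."""
    return list(PREPLANS.get(risk, FALLBACK))
```

With such a table, a responder who receives a `db_slow` alert sees the prepared steps immediately instead of searching for a runbook.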

Customer & Sentiment Monitoring

Customer and sentiment channels convey user‑perceived service degradation to the provider. In a massive event like Double 11, any issue can quickly magnify; thus, rapid transmission to backend investigators and swift mitigation are essential.

Previous Process: User feedback → offline filtering → business team investigation → manual feedback → management platform logging.

This manual flow suffered from high labor cost, subjective prioritization, and delayed transmission, making it unsuitable for high‑volume scenarios.

For Double 11 2017, we used a “robotic factory” that applies algorithms to automatically sort, cluster, and route complaints and public sentiment.
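The article does not describe the algorithms involved. As a deliberately simple stand-in, the sketch below groups complaints by keyword and routes each group to a team, with the largest group first. The keyword table, the team names, and the functions `route_complaint` and `cluster` are all assumptions for illustration; a production system would likely use text classification or clustering models instead.

```python
from collections import defaultdict

# Hypothetical keyword-to-team table (not from the article).
KEYWORDS = {
    "payment": "payments-team",
    "refund": "payments-team",
    "coupon": "promotions-team",
    "delivery": "logistics-team",
}
DEFAULT_TEAM = "triage-team"


def route_complaint(text: str) -> str:
    """Pick a team from the first known keyword found in the complaint."""
    words = text.lower().split()
    for keyword, team in KEYWORDS.items():
        if keyword in words:
            return team
    return DEFAULT_TEAM


def cluster(complaints: list[str]) -> list[tuple[str, list[str]]]:
    """Group complaints by team and sort groups with the largest first."""
    groups: dict[str, list[str]] = defaultdict(list)
    for text in complaints:
        groups[route_complaint(text)].append(text)
    return sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True)
```

Sorting by group size puts the most widely reported problem at the top, which is the point of replacing subjective manual prioritization.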

Business Monitoring

Business monitoring observes the health of actual services from the perspective of business impact. During Double 11 we covered dimensions such as activity flow, loss‑prevention, and routine business monitoring.

Monitoring Dimensions

Activity Flow Monitoring – tracks availability across activity chain nodes (issuance, consumption, redemption).

Loss-Prevention Monitoring – checks data consistency, timeouts, and reconciliation results to detect potential financial loss.

Routine Business Monitoring – follows core user‑facing steps (order, payment, logistics, reverse flow).
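A reconciliation check of the kind used in loss-prevention monitoring can be sketched as a comparison of two ledgers. The example assumes that orders and payments are both keyed by order ID with amounts in cents; these data shapes and the function name `reconcile` are illustrative assumptions, not the project's actual interfaces.

```python
def reconcile(orders: dict[str, int], payments: dict[str, int]) -> dict[str, list[str]]:
    """Compare order amounts with payment amounts, both in cents.

    Returns the order IDs in each discrepancy class, sorted for stable output.
    """
    order_ids, payment_ids = set(orders), set(payments)
    return {
        # Orders that were never paid.
        "missing_payment": sorted(order_ids - payment_ids),
        # Payments that match no known order.
        "orphan_payment": sorted(payment_ids - order_ids),
        # Both sides exist but the amounts differ.
        "amount_mismatch": sorted(
            oid for oid in order_ids & payment_ids if orders[oid] != payments[oid]
        ),
    }
```

Any non-empty list in the result is a candidate financial-loss alert; integer cents avoid floating-point rounding in the comparison.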

Monitoring Coverage

We performed a comprehensive architecture‑based mapping of all core business nodes to ensure coverage, then instrumented and deployed the necessary probes.

Coverage criteria included whether a node could block business flow, affect user‑facing services, degrade user experience, or cause user loss.

Monitoring Quality Control

Given the massive scale, we focused on monitoring effectiveness and alarm fatigue. We evaluated two dimensions:

Effectiveness Validation: Does the monitor have data? Is an alarm configured? Is the alarm subscribed? Is the data used by dashboards or other systems? Have architecture, call‑chains, or logs changed? Can it reflect current anomalies?

Accuracy Validation: Alarm response rate, daily alarm volume vs. true issues, false‑positive rate, and detection rate during drills.
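The accuracy metrics above reduce to a few ratios. The sketch below shows one plausible set of definitions. The article does not give its formulas, so these are assumptions: in particular, "false-positive rate" is taken here to mean the share of fired alarms that did not correspond to a true issue.

```python
def ratio(numerator: int, denominator: int) -> float:
    """Divide safely, returning 0.0 when there is nothing to measure."""
    return numerator / denominator if denominator else 0.0


def alarm_quality(
    fired: int,            # alarms raised in the period
    acknowledged: int,     # alarms someone responded to
    true_issues: int,      # alarms that matched a real problem
    drill_injected: int,   # faults injected during drills
    drill_detected: int,   # injected faults that triggered an alarm
) -> dict[str, float]:
    """Compute assumed definitions of the accuracy metrics."""
    return {
        "response_rate": ratio(acknowledged, fired),
        "false_positive_rate": ratio(fired - true_issues, fired),
        "drill_detection_rate": ratio(drill_detected, drill_injected),
    }
```

A low response rate alongside a high false-positive rate is a typical signature of alarm fatigue, which is why the two are worth tracking together.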

System Water‑Level Monitoring

This baseline monitoring captures system-level metrics (CPU, memory, disk, network retransmission, GC counts, service success rates, and Tair/TDDL read-write success rates) to compute a health score for each application and rank applications by it. During Double 11, water-level alerts surfaced failures on average more than 60 minutes before user reports and 3 to 6 minutes before business alerts.
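The article does not say how the health score is calculated. As a rough sketch, a weighted sum over normalized metrics is one common approach. The weights, the choice of four metrics, and the 0 to 100 scale below are assumptions for illustration only.

```python
# Hypothetical weights; they sum to 1.0 so the score falls between 0 and 100.
WEIGHTS = {"cpu": 0.3, "memory": 0.2, "disk": 0.1, "success_rate": 0.4}


def health_score(metrics: dict[str, float]) -> float:
    """Score an application from 0 (worst) to 100 (best).

    cpu, memory, and disk are utilizations in [0, 1], where higher is worse.
    success_rate is in [0, 1], where higher is better.
    """
    score = (
        WEIGHTS["cpu"] * (1 - metrics["cpu"])
        + WEIGHTS["memory"] * (1 - metrics["memory"])
        + WEIGHTS["disk"] * (1 - metrics["disk"])
        + WEIGHTS["success_rate"] * metrics["success_rate"]
    )
    return round(100 * score, 1)


def rank(apps: dict[str, dict[str, float]]) -> list[str]:
    """Return application names ordered from least to most healthy."""
    return sorted(apps, key=lambda name: health_score(apps[name]))
```

Ranking the least healthy applications first gives responders a short list to inspect before any business-side symptom appears.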

Infrastructure Monitoring

Infrastructure monitoring tracks the health of underlying dependencies (ECS, SLB, VPC, OSS, CDN) and cloud‑native services across business units, providing multi‑dimensional aggregation to surface problems early.
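Multi-dimensional aggregation can be illustrated by counting anomaly events along any chosen set of dimensions. The event fields (`product`, `business_unit`, `region`) and the function name `aggregate` are hypothetical; the article names the products but not the data model.

```python
from collections import Counter


def aggregate(events: list[dict[str, str]], dims: tuple[str, ...]) -> Counter:
    """Count anomaly events grouped by the given dimensions."""
    return Counter(tuple(event[d] for d in dims) for event in events)


# Hypothetical anomaly events for illustration.
events = [
    {"product": "ECS", "business_unit": "trade", "region": "east"},
    {"product": "ECS", "business_unit": "trade", "region": "north"},
    {"product": "SLB", "business_unit": "search", "region": "east"},
]

by_product = aggregate(events, ("product",))
by_product_and_unit = aggregate(events, ("product", "business_unit"))
```

Viewing the same events by product and then by product and business unit helps separate a fault in one shared dependency from a problem local to a single team.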

Data Storage & Archiving

All monitoring data from end‑to‑end stress testing to the Double 11 day were stored offline for post‑event review and future reference.

Monitoring System Reliability Assurance

During the promotion, traffic and log volume far exceeded normal levels, requiring guarantees for data‑interface stability and real‑time processing. We performed full‑link stress testing, prioritized core monitoring items, and applied several reliability measures:

Stress-test verification: Deploy a copy of the monitoring configuration in a test environment and run full-link traffic against it to expose stability issues.

Network‑cut drills: Simulate outages for core dependencies to validate disaster‑recovery capabilities.

Real‑time improvements: Refactor architecture to reduce data latency from 20 s to 10 s for large‑screen displays.

Log optimization: Replace large logs with smaller ones to reduce disk consumption.

Alert degradation strategy: Rank alerts by importance and health score, ensuring core alerts remain available under pressure.

Garbage‑monitor cleanup: Reclaim resources occupied by obsolete monitors.
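The alert degradation strategy in the list above can be sketched as a priority cut: when the monitoring system is under pressure, only the highest-priority alert rules keep running. The importance scale, the tie-breaking order, and the `capacity` parameter are assumptions for illustration; the article only states that alerts are ranked by importance and health score.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRule:
    name: str
    importance: int       # hypothetical scale: larger means more important
    health_score: float   # 0 to 100 for the monitored app; lower is unhealthier


def degrade(rules: list[AlertRule], capacity: int) -> list[AlertRule]:
    """Keep the `capacity` highest-priority rules.

    Rules are ordered by importance first; among equally important rules,
    those watching less healthy applications are kept first.
    """
    ranked = sorted(rules, key=lambda r: (-r.importance, r.health_score))
    return ranked[:capacity]
```

Everything past the cut is a candidate for pausing or sampling until load returns to normal.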

Conclusion

When a business system grows large, its complexity outpaces the capabilities of off‑the‑shelf open‑source monitoring solutions and single‑logic approaches. During Double 11, any minor stability issue could be magnified. Our comprehensive stability monitoring initiative, spanning front‑end user experience to back‑end technical layers, helped ensure smooth overall business operation.
