How Top IT Ops Teams Ensure Seamless Large-Scale Business Events
This article outlines how Ping An’s IT operations team systematically prepares for high‑traffic business events—detailing service assessment, architecture mapping, configuration audits, monitoring design, capacity planning, stress testing, and coordinated incident response—to guarantee reliability and performance under massive concurrent loads.
01 Identify IT Service Status and Risk Points
Knowing the current service status and potential risks is essential for planning large‑scale business activities. The IT team must comprehensively review services, align with core business processes, and create targeted emergency plans.
Activity Scenario Mapping
Understanding the activity scenario is the first step. Different formats—flash sales, free insurance giveaways, office sign‑ins—exhibit distinct user behaviors and system performance requirements.
For example, flash sales concentrate login attempts at a specific time and target a single URL; free insurance giveaways see sustained logins with spikes during promotional periods; office sign‑ins have peak windows with concentrated load.
Live streaming events during Chinese New Year combine office sign‑in characteristics with high‑peak login, stream ingress/egress, and additional features such as gifting and IM, each of which must be assigned a protection level.
After mapping scenarios, deliverables include activity type, maximum concurrent users and requests, core business scenes, and activity time windows.
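The scenario deliverables can be captured in a small structured record so that later steps (capacity planning, stress testing) read from one source. The following is a minimal sketch, not the team's actual tooling: the class name, field names, and all numbers are illustrative assumptions.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ActivityProfile:
    """Scenario-mapping deliverables for one event (field names are illustrative)."""
    activity_type: str            # e.g. "flash_sale", "office_sign_in", "live_stream"
    max_concurrent_users: int     # expected peak of simultaneous users
    max_requests_per_second: int  # expected peak request rate
    core_scenes: tuple[str, ...]  # business scenes that must stay available
    window_start: datetime
    window_end: datetime

    def validate(self) -> list[str]:
        """Return a list of problems; an empty list means the profile is complete."""
        problems = []
        if self.max_concurrent_users <= 0:
            problems.append("max_concurrent_users must be positive")
        if self.max_requests_per_second <= 0:
            problems.append("max_requests_per_second must be positive")
        if not self.core_scenes:
            problems.append("at least one core scene is required")
        if self.window_end <= self.window_start:
            problems.append("window_end must be after window_start")
        return problems


# Placeholder values, not figures from the article.
profile = ActivityProfile(
    activity_type="flash_sale",
    max_concurrent_users=200_000,
    max_requests_per_second=50_000,
    core_scenes=("login", "place_order"),
    window_start=datetime(2024, 1, 1, 10, 0),
    window_end=datetime(2024, 1, 1, 10, 30),
)
print(profile.validate())
```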
Application Architecture Mapping
With scenario and protection level defined, map component upstream/downstream relationships and document the application architecture.
When designing architecture, ensure it matches the activity type—for flash sales, avoid high‑frequency relational DB operations; use asynchronous messaging, caching, etc., to handle front‑end spikes.
Distinguish “critical” data flows that must never be degraded from “value‑added” flows that can be throttled or disabled during incidents.
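One way to make the critical versus value-added distinction actionable is to record it as data that incident tooling can query. This sketch assumes a live-streaming event; the flow names and the `FlowClass` enum are hypothetical.

```python
from enum import Enum


class FlowClass(Enum):
    CRITICAL = "critical"        # must never be degraded
    VALUE_ADDED = "value_added"  # may be throttled or disabled during incidents


# Hypothetical classification for a live-streaming event.
FLOWS = {
    "room_entry": FlowClass.CRITICAL,
    "stream_pull": FlowClass.CRITICAL,
    "gifting": FlowClass.VALUE_ADDED,
    "chat": FlowClass.VALUE_ADDED,
}


def degradable_flows(flows: dict[str, FlowClass]) -> list[str]:
    """Return the flows that may be throttled or switched off, sorted by name."""
    return sorted(name for name, cls in flows.items() if cls is FlowClass.VALUE_ADDED)


print(degradable_flows(FLOWS))
```

Keeping the classification in one place means the emergency plan and the monitoring dashboards can both be derived from the same list.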
Do not overlook network layer and data‑center resources such as ingress traffic, CDN, multi‑active setups, and cloud platform capacities.
Deliverables include architecture diagrams, core data flows, lists of critical and non‑critical services, component inventories, and underlying network/cloud/multi‑active strategies.
Configuration Information Mapping
After architecture mapping, inventory configuration details of each component—cluster counts, call strategies, JVM settings (threads, memory), as well as network bandwidth, VM I/O limits, and firewall policies.
A – Verify log files and output levels (nginx, Tomcat, SDK logs) and ensure logs can be collected afterwards for troubleshooting.
B – Document core URLs and functional points for request analysis.
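A configuration audit of this kind can be partly automated by comparing the collected inventory against an agreed baseline. The sketch below assumes the inventory has already been flattened into key/value pairs; the key names and values are made up for illustration and do not correspond to any specific tool's output.

```python
def audit_config(baseline: dict[str, object], actual: dict[str, object]) -> list[str]:
    """Compare an inventoried configuration against the agreed baseline.

    Returns one finding per key that is missing or differs from the baseline.
    """
    findings = []
    for key, expected in baseline.items():
        if key not in actual:
            findings.append(f"{key}: missing (expected {expected!r})")
        elif actual[key] != expected:
            findings.append(f"{key}: expected {expected!r}, found {actual[key]!r}")
    return findings


# Hypothetical keys and values.
baseline = {"tomcat.maxThreads": 800, "nginx.error_log_level": "warn", "jvm.Xmx": "4g"}
actual = {"tomcat.maxThreads": 200, "nginx.error_log_level": "warn"}

for finding in audit_config(baseline, actual):
    print(finding)
```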
Monitoring Plan Review and Deployment
Inspect monitoring configurations for all explicit and implicit configuration items (CIs), covering user experience, service chain, infrastructure, and business trends, focusing on availability, health, and efficiency. Special monitoring includes:
A – Application anomaly keyword monitoring.
B – Non‑standard components requiring custom health‑check interfaces or logs.
C – Third‑party service availability, capacity, and snapshot/error code capture during incidents.
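Anomaly keyword monitoring can be as simple as counting known failure strings in application logs. The sketch below is an assumption-laden illustration: the keyword list and the sample log lines are invented, and a real deployment would feed these counts into whatever alerting system is already in place.

```python
import re
from collections import Counter

# Illustrative keyword list; tune it to the application's own error vocabulary.
ANOMALY_PATTERN = re.compile(r"OutOfMemoryError|Connection refused|timed out", re.IGNORECASE)


def count_anomalies(lines) -> Counter:
    """Count occurrences of each anomaly keyword (lower-cased) across log lines."""
    counts = Counter()
    for line in lines:
        for match in ANOMALY_PATTERN.finditer(line):
            counts[match.group(0).lower()] += 1
    return counts


sample_log = [
    "2024-01-01 10:00:01 ERROR java.lang.OutOfMemoryError: Java heap space",
    "2024-01-01 10:00:02 WARN upstream timed out while reading response",
    "2024-01-01 10:00:03 INFO request ok",
    "2024-01-01 10:00:04 ERROR Read timed out",
]
print(count_anomalies(sample_log))
```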
Capacity Assessment and Scaling Plan
Based on current CI metrics and expected concurrency, draft an initial scaling plan, considering application instances, network ingress, CDN, database, storage, signaling, and configuration limits (max connections, file handles).
Downstream capacity must exceed upstream to avoid bottlenecks. Scaling rules should tie service degradation levels to resource thresholds (e.g., CPU > 95%).
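The two rules above lend themselves to a quick back-of-the-envelope check. This sketch assumes each component's capacity can be expressed as requests per second and that every request reaches every downstream component, which overstates downstream load when caching is in use. The function names, the 30% headroom, and all numbers are illustrative.

```python
import math


def instances_needed(peak_rps: float, rps_per_instance: float, headroom: float = 0.3) -> int:
    """Instances required to serve peak_rps with a safety margin (headroom)."""
    return math.ceil(peak_rps * (1 + headroom) / rps_per_instance)


def find_bottlenecks(chain: list[tuple[str, float]]) -> list[str]:
    """chain lists (component, capacity_rps) from upstream to downstream.

    Returns every component whose capacity is below that of its direct caller.
    """
    bottlenecks = []
    for (_, up_cap), (down_name, down_cap) in zip(chain, chain[1:]):
        if down_cap < up_cap:
            bottlenecks.append(down_name)
    return bottlenecks


# Placeholder figures.
print(instances_needed(peak_rps=50_000, rps_per_instance=1_200))
print(find_bottlenecks([("gateway", 60_000), ("app", 66_000), ("db", 40_000)]))
```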
Production Stress Testing and Monitoring Analysis
Validate scaling effects with progressive load tests, analyze reports, and identify remaining bottlenecks. After scaling a component, re‑run regression tests to ensure downstream services remain healthy. Not all performance issues require scaling; code or architectural optimizations may suffice.
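When analyzing a progressive load test, a useful first question is at which step throughput stopped growing or errors started to climb. The following sketch assumes the test report has been reduced to (concurrency, throughput, error rate) per step; the 5% minimum gain and 1% error ceiling are arbitrary assumptions, as is the sample data.

```python
def find_saturation_step(steps, min_gain=0.05, max_error_rate=0.01):
    """Return the concurrency of the last healthy step, or None if none is healthy.

    steps: (concurrency, throughput_rps, error_rate) tuples in increasing concurrency.
    A step is unhealthy if its error rate exceeds max_error_rate or its throughput
    grew by less than min_gain relative to the previous step.
    """
    last_healthy = None
    prev_throughput = None
    for concurrency, throughput, error_rate in steps:
        if error_rate > max_error_rate:
            break
        if prev_throughput is not None and throughput < prev_throughput * (1 + min_gain):
            break
        last_healthy = concurrency
        prev_throughput = throughput
    return last_healthy


# Invented results from a five-step ramp.
report = [
    (100, 1000, 0.0),
    (200, 1950, 0.0),
    (400, 3600, 0.001),
    (800, 3700, 0.004),   # throughput flattens here
    (1600, 3500, 0.08),
]
print(find_saturation_step(report))
```

The step after the returned one is where to look for the bottleneck, whether the fix turns out to be scaling or a code change.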
Top Slow SQL / Interface Analysis
Identify and optimize top slow SQL statements and APIs, using execution plans, caching, or query refactoring to improve service efficiency.
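To find the top slow statements, it helps to group executions that differ only in literal values and rank the groups by total time. This is a rough sketch under stated assumptions: the input is a list of (SQL text, duration in ms) pairs already extracted from a slow-query log, and the regex-based normalization is deliberately simple and will not handle every SQL dialect.

```python
import re
from collections import defaultdict


def normalize(sql: str) -> str:
    """Replace literals with '?' so statements differing only in values group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # standalone numeric literals
    return re.sub(r"\s+", " ", sql).strip().lower()


def top_slow_sql(entries, n=3):
    """Return up to n (fingerprint, total_ms, count) tuples, slowest total first."""
    totals = defaultdict(lambda: [0.0, 0])
    for sql, duration_ms in entries:
        bucket = totals[normalize(sql)]
        bucket[0] += duration_ms
        bucket[1] += 1
    ranked = sorted(
        ((fp, total, count) for fp, (total, count) in totals.items()),
        key=lambda row: row[1],
        reverse=True,
    )
    return ranked[:n]


# Invented sample entries.
entries = [
    ("SELECT * FROM orders WHERE id = 42", 120.0),
    ("SELECT * FROM orders WHERE id = 7", 180.0),
    ("SELECT name FROM users WHERE email = 'a@b.c'", 90.0),
]
for row in top_slow_sql(entries):
    print(row)
```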
Emergency Plan Organization
Emergency plans are split into business‑level measures (service degradation, feature toggles, load shedding) and IT‑component‑level measures (restart, failover, feature switches). For large‑scale incidents, consider rate limiting, multi‑active switching, or running with partial functionality, agreed with the business in advance.
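Rate limiting is often implemented with a token bucket. The article does not say which algorithm the team uses, so the following is a generic sketch of that technique, with the class name and parameters chosen for illustration. The clock is injectable so the behavior can be checked without waiting.

```python
import time


class TokenBucket:
    """Allow up to `rate` requests per second on average, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)  # start full so an initial burst is allowed
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        """Refill tokens for the elapsed time, then spend one token if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Usage with a fake clock: 2 requests/second, burst of 2.
fake_now = [0.0]
bucket = TokenBucket(rate=2, capacity=2, clock=lambda: fake_now[0])
first_three = [bucket.allow() for _ in range(3)]
fake_now[0] = 0.5  # half a second later, one token has been refilled
after_wait = bucket.allow()
print(first_three, after_wait)
```

Requests for which `allow()` returns False would be rejected or queued, depending on what was agreed with the business beforehand.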
02 Ensure Continuous Monitoring, No Detail Overlooked
Capture Peak Details and Front‑Line Business Insights
Record real‑time observations during traffic spikes—resource stability, performance fluctuations, error rates, network traffic, and connection limits—using monitoring data, logs, and packet captures to reconstruct events.
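When reconstructing a spike from access logs, a per-second request count is a common starting point. This sketch assumes each request has been reduced to a timestamp string beginning with `YYYY-MM-DD HH:MM:SS`; that format and the sample data are assumptions, not details from the original.

```python
from collections import Counter


def peak_second(timestamps):
    """Return (second, request_count) for the busiest one-second bucket.

    Each timestamp must start with 'YYYY-MM-DD HH:MM:SS'; any fractional part is ignored.
    """
    counts = Counter(ts[:19] for ts in timestamps)
    return counts.most_common(1)[0]


# Invented sample timestamps.
requests = [
    "2024-01-01 10:00:00.120",
    "2024-01-01 10:00:01.003",
    "2024-01-01 10:00:01.450",
    "2024-01-01 10:00:01.980",
    "2024-01-01 10:00:02.310",
]
print(peak_second(requests))
```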
Also analyze user behavior during peaks; prioritize core functions while isolating non‑essential features.
In live streaming, core functions are room entry/exit and stream push/pull; auxiliary features like gifting or chat can be disabled under capacity pressure.
Comprehensive Post‑Event Review
After an event, conduct a full review of business, application, and platform metrics: analyze monitoring data, top URLs and SQL statements, service chain performance, and platform bandwidth and connection usage, then identify anomalies and remediate them through architecture adjustments, scaling, or code optimization.
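For the top-URL part of the review, tail latency per URL is usually more revealing than the average. The sketch below computes a 95th percentile with the nearest-rank method; it assumes records of (URL, latency in ms) have already been extracted, and the sample data is synthetic.

```python
from collections import defaultdict


def p95_by_url(records):
    """Return {url: 95th-percentile latency in ms} using the nearest-rank method."""
    by_url = defaultdict(list)
    for url, latency_ms in records:
        by_url[url].append(latency_ms)
    result = {}
    for url, values in by_url.items():
        values.sort()
        rank = (95 * len(values) + 99) // 100  # integer form of ceil(0.95 * n)
        result[url] = values[rank - 1]
    return result


# Synthetic sample: /login latencies are 1..20 ms, /gift has a single 40 ms request.
records = [("/login", ms) for ms in range(1, 21)] + [("/gift", 40)]
print(p95_by_url(records))
```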
03 Strong Organizational Coordination Is Key
Unified Teamwork and Clear Responsibilities
Effective coordination among operations, business, product, architecture, development, testing, platform, and infrastructure teams—and external vendors—is essential for smooth activity support.
Vendor Communication
Engage vendors (CDN, Oracle, network equipment) early; their low‑level expertise can greatly enhance optimization and issue resolution.
Production Change Review and Transparency
All production changes must undergo formal review, be recorded, and remain transparent to enable rapid root‑cause analysis when issues arise.
Conclusion
Large‑scale business activity assurance demands vigilant information gathering, analysis, and emergency response capabilities, as well as continuous innovation and experience accumulation from each event.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.