
How Top IT Ops Teams Ensure Seamless Large-Scale Business Events

This article outlines how Ping An’s IT operations team systematically prepares for high‑traffic business events—detailing service assessment, architecture mapping, configuration audits, monitoring design, capacity planning, stress testing, and coordinated incident response—to guarantee reliability and performance under massive concurrent loads.

Efficient Ops

Feb 17, 2020


01 Identify IT Service Status and Risk Points

Knowing the current service status and potential risks is essential for planning large‑scale business activities. The IT team must comprehensively review services, align with core business processes, and create targeted emergency plans.

Activity Scenario Mapping

Understanding the activity scenario is the first step. Different formats—flash sales, free insurance giveaways, office sign‑ins—exhibit distinct user behaviors and system performance requirements.

For example, flash sales concentrate login attempts at a specific time and target a single URL; free insurance giveaways see sustained logins with spikes during promotional periods; office sign‑ins have peak windows with concentrated load.

Live streaming events during Chinese New Year combine office sign-in characteristics with high-peak logins, stream ingress/egress, and additional features such as gifting and IM. Each of these functions must be assigned a protection level.

After mapping scenarios, the deliverables are the activity type, maximum concurrent users and requests, core business scenarios, and activity time windows.
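The scenario deliverables can be kept as one structured record per activity so that later steps (capacity planning, stress testing) read from the same source. The sketch below is only an illustration: the class name `ActivityProfile`, its field names, and all numbers are assumptions, not part of the original process.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class ActivityProfile:
    """One record per activity, mirroring the scenario-mapping deliverables."""
    activity_type: str              # e.g. "flash_sale", "live_stream"
    max_concurrent_users: int
    max_requests_per_second: int
    core_scenarios: tuple           # business scenarios that must stay available
    window_start: datetime
    window_end: datetime

    def window_minutes(self) -> float:
        """Length of the activity time window in minutes."""
        return (self.window_end - self.window_start).total_seconds() / 60


# Illustrative values only; they do not come from the article.
profile = ActivityProfile(
    activity_type="flash_sale",
    max_concurrent_users=200_000,
    max_requests_per_second=50_000,
    core_scenarios=("login", "place_order"),
    window_start=datetime(2020, 1, 24, 20, 0),
    window_end=datetime(2020, 1, 24, 20, 30),
)
print(profile.activity_type, profile.window_minutes())
```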

Application Architecture Mapping

With scenario and protection level defined, map component upstream/downstream relationships and document the application architecture.

When designing architecture, ensure it matches the activity type—for flash sales, avoid high‑frequency relational DB operations; use asynchronous messaging, caching, etc., to handle front‑end spikes.

Distinguish “critical” data flows that must never be degraded from “value‑added” flows that can be throttled or disabled during incidents.

Do not overlook network layer and data‑center resources such as ingress traffic, CDN, multi‑active setups, and cloud platform capacities.

Deliverables include architecture diagrams, core data flows, lists of critical and non‑critical services, component inventories, and underlying network/cloud/multi‑active strategies.
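One way to derive the critical and non-critical service lists from the architecture map is to walk the call graph from the entry points of the core flows. The following is a minimal sketch under stated assumptions: the component names and the `CALLS` graph are hypothetical, and a real inventory would come from the CMDB or tracing data rather than a hand-written dictionary.

```python
from collections import deque

# Hypothetical call graph: each key calls the components listed as its value.
CALLS = {
    "room-service": ["stream-service", "user-db"],
    "gift-service": ["payment-service", "user-db"],
    "im-service": ["message-queue"],
    "stream-service": ["cdn"],
    "payment-service": [],
    "user-db": [],
    "message-queue": [],
    "cdn": [],
}


def reachable(graph, roots):
    """Breadth-first walk returning every component reachable from the roots."""
    seen = set()
    queue = deque(roots)
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        queue.extend(graph.get(node, []))
    return seen


# Everything a core flow depends on is critical, including shared components.
critical = reachable(CALLS, ["room-service"])
value_added = set(CALLS) - critical

print(sorted(critical))
print(sorted(value_added))
```

Note that `user-db` ends up in the critical set even though the gifting flow also uses it: a shared component inherits the highest protection level among its callers.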

Configuration Information Mapping

After architecture mapping, inventory configuration details of each component—cluster counts, call strategies, JVM settings (threads, memory), as well as network bandwidth, VM I/O limits, and firewall policies.

A – Verify log files and output levels (nginx, Tomcat, SDK logs) and confirm that logs can be collected and retrieved for troubleshooting.

B – Document core URLs and functional points for request analysis.
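Once the configuration inventory exists, a simple audit is to look for nodes whose settings differ from the rest of their cluster. The sketch below uses a majority vote per setting, which is an assumption of this example rather than a method described in the article; the node names and setting keys are hypothetical.

```python
from collections import Counter


def find_drift(nodes):
    """Return {node: {setting: (actual, majority)}} for values that differ
    from the most common value of that setting across the cluster."""
    keys = {key for cfg in nodes.values() for key in cfg}
    drift = {}
    for key in sorted(keys):
        values = Counter(cfg.get(key, "<missing>") for cfg in nodes.values())
        majority, _ = values.most_common(1)[0]
        for name, cfg in nodes.items():
            actual = cfg.get(key, "<missing>")
            if actual != majority:
                drift.setdefault(name, {})[key] = (actual, majority)
    return drift


# Hypothetical inventory of one application cluster.
cluster = {
    "app-01": {"jvm_xmx": "4g", "max_threads": "400", "log_level": "INFO"},
    "app-02": {"jvm_xmx": "4g", "max_threads": "200", "log_level": "INFO"},
    "app-03": {"jvm_xmx": "4g", "max_threads": "400", "log_level": "DEBUG"},
}
drift = find_drift(cluster)
print(drift)
```

A majority vote only detects inconsistency inside a cluster; it does not tell you whether the majority value itself is adequate for the expected load.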

Monitoring Plan Review and Deployment

Inspect monitoring configurations for all explicit and implicit configuration items (CIs), covering user experience, service chain, infrastructure, and business trends, with a focus on availability, health, and efficiency. Special monitoring includes:

A – Anomaly keyword monitoring in application logs.

B – Non-standard components, which need custom health-check interfaces or log-based checks.

C – Third-party services: availability and capacity, plus snapshots and error codes captured during incidents.
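Anomaly keyword monitoring can be prototyped as a scan over log lines that counts how often each keyword appears. This is a minimal sketch: the keyword list and the sample log lines are assumptions chosen for illustration, and a production setup would normally run inside the log pipeline rather than as a standalone script.

```python
import re
from collections import Counter

# Hypothetical keyword list; tune it to the components in the inventory.
KEYWORDS = ["OutOfMemoryError", "Connection refused", "Too many open files", "timed out"]
CANONICAL = {k.lower(): k for k in KEYWORDS}
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def count_anomalies(lines):
    """Count case-insensitive keyword hits, reported under the canonical spelling."""
    counts = Counter()
    for line in lines:
        for match in PATTERN.finditer(line):
            counts[CANONICAL[match.group(0).lower()]] += 1
    return counts


sample = [
    "2020-01-24 20:00:03 ERROR java.lang.OutOfMemoryError: Java heap space",
    "2020-01-24 20:00:04 WARN upstream timed out while reading response",
    "2020-01-24 20:00:05 INFO request served in 12 ms",
    "2020-01-24 20:00:06 ERROR connect() failed: Connection refused",
    "2020-01-24 20:00:07 WARN upstream timed out while connecting",
]
anomalies = count_anomalies(sample)
print(anomalies)
```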

Capacity Assessment and Scaling Plan

Based on current CI metrics and expected concurrency, draft an initial scaling plan, considering application instances, network ingress, CDN, database, storage, signaling, and configuration limits (max connections, file handles).

Each downstream component must have more capacity than the upstream component feeding it; otherwise it becomes the bottleneck. Scaling rules should tie service degradation levels to resource thresholds (e.g., CPU > 95%).
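The arithmetic behind an initial scaling plan can be sketched in a few lines. The example assumes that each instance sustains a known request rate, that instances are sized to a target utilization (0.7 here, an assumed value), and that every request passes through each tier exactly once; the tier names and numbers are hypothetical.

```python
import math


def instances_needed(peak_rps, per_instance_rps, target_utilization=0.7):
    """Instances required so that peak load uses only the target share of capacity."""
    usable = per_instance_rps * target_utilization
    return math.ceil(peak_rps / usable)


def undersized_stages(chain):
    """Given (name, capacity_rps) pairs ordered from ingress to storage, return the
    stages whose capacity does not exceed that of the stage directly upstream."""
    return [
        down_name
        for (_, up_cap), (down_name, down_cap) in zip(chain, chain[1:])
        if down_cap <= up_cap
    ]


# Illustrative numbers only.
app_instances = instances_needed(peak_rps=50_000, per_instance_rps=800)
chain = [("cdn_ingress", 80_000), ("gateway", 90_000), ("app", 72_000), ("database", 100_000)]
weak_points = undersized_stages(chain)
print(app_instances, weak_points)
```

With fan-out (one upstream request producing several downstream calls), the comparison would need per-edge multipliers, which this sketch leaves out.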

Production Stress Testing and Monitoring Analysis

Validate scaling effects with progressive load tests, analyze reports, and identify remaining bottlenecks. After scaling a component, re‑run regression tests to ensure downstream services remain healthy. Not all performance issues require scaling; code or architectural optimizations may suffice.
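A progressive load test produces one result row per load step, and the first question is usually which step was the last one still within the service-level targets. The sketch below assumes a p95 latency limit of 500 ms and an error-rate limit of 1%; both thresholds, the `Step` fields, and the sample results are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Step:
    concurrency: int
    throughput_rps: float
    p95_ms: float
    error_rate: float


def last_healthy_step(steps, max_p95_ms=500.0, max_error_rate=0.01):
    """Walk the steps in order of increasing load and return the last one that
    met both limits before the first violation (None if the first step fails)."""
    healthy = None
    for step in sorted(steps, key=lambda s: s.concurrency):
        if step.p95_ms > max_p95_ms or step.error_rate > max_error_rate:
            break
        healthy = step
    return healthy


# Illustrative results: throughput flattens and latency jumps at the last step.
results = [
    Step(1000, 950.0, 120.0, 0.000),
    Step(2000, 1900.0, 150.0, 0.001),
    Step(4000, 3600.0, 310.0, 0.004),
    Step(8000, 3900.0, 1400.0, 0.060),
]
ceiling = last_healthy_step(results)
print(ceiling)
```

In this sample, doubling concurrency from 4000 to 8000 adds little throughput while latency and errors rise sharply, which points to a bottleneck between those two steps.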

Top Slow SQL / Interface Analysis

Identify and optimize top slow SQL statements and APIs, using execution plans, caching, or query refactoring to improve service efficiency.
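To rank slow statements, it helps to group queries that differ only in their literal values and sort the groups by total time. The following is a rough sketch: the input format (pairs of SQL text and elapsed seconds) is an assumption, and the regex-based fingerprint is a simplification that does not handle every SQL dialect or quoting rule.

```python
import re
from collections import defaultdict


def fingerprint(sql):
    """Replace string and numeric literals with '?' and normalize whitespace and case."""
    sql = re.sub(r"'[^']*'", "?", sql)
    sql = re.sub(r"\b\d+(\.\d+)?\b", "?", sql)
    return re.sub(r"\s+", " ", sql).strip().lower()


def top_slow(records, n=3):
    """Return up to n (fingerprint, count, total_seconds) tuples, slowest first."""
    totals = defaultdict(lambda: [0, 0.0])
    for sql, seconds in records:
        entry = totals[fingerprint(sql)]
        entry[0] += 1
        entry[1] += seconds
    ranked = sorted(totals.items(), key=lambda kv: kv[1][1], reverse=True)
    return [(fp, count, total) for fp, (count, total) in ranked[:n]]


# Hypothetical slow-query records: (statement, elapsed seconds).
records = [
    ("SELECT * FROM orders WHERE user_id = 42", 1.8),
    ("SELECT * FROM orders WHERE user_id = 7", 2.2),
    ("UPDATE stock SET qty = qty - 1 WHERE sku = 'A100'", 0.9),
    ("SELECT * FROM orders WHERE user_id = 99", 1.0),
]
top = top_slow(records)
for fp, count, total in top:
    print(f"{total:.1f}s over {count} calls: {fp}")
```

The top fingerprints are then the candidates for execution-plan analysis, caching, or refactoring.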

Emergency Plan Organization

Emergency plans are split into business-level measures (service degradation, feature toggles, load shedding) and IT-component-level measures (restart, failover, feature switches). For large-scale incidents, consider rate limiting, multi-active switching, or running with partial functionality, agreed with the business in advance.
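Rate limiting, one of the measures named above, is often implemented as a token bucket. The article does not say which algorithm the team used, so the sketch below is an assumption: a generic token bucket with a replaceable clock, driven here by a fake clock so the behaviour is deterministic.

```python
import time


class TokenBucket:
    """Allow bursts up to `capacity` requests and a sustained `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)
        self.updated = clock()

    def allow(self, cost=1.0):
        """Refill based on elapsed time, then spend `cost` tokens if available."""
        now = self.clock()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# Deterministic demo with a fake clock instead of real time.
fake_now = [0.0]
bucket = TokenBucket(rate=2.0, capacity=2.0, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(3)]   # two requests pass, the third is rejected
fake_now[0] = 0.5                            # 0.5 s later, one token has been refilled
after_wait = bucket.allow()
print(burst, after_wait)
```

Rejected requests would typically receive a fast "busy" response so that the protected service never sees them.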

02 Ensure Continuous Monitoring, No Detail Overlooked

Capture Peak Details and Front‑Line Business Insights

Record real‑time observations during traffic spikes—resource stability, performance fluctuations, error rates, network traffic, and connection limits—using monitoring data, logs, and packet captures to reconstruct events.

Also analyze user behavior during peaks; prioritize core functions while isolating non‑essential features.

In live streaming, core functions are room entry/exit and stream push/pull; auxiliary features like gifting or chat can be disabled under capacity pressure.
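The degradation rule for the live streaming case can be written as a small function that switches off one more auxiliary feature each time a resource threshold is crossed. In this sketch the 95% CPU threshold echoes the example given earlier in the article, while the 85% threshold, the feature identifiers, and the shedding order are assumptions.

```python
# Core functions are never disabled by this rule.
CORE = {"room_entry_exit", "stream_push_pull"}

# Auxiliary features in the order they would be switched off (assumed order).
SHED_ORDER = ["gifting", "chat"]


def enabled_features(cpu_pct, thresholds=(85.0, 95.0)):
    """Disable one more auxiliary feature for each threshold that CPU exceeds."""
    to_shed = sum(cpu_pct > t for t in thresholds)
    return CORE | set(SHED_ORDER[to_shed:])


for cpu in (70.0, 90.0, 97.0):
    print(cpu, sorted(enabled_features(cpu)))
```

Agreeing on the order with the business beforehand means the on-call engineer only has to apply the rule during the event, not negotiate it.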

Comprehensive Post‑Event Review

After an event, conduct a full review of business, application, and platform metrics: monitoring data, top URLs and SQL statements, service chain performance, and platform bandwidth and connection usage. Identify anomalies and remediate them through architecture changes, scaling, or code optimization.
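For the top-URL part of the review, access logs can be aggregated into hit counts and error rates per path. The sketch below assumes log lines that contain a quoted request followed by the status code, as in nginx's "combined" format; the sample lines are invented, and counting only 5xx responses as errors is a choice made for this example.

```python
import re
from collections import defaultdict

# Matches the quoted request and the status code that follows it.
REQUEST = re.compile(r'"(?P<method>[A-Z]+) (?P<target>\S+) [^"]*" (?P<status>\d{3}) ')


def top_urls(lines, n=3):
    """Return up to n (path, hits, error_rate) tuples, busiest path first."""
    stats = defaultdict(lambda: [0, 0])          # path -> [hits, 5xx errors]
    for line in lines:
        match = REQUEST.search(line)
        if match is None:
            continue                             # skip lines that do not parse
        path = match.group("target").split("?", 1)[0]
        stats[path][0] += 1
        if match.group("status").startswith("5"):
            stats[path][1] += 1
    ranked = sorted(stats.items(), key=lambda kv: kv[1][0], reverse=True)
    return [(path, hits, errors / hits) for path, (hits, errors) in ranked[:n]]


# Invented sample lines for illustration.
log_lines = [
    '10.0.0.1 - - [24/Jan/2020:20:00:01 +0800] "GET /api/room/enter?id=1 HTTP/1.1" 200 512 "-" "okhttp"',
    '10.0.0.2 - - [24/Jan/2020:20:00:01 +0800] "GET /api/room/enter?id=2 HTTP/1.1" 502 0 "-" "okhttp"',
    '10.0.0.3 - - [24/Jan/2020:20:00:02 +0800] "POST /api/gift/send HTTP/1.1" 200 64 "-" "okhttp"',
    '10.0.0.4 - - [24/Jan/2020:20:00:02 +0800] "GET /api/room/enter?id=3 HTTP/1.1" 200 512 "-" "okhttp"',
    "malformed line",
]
summary = top_urls(log_lines)
for path, hits, error_rate in summary:
    print(f"{path}: {hits} hits, {error_rate:.0%} errors")
```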

03 Strong Organizational Coordination Is Key

Unified Teamwork and Clear Responsibilities

Effective coordination among operations, business, product, architecture, development, testing, platform, and infrastructure teams—and external vendors—is essential for smooth activity support.

Vendor Communication

Engage vendors (CDN, Oracle, network equipment) early; their low‑level expertise can greatly enhance optimization and issue resolution.

Production Change Review and Transparency

All production changes must undergo formal review, be recorded, and remain transparent to enable rapid root‑cause analysis when issues arise.

Conclusion

Assuring large-scale business activities demands vigilant information gathering, careful analysis, and a practiced emergency response, along with continuous improvement and the experience accumulated from each event.
