
How to Ensure Stability for Billion-Request Websites: Proven Strategies

Keeping a site stable while it handles up to 100,000 requests per minute takes a combination of configuration management, feature toggles, phased deployment, robust error handling, comprehensive logging, real-time monitoring, traffic-aware throttling, service degradation, and disaster-recovery tactics. This guide covers each in turn.


Stability is critical for large‑scale websites that may receive up to 100,000 requests per minute; even a small mistake can cause major failures. This article discusses practical approaches to keep such high‑traffic sites stable.

1. Basic Strategies

1.1 Configuration

Configuration centralizes business‑process data on a platform, separating it from code so that the code handles generic logic while configuration data determines specific runtime behavior. This enables rapid online changes, such as adding, modifying, or deleting configurations, which helps quickly address issues.
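As a minimal sketch of this idea, the class below keeps behavior-determining values in an in-memory map; the `push` method stands in for whatever hook a real configuration platform would call when an operator changes a value online (the class and method names are illustrative, not a specific product's API):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative in-memory config store: generic code reads values at runtime,
// while a configuration platform pushes changes without a redeploy.
public class ConfigStore {
    private static final Map<String, String> CONFIG = new ConcurrentHashMap<>();

    // Called by the configuration platform on every online change.
    public static void push(String key, String value) {
        CONFIG.put(key, value);
    }

    // Generic code looks up behavior-determining values at runtime.
    public static String get(String key, String defaultValue) {
        return CONFIG.getOrDefault(key, defaultValue);
    }
}
```

Because the code only reads keys, an operator can add, modify, or delete a configuration entry and affect behavior immediately, with no release.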

1.2 Business Switches

Business switches control the execution of specific processes in real time. Common types include:

Boolean: enable or disable a flow, e.g., turn a validation on or off.

Number: numeric configuration for a business scenario.

String: textual configuration.

Collection: a set of items that can activate or deactivate certain processes.

Map: key‑value mappings that direct specific handling.
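The five switch types above can all live behind one small registry. The sketch below is an assumed design (the class name and switch keys are hypothetical), showing Boolean, Number, and Collection switches; String and Map switches follow the same pattern:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical switch registry covering the switch types described above.
public class BusinessSwitches {
    private static final Map<String, Object> SWITCHES = new ConcurrentHashMap<>();

    // Updated in real time by an operator or a configuration platform.
    public static void set(String name, Object value) {
        SWITCHES.put(name, value);
    }

    // Boolean switch: enable or disable a flow.
    public static boolean isEnabled(String name) {
        return Boolean.TRUE.equals(SWITCHES.get(name));
    }

    // Number switch: numeric configuration for a business scenario.
    public static long getNumber(String name, long defaultValue) {
        Object v = SWITCHES.get(name);
        return (v instanceof Number) ? ((Number) v).longValue() : defaultValue;
    }

    // Collection switch: activate a process only for items in the set.
    @SuppressWarnings("unchecked")
    public static boolean inCollection(String name, String item) {
        Object v = SWITCHES.get(name);
        return (v instanceof Set) && ((Set<String>) v).contains(item);
    }
}
```

For example, a validation can be turned off by flipping one Boolean switch, with no code change and no redeploy.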

1.3 Deployment Strategy

Facing 100,000 requests per minute, deployment is a major challenge. We adopt a phased, batch‑by‑batch deployment by data center and machine, allowing old and new versions to coexist. Gray‑release gradually increases the new version’s traffic share, enabling rollback if logs or user feedback indicate problems.
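One common way to implement the gray-release traffic split is to hash a stable identifier (such as the user id) into a fixed bucket, so the same user consistently sees the same version while an operator raises the rollout percentage. This is a sketch of that technique, not the author's specific implementation:

```java
// Gray release by stable bucketing: the same user always lands in the same
// 0-99 bucket, so raising rolloutPercent only adds users, never flip-flops them.
public class GrayRelease {

    // Returns true if this request should be routed to the new version.
    public static boolean useNewVersion(String userId, int rolloutPercent) {
        int bucket = Math.floorMod(userId.hashCode(), 100); // stable bucket 0-99
        return bucket < rolloutPercent;
    }
}
```

Starting at, say, 1% and stepping up while watching logs gives a cheap rollback path: setting the percentage back to 0 instantly routes everyone to the old version.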

1.4 Error Handling

A unified error‑handling process classifies error codes to quickly identify error types. Standard prefixes such as CHK (validation failure), THD (third‑party service error), SYS (system error) and suffixes like REQUIRED (missing field), INVALID (data error), EXCEPTION (exception) help pinpoint issues.
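Composing the prefixes and suffixes above yields self-describing codes. The helper below assumes an underscore separator and a middle segment naming the field or dependency (both assumptions, since the article only specifies the prefixes and suffixes):

```java
// Error codes built from the CHK/THD/SYS prefixes and
// REQUIRED/INVALID/EXCEPTION suffixes described above.
public class ErrorCode {

    // e.g. of("CHK", "mobile", "REQUIRED") -> "CHK_MOBILE_REQUIRED"
    public static String of(String prefix, String subject, String suffix) {
        return prefix + "_" + subject.toUpperCase() + "_" + suffix;
    }

    // Classification checks let handlers branch on error type quickly.
    public static boolean isValidation(String code) {
        return code.startsWith("CHK_");
    }

    public static boolean isThirdParty(String code) {
        return code.startsWith("THD_");
    }
}
```

On-call engineers can then tell at a glance whether an alert is a caller-side validation failure (CHK) or a third-party outage (THD), which drives very different responses.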

1.5 Log Collection

Accurate, complete logs are the primary source for diagnosing online problems. Structured logs enable metrics such as service call volume, success rate, and detailed error information, and support real‑time alerting. Logs should follow a unified format and be stored in a centralized analysis service for instant search.
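As one possible unified format (the field names and the pipe-delimited layout here are assumptions, not the article's prescribed schema), each log line can carry the same ordered fields so a central analysis service can parse call volume, success rate, and latency mechanically:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// One candidate structured-log format: fixed, ordered key=value fields
// that a centralized log service can index and aggregate.
public class StructuredLog {

    public static String format(String service, String result, long costMs, String traceId) {
        Map<String, String> fields = new LinkedHashMap<>(); // preserves field order
        fields.put("service", service);
        fields.put("result", result);
        fields.put("costMs", Long.toString(costMs));
        fields.put("traceId", traceId);
        return fields.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining("|"));
    }
}
```

Because every service emits the same shape, one query can compute success rate across the whole fleet, and any single line can be found by trace id in seconds.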

2. Online Monitoring Strategies

2.1 Link Tracing

Each request receives a unique identifier (e.g., UUID) at the entry point, which is propagated through every processing node. This identifier allows reconstruction of the entire request flow from massive logs, enabling full‑link analysis of inputs, outputs, and processing steps.
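A minimal sketch of this propagation, assuming a single thread handles the request (frameworks such as SLF4J's MDC play the same role in production, and async hand-offs need explicit copying):

```java
import java.util.UUID;

// Per-request trace id held in a ThreadLocal so every log statement
// on the request path can include the same identifier.
public class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    // Assign once at the entry point of each request.
    public static String start() {
        String id = UUID.randomUUID().toString();
        TRACE_ID.set(id);
        return id;
    }

    // Read by every processing node and every log statement.
    public static String current() {
        return TRACE_ID.get();
    }

    // Always clear at request end to avoid leaking ids across pooled threads.
    public static void clear() {
        TRACE_ID.remove();
    }
}
```

Searching the central log store for one trace id then reconstructs the full flow of that request: inputs, outputs, and every intermediate step.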

2.2 Exception Monitoring

Exceptions thrown by Java are logged and monitored. Monitoring focuses on exception frequency, stack traces, and trends, allowing rapid pinpointing of problematic code paths or third‑party timeouts.
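Frequency tracking can be as simple as a counter per exception class; the sketch below shows just that core idea (a real monitor would also keep stack traces and timestamps, and ship counts to an alerting system):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Count exceptions by class name so alerting can fire on frequency spikes.
public class ExceptionMonitor {
    private static final Map<String, LongAdder> COUNTS = new ConcurrentHashMap<>();

    // Call from catch blocks or a global exception handler.
    public static void record(Throwable t) {
        COUNTS.computeIfAbsent(t.getClass().getName(), k -> new LongAdder())
              .increment();
    }

    public static long count(String exceptionClass) {
        LongAdder adder = COUNTS.get(exceptionClass);
        return adder == null ? 0 : adder.sum();
    }
}
```

A sudden jump in, say, timeout exceptions from one dependency localizes the problem to that third party before users start reporting it.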

2.3 Machine Monitoring

Key machine metrics—CPU, memory, network usage, JVM thread count, heap usage, Full GC frequency—are continuously observed. During traffic spikes, resource exhaustion can trigger Full GC or service crashes, making machine health a top priority.
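Several of these JVM metrics are available in-process through the standard `java.lang.management` MXBeans; a small sample (production systems would export these to a metrics backend rather than read them ad hoc):

```java
import java.lang.management.ManagementFactory;

// Read a few of the JVM metrics named above via the standard MXBeans.
public class JvmMetrics {

    // Current heap usage in bytes.
    public static long heapUsedBytes() {
        return ManagementFactory.getMemoryMXBean()
                .getHeapMemoryUsage().getUsed();
    }

    // Live thread count, including daemon threads.
    public static int liveThreadCount() {
        return ManagementFactory.getThreadMXBean().getThreadCount();
    }
}
```

GC behavior is similarly observable through `ManagementFactory.getGarbageCollectorMXBeans()`, whose per-collector counts reveal rising Full GC frequency during a spike.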

3. Strategies for Traffic Peaks

3.1 Service Degradation

During events like Double‑Eleven, when traffic exceeds capacity, non‑essential processes (e.g., image validation) are skipped to preserve stability and user experience.
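Degradation pairs naturally with the business switches from section 1.2: guard the non-essential step with a flag that operators can flip under load. The flag and counter below are illustrative, not a specific production design:

```java
// Degradation sketch: when the switch is flipped during a traffic peak,
// the non-essential step is skipped and the core flow still completes.
public class Degradation {
    public static volatile boolean skipImageValidation = false; // flipped via a business switch
    public static int validationCalls = 0; // for illustration only

    public static void processOrder(String orderId) {
        if (!skipImageValidation) {
            validationCalls++; // stands in for the expensive, non-essential step
        }
        // core order flow always runs, degraded or not
    }
}
```

The key property is that flipping the switch changes behavior immediately, with no deployment, so capacity is reclaimed in seconds when the peak hits.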

3.2 Service Rate Limiting

A pre‑estimated threshold limits request processing; requests exceeding the threshold are rejected with an error response. This self‑protective measure is essential for peak events such as flash sales.
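One simple realization of a pre-estimated threshold is a concurrency cap: admit at most N requests at a time and reject the rest immediately. This is a sketch of that approach (production systems often prefer token-bucket limiters, e.g. Guava's `RateLimiter`):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Fixed-threshold limiter: at most `limit` requests in flight at once;
// callers must release() when the request finishes.
public class SimpleRateLimiter {
    private final int limit;
    private final AtomicInteger inFlight = new AtomicInteger();

    public SimpleRateLimiter(int limit) {
        this.limit = limit;
    }

    // Returns false once the threshold is reached; the caller should
    // reject the request with an error response instead of queuing it.
    public boolean tryAcquire() {
        if (inFlight.incrementAndGet() > limit) {
            inFlight.decrementAndGet();
            return false;
        }
        return true;
    }

    public void release() {
        inFlight.decrementAndGet();
    }
}
```

Rejecting fast keeps the admitted requests within capacity, which is exactly the self-protection a flash-sale peak requires.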

3.3 Disaster Recovery

A single machine offers essentially no disaster-recovery capability; distributed systems achieve high availability through multi-active deployments behind load balancers such as Nginx or Apache. Large services (Taobao, Tmall, WeChat, JD) employ active-active, multi-data-center architectures to limit the impact of any single site failing.

4. Conclusion

This article outlines the key points to consider for maintaining stability in billion‑request websites, covering configuration, switches, deployment, error handling, logging, monitoring, traffic‑aware throttling, degradation, and disaster recovery.

Tags: monitoring, deployment, rate limiting, large-scale systems, stability
Written by Java Backend Technology

Focus on Java-related technologies: SSM, Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading. Occasionally cover DevOps tools like Jenkins, Nexus, Docker, and ELK. Also share technical insights from time to time, committed to Java full-stack development!
