Mastering Service Degradation: Strategies to Keep High‑Concurrency Systems Running

This article explores comprehensive service degradation techniques—including automatic and manual switchovers, read/write and multi‑level fallback strategies, and practical examples like timeout, failure count, and traffic throttling—to ensure core functionality remains available during traffic spikes or component failures in high‑concurrency systems.

ITFLY8 Architecture Home

Oct 11, 2016

Source: http://jinnianshilongnian.iteye.com/blog/2306477

When building high‑concurrency systems, three essential tools protect the system: caching, degradation, and rate limiting. Previous articles have covered caching and rate limiting; this article focuses on degradation.

The goal of degradation is to keep core services available, even if the service is provided with reduced functionality. Some services cannot be degraded (e.g., adding to cart, checkout).

Degradation Plans

Before implementing degradation, review the system to determine which components must be protected at all costs and which can be safely degraded. The plan can be organized by severity, borrowing the levels used in logging:

Normal: Services may occasionally time out because of network jitter or an ongoing deployment; they can be degraded automatically.

Warning: If a service’s success rate fluctuates (e.g., between 95% and 100%), it can be degraded automatically or manually, and an alert is sent.

Error: When availability drops below 90%, the database connection pool is exhausted, or traffic suddenly exceeds the system’s maximum threshold, automatic or manual degradation may be triggered.

Critical Error: In special cases where data is corrupted, an urgent manual degradation is required.

Degradation can be classified by automation (automatic switch vs. manual switch), by function (read‑service degradation, write‑service degradation), and by system layer (multi‑level degradation).

Degradation points are identified by walking the server‑side call chain and asking where degradation may be needed:

Page degradation: During major promotions or special events, certain pages may consume scarce resources; the entire page can be degraded to preserve core services.

Page fragment degradation: For example, if the merchant section on a product detail page has data errors, that fragment can be degraded.

Asynchronous request degradation: If asynchronous components such as recommendations or delivery info load slowly or fail, they can be degraded.

Service feature degradation: Non‑essential services like related categories or hot‑sale lists can be omitted when they encounter issues.

Read degradation: In multi‑level cache scenarios, if the backend service fails, the system can fall back to read‑only cache, suitable for scenarios with relaxed read consistency requirements.

Write degradation: In flash‑sale scenarios, updates can be written to cache first and synchronized to the database asynchronously, achieving eventual consistency while degrading the DB write path.

Crawler degradation: During peak events, crawler traffic can be redirected to static pages or return empty data to protect backend resources.

Automatic Switch Degradation

Automatic degradation is triggered based on system load, resource usage, SLA metrics, etc.

Timeout degradation: When a database, HTTP service, or remote call responds slowly and the service is non‑core, it can be automatically degraded after a timeout. For example, recommendation or review sections on a product page can be omitted without significantly affecting the purchase flow.

Relevant configuration articles on HTTP client timeout settings and DBCP timeout settings provide guidance on setting appropriate timeout values and retry mechanisms.
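
Timeout degradation can be sketched as follows. This is a minimal illustration rather than a production implementation: `call_with_timeout`, the shared thread pool, and the two recommendation functions are hypothetical names introduced for the example, and the timeout values are arbitrary.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Hypothetical pool shared by non-core calls (assumption for this sketch).
_executor = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(func, timeout_s, fallback):
    """Run func in a worker thread; return fallback if it is too slow or fails."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        # Best effort only: a call that is already running is not interrupted.
        future.cancel()
        return fallback
    except Exception:
        # Any other failure of a non-core call is degraded the same way.
        return fallback


def slow_recommendations():
    """Stand-in for a non-core remote call that responds slowly."""
    time.sleep(0.5)
    return ["item-1", "item-2"]


def fast_recommendations():
    """Stand-in for the same call when the dependency is healthy."""
    return ["item-1", "item-2"]


# The page renders with an empty recommendation section instead of blocking.
recommendations = call_with_timeout(slow_recommendations, 0.05, [])
```

Because the worker thread keeps running after the timeout, the underlying client should still have its own connect and read timeouts so that slow calls do not pile up in the pool.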

Failure‑count degradation: Unstable external APIs (e.g., airline ticket services) can trigger automatic degradation after a certain number of failures, with periodic health checks to restore service when it recovers.
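
One possible shape for failure‑count degradation is a small circuit‑breaker‑style wrapper. The class name `FailureCountBreaker`, the thresholds, and the injectable clock are assumptions made for this sketch; it counts consecutive failures and lets a single call through as a health probe once the cooldown has elapsed.

```python
import time


class FailureCountBreaker:
    """Degrade after max_failures consecutive failures; probe again after a cooldown."""

    def __init__(self, max_failures=3, retry_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.retry_after_s = retry_after_s
        self.clock = clock          # injectable so the sketch can be tested without sleeping
        self.failures = 0
        self.opened_at = None       # None means the service is not degraded

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.retry_after_s:
                return fallback     # degraded: skip the remote call entirely
            # Cooldown elapsed: fall through and use this call as a health probe.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()   # (re)start the cooldown
            return fallback
        # Success: the service has recovered, so reset all state.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each request to the unstable API, for example `breaker.call(lambda: query_flights(route), fallback=[])`, where `query_flights` stands for whatever client function the application already has.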

Fault degradation: If a remote service is down (network/DNS failure, HTTP error status, RPC exception), fallback options include returning default values, using fallback data, or serving cached data.
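
The fallback order for fault degradation might look like the sketch below: try the remote service, fall back to the last cached value, and finally to a default. `fetch_with_fallback` and its parameters are hypothetical, and a plain dict stands in for whatever cache the system actually uses.

```python
def fetch_with_fallback(key, remote_call, cache, default):
    """Return the remote value, else the last cached value, else a default."""
    try:
        value = remote_call(key)
    except Exception:
        # Remote service is down (network error, RPC exception, ...):
        # serve the last known good value if there is one, otherwise the default.
        return cache.get(key, default)
    cache[key] = value  # remember the last known good value for later fallbacks
    return value
```

A real client would also treat HTTP error statuses as failures; here any exception raised by `remote_call` triggers the fallback.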

Rate‑limit degradation: During flash‑sale events, excessive traffic may cause system collapse; rate limiting can trigger degradation, redirecting users to a queue page, out‑of‑stock notice, or an error page suggesting retry later.
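
As an illustration of rate‑limit degradation, the sketch below uses a token bucket, which is one common limiter and not necessarily the one the original system used. `TokenBucket`, `handle_flash_sale`, and the response fields are hypothetical names chosen for the example.

```python
import time


class TokenBucket:
    """Simple token bucket: refills at rate_per_s, holds at most capacity tokens."""

    def __init__(self, rate_per_s, capacity, clock=time.monotonic):
        self.rate = rate_per_s
        self.capacity = capacity
        self.clock = clock              # injectable for testing
        self.tokens = float(capacity)
        self.last = clock()

    def try_acquire(self):
        now = self.clock()
        # Refill according to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


def handle_flash_sale(bucket, place_order):
    """Serve the request if a token is available, otherwise degrade to a queue page."""
    if not bucket.try_acquire():
        # Over the limit: do not touch the backend, send the user to a queue page.
        return {"status": 429, "page": "queue"}
    return {"status": 200, "order": place_order()}
```

This version is not thread‑safe; a multi‑threaded server would guard `try_acquire` with a lock or use a limiter provided by its framework.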

Manual Switch Degradation

During large promotions, monitoring may reveal problematic services that need to be temporarily disabled. Tasks that depend on overloaded databases can be paused, and processing modes can be switched (e.g., from synchronous to asynchronous). Switches can be stored in configuration files, databases, Redis, or ZooKeeper, and synchronized periodically (e.g., every second) to decide degradation based on a key’s value.

New services undergoing gray‑release can also use switches to roll back to the previous version if issues arise. In multi‑data‑center deployments, switches can redirect traffic to another data center when one fails.

Feature‑specific switches can temporarily hide problematic functionalities, such as faulty product specification data that cannot be fixed by a rollback.
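
A periodically synchronized switch, as described in this section, could be sketched like this. `SwitchStore`, the `"on"` value convention, and the key name are assumptions for illustration; `load_switches` stands for whatever reads the configuration file, database, Redis, or ZooKeeper.

```python
import time


class SwitchStore:
    """Caches degradation switches locally and re-reads them at a fixed interval."""

    def __init__(self, load_switches, refresh_interval_s=1.0, clock=time.monotonic):
        self._load = load_switches          # callable returning a mapping key -> value
        self._interval = refresh_interval_s
        self._clock = clock                 # injectable for testing
        self._switches = dict(load_switches())
        self._loaded_at = clock()

    def is_degraded(self, key):
        now = self._clock()
        if now - self._loaded_at >= self._interval:
            try:
                self._switches = dict(self._load())
            except Exception:
                pass    # config source unavailable: keep the last known values
            self._loaded_at = now
        # Assumed convention for this sketch: the value "on" means degraded.
        return self._switches.get(key) == "on"
```

Request handlers then check `store.is_degraded("degrade.reviews")` (a hypothetical key) before calling the service, so an operator can flip the value in the configuration source and have it take effect within one refresh interval.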

Read‑Service Degradation

Typical strategies include switching to read‑only cache or static content, or completely blocking read access. In multi‑level cache architectures (edge cache → local cache → distributed cache → RPC/DB), switches at the edge or application layer can automatically degrade to avoid calling downstream services when they are unhealthy, suitable for scenarios with low read‑consistency requirements.
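
The read path through a multi‑level cache with a degradation switch might be sketched as follows. The function name, the dict‑based caches, and the `backend_degraded` callable are assumptions for the example; the edge cache is omitted for brevity.

```python
def read_product(key, local_cache, dist_cache, load_from_backend, backend_degraded):
    """Read through local cache, then distributed cache, then the backend.

    When the backend is degraded, a cache miss returns None instead of
    calling the unhealthy downstream service (read-only cache mode).
    """
    if key in local_cache:
        return local_cache[key]
    if key in dist_cache:
        local_cache[key] = dist_cache[key]   # promote to the closer cache
        return local_cache[key]
    if backend_degraded():
        return None                          # degraded: do not call RPC/DB
    value = load_from_backend(key)
    dist_cache[key] = value
    local_cache[key] = value
    return value
```

Returning `None` on a miss is only one option; the caller could equally render a placeholder or a static fragment, depending on how much staleness the page tolerates.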

Page degradation, fragment degradation, and asynchronous request degradation are all forms of read‑service degradation aimed at preserving core resources.

Static‑page fallback: during high‑traffic events, dynamic pages can be replaced with pre‑generated static pages to reduce core resource usage and improve performance. Conversely, if static pages fail, the system can revert to dynamic rendering.

Write‑Service Degradation

Write services are generally non‑degradable, but indirect tactics such as converting synchronous operations to asynchronous ones or limiting write volume can help. Example inventory deduction approaches:

Plan 1: Deduct from DB, then update Redis.

Plan 2: Deduct from Redis, then sync to DB; roll back Redis if DB update fails.

Plan 3: Deduct from Redis, sync to DB; if DB cannot keep up, degrade to sending a message to asynchronously deduct DB inventory, achieving eventual consistency.

Plan 4: Deduct from Redis and sync to DB under normal load; when performance is insufficient, degrade to writing a deduction message locally on the same machine, which is then applied to the DB asynchronously.

These approaches allow normal synchronous deduction under normal load and switch to asynchronous deduction during spikes, protecting the system. Similar tactics apply to order creation, user reviews, etc.
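
Plan 3 can be sketched roughly as below. Everything here is a stand‑in: a dict plays the role of Redis, a deque plays the role of the message queue, and `InventoryService`, `db_deduct`, and `async_mode` are hypothetical names. The check‑then‑decrement on the dict is not atomic; with a real Redis instance this step would need an atomic operation.

```python
from collections import deque


class InventoryService:
    """Deduct from the cache first; write to the DB synchronously or via a queue."""

    def __init__(self, stock, db_deduct, async_mode):
        self.cache_stock = dict(stock)   # stand-in for Redis inventory counters
        self.db_deduct = db_deduct       # callable(sku, qty): synchronous DB write
        self.async_mode = async_mode     # callable() -> bool: the degradation switch
        self.pending = deque()           # stand-in for a message queue

    def deduct(self, sku, qty):
        if self.cache_stock.get(sku, 0) < qty:
            return False                 # out of stock
        self.cache_stock[sku] -= qty     # NOT atomic here; real Redis needs an atomic op
        if self.async_mode():
            # Degraded path: record a message and let the DB catch up later.
            self.pending.append((sku, qty))
            return True
        try:
            self.db_deduct(sku, qty)
        except Exception:
            self.cache_stock[sku] += qty # roll back the cache, as in Plan 2
            raise
        return True

    def drain(self):
        """Apply queued deductions to the DB; returns how many were applied."""
        applied = 0
        while self.pending:
            sku, qty = self.pending[0]
            self.db_deduct(sku, qty)
            self.pending.popleft()       # remove only after the DB write succeeded
            applied += 1
        return applied
```

While the switch is on, the cache and the DB disagree until `drain` runs, which is the eventual consistency the plans above accept in exchange for protecting the DB.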

Multi‑Level Degradation

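Caching is most effective close to the user, and degradation likewise protects the system best when applied close to the user. Because business complexity grows toward the back end, the QPS/TPS a layer can sustain tends to fall the deeper a request travels, so a request degraded early never reaches the most constrained layers.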

Page‑JS degradation switch: Controls page‑level feature toggles via JavaScript.

Access‑layer degradation switch: Controls entry‑point degradation; requests first pass through the access layer where switches can trigger automatic or manual degradation. Refer to “JD Product Detail Page Service Closed‑Loop Practice” for examples.

Application‑layer degradation switch: Controls business‑level degradation within the application.
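
How the access‑layer and application‑layer switches combine can be illustrated with a small sketch. The key prefixes `access.` and `app.`, the feature list, and `handle_request` are all hypothetical; in practice the access layer is usually a separate tier in front of the application rather than a function in the same process.

```python
def handle_request(path, switches, render_page):
    """Check the access-layer switch first, then application-layer feature switches."""
    # Access layer: degrade the whole entry point before any business logic runs.
    if switches.get("access." + path) == "on":
        return {"layer": "access", "body": "static fallback"}

    # Application layer: drop individual features whose switch is on.
    features = ["price", "reviews", "recommendations"]
    enabled = [f for f in features if switches.get("app." + f) != "on"]
    return {"layer": "application", "body": render_page(path, enabled)}
```

The ordering reflects the idea stated at the start of this section: a request degraded at the access layer never consumes application or backend resources.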

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
