Mastering High Concurrency: Metrics, Throttling, and Trade‑offs for Scalable Systems

This article explains the architectural mindset behind high concurrency, defines key performance metrics such as TPS, RPS, RT and VU, analyzes test results, and presents practical techniques like rate limiting, frequency limiting, degradation and caching to balance performance, availability, and resource usage.

dbaplus Community

Jul 14, 2023

What Is High Concurrency?

High concurrency is not just a raw number of requests; it is an architectural thinking pattern that helps you choose appropriate technical measures to boost a system’s processing capability while staying aligned with real‑world business needs.

Performance Is the Foundation

Performance is one of the core goals of high concurrency. It directly influences other capabilities such as availability and data consistency. Common performance indicators include:

TPS (Transactions Per Second) – number of business transactions processed per second.

RPS (Requests Per Second) – number of incoming requests per second (often called QPS).

RT (Response Time) – latency from request issuance to response, usually measured in ms or µs.

VU (Virtual Users) – number of concurrent users simultaneously issuing requests.

These metrics should be evaluated together; for example, a high TPS is of little value if RT is longer than users will tolerate.

Performance Test Insights

A sample test shows TPS rising with VU until a saturation point, after which RT grows sharply. While RT stays at or below 25 ms, the relationship TPS = VU / RT holds (with RT expressed in seconds), yielding a maximum TPS of 65 000 at a tolerable VU of 1 625 (1 625 / 0.025 s = 65 000). The adjusted conclusions also take percentile response times into account (P99 = 50 ms, P95 = 36 ms, P90 = 23 ms) and set a target average RT ≤ 30 ms.
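The arithmetic behind these figures can be checked with a short sketch. It assumes RT is given in milliseconds and that each virtual user issues its next request immediately; the function names and the sample latencies are illustrative and do not come from the original test.

```python
import math


def max_tps(virtual_users, avg_rt_ms):
    """Throughput implied by TPS = VU / RT, with RT converted from ms to seconds."""
    return virtual_users * 1000.0 / avg_rt_ms


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


if __name__ == "__main__":
    # Figures quoted in the text: 1 625 users at 25 ms each.
    print(max_tps(1625, 25))  # 65000.0

    # Made-up latency samples, only to show how P90 and P99 are read off.
    rts = [12, 15, 18, 20, 21, 22, 23, 30, 36, 50]
    print(percentile(rts, 90), percentile(rts, 99))
```

Percentiles matter here because an average can hide a slow tail: a handful of very slow requests barely moves the mean but shows up clearly in P99.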

The formula above is a simple application of Little’s Law; a fuller capacity estimate (for example, one that accounts for user think time) is omitted here for brevity.
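For reference, Little’s Law relates the average number of requests in the system $L$, the arrival rate $\lambda$, and the average time each request spends in the system $W$. Mapping $L$ to VU, $\lambda$ to TPS and $W$ to RT is an assumption that holds only when users issue requests back to back; the think time $Z$ in the second line is a symbol introduced here, not in the original article.

```latex
\begin{aligned}
L &= \lambda W
  &&\Longrightarrow\quad \mathrm{VU} = \mathrm{TPS} \times \mathrm{RT}
  &&\Longrightarrow\quad 1625 = 65\,000\ \mathrm{s^{-1}} \times 0.025\ \mathrm{s} \\
\mathrm{VU} &= \mathrm{TPS} \times (\mathrm{RT} + Z)
  &&\text{(when each user pauses for an average think time } Z\text{)}
\end{aligned}
```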

Control Techniques: The Three “Moves”

1️⃣ Rate Limiting (限流)

Rate limiting caps the request rate within a time window. Two classic algorithms are:

Leaky Bucket – requests enter a bucket and are drained at a constant rate, smoothing bursts.

Token Bucket – tokens are added at a steady rate; a request proceeds only if a token is available.

Both have trade‑offs: the leaky bucket protects downstream resources with a steady outflow but delays or drops bursts; the token bucket admits bursts up to the bucket capacity, which preserves average throughput but can momentarily overload downstream services.
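The difference between the two algorithms is easiest to see side by side. The following is a minimal single-process sketch, not a production limiter: the class names, the `now` parameter (injected so the behaviour can be reproduced without sleeping) and the queue-based form of the leaky bucket are assumptions made for illustration.

```python
import time
from collections import deque
from typing import Optional


class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`; one token admits one request."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst is admitted
        self.last = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill for the elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class LeakyBucket:
    """Requests queue in the bucket and leave at a constant `rate` per second."""

    def __init__(self, rate: float, capacity: int, now: Optional[float] = None):
        self.interval = 1.0 / rate
        self.capacity = capacity
        self.queue = deque()
        self.next_release = time.monotonic() if now is None else now

    def offer(self, item, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        if len(self.queue) >= self.capacity:
            return False  # bucket overflows: the request is dropped
        if not self.queue:
            # Idle time earns no credit: the first item leaves no earlier than now.
            self.next_release = max(self.next_release, now)
        self.queue.append(item)
        return True

    def drain(self, now: Optional[float] = None) -> list:
        """Return the queued requests that are due for release by `now`."""
        now = time.monotonic() if now is None else now
        released = []
        while self.queue and self.next_release <= now:
            released.append(self.queue.popleft())
            self.next_release += self.interval
        return released
```

With a rate of 10 per second and a capacity of 5, a burst of 8 simultaneous requests is treated the same at the door by both (5 accepted, 3 turned away), but the token bucket passes all 5 downstream at once, while the leaky bucket releases them one every 100 ms as `drain` is called.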

2️⃣ Frequency Limiting (降频)

Frequency limiting controls the rate of requests sharing a specific characteristic (e.g., per user, IP, device). Rules typically follow a three‑part pattern: feature + time window + request count. The limit should reflect realistic human interaction rates, not just raw performance capacity.

Enforcement can use a challenge‑response step (such as a CAPTCHA) or outright rejection. The two are often layered progressively, challenging first and rejecting only on continued violations, to reduce false positives.
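A rule of the form feature + time window + request count could be sketched as a per-key sliding window. The class name, the two thresholds (`soft_limit` triggers a challenge, `hard_limit` triggers rejection) and the key format are assumptions chosen to illustrate progressive layering; they are not taken from the original article.

```python
from collections import defaultdict, deque


class FrequencyLimiter:
    """Per-feature sliding window with a soft (challenge) and a hard (reject) threshold."""

    def __init__(self, soft_limit: int, hard_limit: int, window_s: float):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.window_s = window_s
        self.history = defaultdict(deque)  # feature value -> timestamps of counted requests

    def check(self, key: str, now: float) -> str:
        stamps = self.history[key]
        # Forget requests that have slid out of the time window.
        while stamps and now - stamps[0] >= self.window_s:
            stamps.popleft()
        if len(stamps) >= self.hard_limit:
            return "reject"
        stamps.append(now)
        return "allow" if len(stamps) <= self.soft_limit else "challenge"


if __name__ == "__main__":
    # Hypothetical rule: per user, 60-second window, challenge after 3, reject after 5.
    limiter = FrequencyLimiter(soft_limit=3, hard_limit=5, window_s=60)
    for second in range(7):
        print(second, limiter.check("user:42", now=float(second)))
```

Each feature value gets its own window, so one noisy user or IP does not consume the allowance of others.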

3️⃣ Degradation (降级)

When a system is overloaded, degradation trims non‑essential services to preserve core functionality. Identify core services (e.g., login, order placement) using an 80/20 (Pareto) analysis, then map the critical paths those services depend on. During overload, downgrade non‑core or quasi‑core services first, optionally applying rate limiting instead of outright denial.
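The degradation order described above can be expressed as a small decision table. In this sketch, login and order placement are the core services named in the text; the other two service names, the load thresholds and the action labels are hypothetical and would need tuning against real measurements.

```python
from enum import Enum


class Tier(Enum):
    CORE = "core"
    QUASI_CORE = "quasi-core"
    NON_CORE = "non-core"


# Hypothetical classification; only login and order placement come from the text.
SERVICES = {
    "login": Tier.CORE,
    "order_placement": Tier.CORE,
    "order_history": Tier.QUASI_CORE,
    "recommendations": Tier.NON_CORE,
}

# Assumed thresholds on a 0..1 load signal (e.g., CPU or queue utilisation).
SHED_NON_CORE_AT = 0.80
SHED_QUASI_CORE_AT = 0.95


def decide(service: str, load: float) -> str:
    """Return 'serve', 'rate_limit' or 'degrade' for a service at the given load."""
    tier = SERVICES[service]
    if tier == Tier.CORE:
        return "serve"  # core paths are protected for as long as possible
    if tier == Tier.NON_CORE:
        return "degrade" if load >= SHED_NON_CORE_AT else "serve"
    # Quasi-core: throttle first, switch off only under extreme load.
    if load >= SHED_QUASI_CORE_AT:
        return "degrade"
    if load >= SHED_NON_CORE_AT:
        return "rate_limit"
    return "serve"
```

Keeping the tier assignment in data rather than in code makes it easier to review the classification with the business owners who decide what counts as core.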

Caching as a Performance Lever

Caching hot data can dramatically improve read latency, but static vs. dynamic data must be considered. Over‑aggressive caching of dynamic data can cause stale reads; cache expiration times become the “balance point” between performance gain and consistency loss.
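The role of the expiration time as a balance point shows up in even the simplest read-through cache. This is a sketch under stated assumptions: the class name, the `loader` callback and the injectable `clock` are illustrative, and there is no eviction or concurrency control.

```python
import time


class TTLCache:
    """Read-through cache: entries are served until they are `ttl_s` seconds old."""

    def __init__(self, ttl_s: float, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self.store = {}  # key -> (value, expires_at)

    def get(self, key, loader):
        now = self.clock()
        entry = self.store.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # cache hit: fast, but possibly up to ttl_s out of date
        value = loader(key)  # cache miss or expired: go back to the source of truth
        self.store[key] = (value, now + self.ttl_s)
        return value


if __name__ == "__main__":
    fake_now = [0.0]  # mutable fake clock so the demo needs no sleeping
    database = {"price": 100}
    cache = TTLCache(ttl_s=5.0, clock=lambda: fake_now[0])

    print(cache.get("price", database.get))  # loaded from the database
    database["price"] = 120                  # the underlying data changes
    fake_now[0] = 4.0
    print(cache.get("price", database.get))  # still the cached (stale) value
    fake_now[0] = 5.0
    print(cache.get("price", database.get))  # entry expired, fresh value loaded
```

A longer TTL means fewer trips to the database and a longer window of stale reads; a shorter TTL trades the other way.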

Single‑Node vs. Global Rate Limiting

The leaky‑bucket and token‑bucket algorithms described above keep their state on a single node. Single‑node limiting protects each instance, but when traffic is spread unevenly, a busy node rejects requests while other nodes still have headroom. Global limiting enforces a cluster‑wide quota, which reduces overall request loss but adds coordination overhead and potential performance penalties under high load.

Choosing between them depends on the acceptable trade‑off between precision and overhead.
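The effect of uneven traffic can be illustrated with a toy simulation of one time window. It uses a plain fixed-window counter rather than a bucket algorithm to keep the comparison short, and the in-process `shared` counter merely stands in for whatever external store a real cluster would coordinate through; all names and numbers are assumptions.

```python
class FixedWindowCounter:
    """Admit at most `limit` requests within one window."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        if self.count < self.limit:
            self.count += 1
            return True
        return False


def admitted_single_node(traffic, cluster_limit):
    """Each node independently enforces an equal share of the cluster limit."""
    per_node = cluster_limit // len(traffic)
    total = 0
    for requests in traffic:
        counter = FixedWindowCounter(per_node)
        total += sum(counter.allow() for _ in range(requests))
    return total


def admitted_global(traffic, cluster_limit):
    """All nodes consult one shared counter (a network round trip per request in practice)."""
    shared = FixedWindowCounter(cluster_limit)
    return sum(shared.allow() for requests in traffic for _ in range(requests))


if __name__ == "__main__":
    traffic = [90, 10, 20]  # requests reaching three nodes in the same window
    print(admitted_single_node(traffic, cluster_limit=90))  # 60
    print(admitted_global(traffic, cluster_limit=90))       # 90
```

With the same cluster quota of 90, per-node limits admit only 60 of the 120 requests because the busy node exhausts its share of 30, while the global counter admits the full 90 at the price of a shared call on every request.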

Takeaways

High concurrency design is a series of informed trade‑offs: improve performance while managing cost, balance availability against consistency, and apply appropriate control mechanisms (rate limiting, frequency limiting, degradation, caching) to keep systems responsive under bursty traffic.

Understanding these principles equips engineers to evolve architectures responsibly rather than chasing a single “magic” solution.
