Mastering High Concurrency: Metrics, Throttling, and Trade‑offs for Scalable Systems

This article explains the architectural mindset behind high concurrency, defines key performance metrics such as TPS, RPS, RT and VU, analyzes test results, and presents practical techniques like rate limiting, frequency limiting, degradation and caching to balance performance, availability, and resource usage.

dbaplus Community

What Is High Concurrency?

High concurrency is not just a raw number of requests; it is an architectural thinking pattern that helps you choose appropriate technical measures to boost a system’s processing capability while staying aligned with real‑world business needs.

Performance Is the Foundation

Performance is one of the core goals of high concurrency. It directly influences other capabilities such as availability and data consistency. Common performance indicators include:

TPS (Transactions Per Second) – number of business transactions processed per second.

RPS (Requests Per Second) – number of incoming requests per second (often called QPS).

RT (Response Time) – latency from request issuance to response, usually measured in ms or µs.

VU (Virtual Users) – number of simulated users issuing requests concurrently during a load test.

These metrics should be evaluated together; for example, a high TPS is of little value if RT is longer than users will tolerate.

Performance Test Insights

A sample test shows TPS rising with VU until a saturation point, after which RT grows sharply. While RT stays at or below 25 ms, the relationship TPS = VU / RT holds (with RT expressed in seconds), yielding a maximum TPS of 65 000 at a tolerable VU of 1 625 (1 625 / 0.025 s = 65 000). The adjusted conclusions also take percentile response times into account (P99 = 50 ms, P95 = 36 ms, P90 = 23 ms), together with a target average RT ≤ 30 ms.
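The arithmetic behind those figures can be checked with a short sketch. The function names `max_tps` and `tolerable_vu` are illustrative and not from the original test; the only inputs are the VU, TPS and RT values quoted above.

```python
def max_tps(virtual_users: float, rt_ms: float) -> float:
    """TPS = VU / RT, with RT given in milliseconds and converted to seconds."""
    return virtual_users * 1000.0 / rt_ms


def tolerable_vu(tps: float, rt_ms: float) -> float:
    """The same relationship rearranged: VU = TPS * RT."""
    return tps * rt_ms / 1000.0


# Figures quoted in the sample test: 1 625 VU at RT = 25 ms.
print(max_tps(1625, 25))        # 65000.0
print(tolerable_vu(65000, 25))  # 1625.0
```

The relationship only holds below the saturation point; once RT starts to climb, adding VU no longer raises TPS.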

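The formula above is the zero‑think‑time form of Little’s Law (concurrency = throughput × response time). A fuller treatment that accounts for think time and queueing gives more accurate capacity estimates, though it is omitted here for brevity.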

Control Techniques: The Three “Moves”

1️⃣ Rate Limiting (限流)

Rate limiting caps the request rate within a time window. Two classic algorithms are:

Leaky Bucket – requests enter a bucket and are drained at a constant rate, smoothing bursts.

Token Bucket – tokens are added at a steady rate; a request proceeds only if a token is available.

Both have trade‑offs: the leaky bucket protects resources with a constant outflow but may drop or delay bursts; the token bucket preserves average throughput and admits bursts up to the bucket size, but those bursts can overwhelm external services.
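To make the contrast concrete, here is a minimal single‑process sketch of both algorithms. The class names, parameters and the explicit `now` argument (used instead of a real clock so the behaviour is reproducible) are assumptions for illustration. The leaky bucket is shown in its "meter" form, which rejects overflow rather than queueing and delaying it.

```python
class TokenBucket:
    """Tokens refill at `rate` per second up to `capacity`; each request spends one."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start full, so an initial burst is admitted
        self.last = now

    def allow(self, now: float) -> bool:
        # Credit tokens for the elapsed time, capped at the bucket capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class LeakyBucket:
    """The bucket drains at `rate` per second; requests that would overflow are dropped."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.level = 0.0  # amount of work currently held in the bucket
        self.last = now

    def allow(self, now: float) -> bool:
        # Drain for the elapsed time, never below empty.
        self.level = max(0.0, self.level - (now - self.last) * self.rate)
        self.last = now
        if self.level + 1 <= self.capacity:
            self.level += 1
            return True
        return False


# A burst of four requests at t = 0, then one more a second later.
tb = TokenBucket(rate=1, capacity=3)
print([tb.allow(0.0) for _ in range(4)], tb.allow(1.0))

lb = LeakyBucket(rate=1, capacity=2)
print([lb.allow(0.0) for _ in range(4)], lb.allow(1.0))
```

In this sketch the token bucket starts full and lets the first three burst requests through immediately, while the leaky bucket's capacity bounds how much work can be outstanding at once.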

2️⃣ Frequency Limiting (降频)

Frequency limiting controls the rate of requests sharing a specific characteristic (e.g., per user, IP, device). Rules typically follow a three‑part pattern: feature + time window + request count. The limit should reflect realistic human interaction rates, not just raw performance capacity.

Enforcement can use challenge‑response (for example, a CAPTCHA) or outright rejection, often applied in progressive layers to reduce false positives.
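The "feature + time window + request count" rule and the progressive layers can be sketched as a per‑key sliding window. Everything named here (`FrequencyLimiter`, the `challenge_at` and `reject_at` thresholds, the key format) is hypothetical, and the thresholds are placeholders rather than recommended values.

```python
from collections import defaultdict, deque


class FrequencyLimiter:
    """Counts requests per key inside a sliding window and escalates in layers."""

    def __init__(self, window_s: float, challenge_at: int, reject_at: int):
        self.window_s = window_s
        self.challenge_at = challenge_at  # above this count: ask for a challenge
        self.reject_at = reject_at        # above this count: reject outright
        self.hits = defaultdict(deque)    # key -> timestamps of recent requests

    def check(self, key: str, now: float) -> str:
        q = self.hits[key]
        # Forget requests that have fallen out of the window.
        while q and q[0] <= now - self.window_s:
            q.popleft()
        # Every attempt is recorded, including ones that end up challenged or rejected.
        q.append(now)
        if len(q) > self.reject_at:
            return "reject"
        if len(q) > self.challenge_at:
            return "challenge"
        return "allow"


limiter = FrequencyLimiter(window_s=10, challenge_at=2, reject_at=4)
for t in range(5):
    print(t, limiter.check("user:42", now=float(t)))
```

Counting rejected attempts toward the window is a deliberate choice in this sketch: a client that keeps hammering stays blocked until it actually slows down.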

3️⃣ Degradation (降级)

When a system is overloaded, degradation trims non‑essential services to preserve core functionality. Identify core services (e.g., login, order placement) using an 80/20 (Pareto) analysis, then map the critical paths those services depend on. During overload, downgrade non‑core or quasi‑core services first, optionally applying rate limiting instead of outright denial.
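One way to express that policy is a small decision table keyed by service tier. This is a sketch under stated assumptions: the tier names, the `order_history` and `recommendations` services, and the 0.7 / 0.9 load thresholds are all hypothetical; only login and order placement come from the text above.

```python
from enum import IntEnum


class Tier(IntEnum):
    CORE = 0
    QUASI_CORE = 1
    NON_CORE = 2


# Hypothetical classification; a real one would come from the 80/20 analysis.
SERVICE_TIERS = {
    "login": Tier.CORE,
    "order_placement": Tier.CORE,
    "order_history": Tier.QUASI_CORE,
    "recommendations": Tier.NON_CORE,
}


def decide(service: str, load: float) -> str:
    """Return 'serve', 'rate_limit' or 'degrade' for a load ratio between 0.0 and 1.0."""
    tier = SERVICE_TIERS.get(service, Tier.NON_CORE)  # unknown services count as non-core
    if tier == Tier.CORE:
        return "serve"  # core paths are never degraded in this sketch
    if tier == Tier.NON_CORE:
        return "degrade" if load >= 0.7 else "serve"
    # Quasi-core: throttle first, switch off only under heavier overload.
    if load >= 0.9:
        return "degrade"
    if load >= 0.7:
        return "rate_limit"
    return "serve"


for name in SERVICE_TIERS:
    print(name, decide(name, load=0.75))
```

At a load of 0.75 the non‑core service is degraded, the quasi‑core one is throttled, and both core services keep running.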

Caching as a Performance Lever

Caching hot data can dramatically improve read latency, but static vs. dynamic data must be considered. Over‑aggressive caching of dynamic data can cause stale reads; cache expiration times become the “balance point” between performance gain and consistency loss.
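The role of the expiration time as a balance point shows up clearly in a minimal TTL cache. The `TTLCache` class, its `loader` callback and the explicit `now` argument are assumptions made for this sketch; production caches also need eviction and concurrency control, which are left out.

```python
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """Serves a cached value until it is `ttl_s` seconds old, then reloads it."""

    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self._store: Dict[str, Tuple[float, Any]] = {}  # key -> (stored_at, value)

    def get(self, key: str, loader: Callable[[str], Any], now: float) -> Any:
        entry = self._store.get(key)
        if entry is not None and now - entry[0] < self.ttl_s:
            return entry[1]  # fresh enough: may be stale by up to ttl_s seconds
        value = loader(key)  # expired or missing: go back to the source
        self._store[key] = (now, value)
        return value


source_reads = []


def load_price(key: str) -> int:
    """Stand-in for a slow backend read; returns a value that changes on every read."""
    source_reads.append(key)
    return 100 + len(source_reads)


cache = TTLCache(ttl_s=5)
print(cache.get("sku-1", load_price, now=0.0))  # 101, loaded from the source
print(cache.get("sku-1", load_price, now=4.0))  # 101, served from cache
print(cache.get("sku-1", load_price, now=5.0))  # 102, entry expired and reloaded
```

A longer `ttl_s` saves more backend reads but widens the window in which readers can see an outdated value; a shorter one does the opposite.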

Single‑Node vs. Global Rate Limiting

Leaky‑bucket and token‑bucket algorithms are inherently single‑node. Single‑node limiting protects each instance but may cause uneven request rejection across a cluster. Global limiting enforces a cluster‑wide quota, reducing overall request loss but adding coordination overhead and potential performance penalties under high load.

Choosing between them depends on the acceptable trade‑off between precision and overhead.
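The uneven‑rejection effect can be seen in a toy simulation of one time window. The names (`WindowCounter`, `admitted`, `node_a`, `node_b`) and the numbers are made up for illustration, and the shared counter is a plain in‑process object standing in for what would normally be an external store.

```python
from typing import Dict


class WindowCounter:
    """Admits at most `limit` requests; window reset is omitted for brevity."""

    def __init__(self, limit: int):
        self.limit = limit
        self.count = 0

    def allow(self) -> bool:
        if self.count < self.limit:
            self.count += 1
            return True
        return False


def admitted(traffic: Dict[str, int], counters: Dict[str, WindowCounter]) -> int:
    """Total requests admitted when each node consults its assigned counter."""
    return sum(counters[node].allow() for node, n in traffic.items() for _ in range(n))


# Cluster quota of 10 per window, but traffic lands unevenly on two nodes.
traffic = {"node_a": 8, "node_b": 2}

# Single-node limiting: the quota is split statically, 5 per node.
per_node = {"node_a": WindowCounter(5), "node_b": WindowCounter(5)}

# Global limiting: both nodes consult one shared counter holding the full quota.
shared = WindowCounter(10)
global_view = {"node_a": shared, "node_b": shared}

print(admitted(traffic, per_node))     # 7: node_a rejects 3 while node_b has spare quota
print(admitted(traffic, global_view))  # 10: the whole quota is usable
```

In a real deployment every `allow()` call on the shared counter would be a network round trip, which is the coordination overhead described above.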

Takeaways

High concurrency design is a series of informed trade‑offs: improve performance while managing cost, balance availability against consistency, and apply appropriate control mechanisms (rate limiting, frequency limiting, degradation, caching) to keep systems responsive under bursty traffic.

Understanding these principles equips engineers to evolve architectures responsibly rather than chasing a single “magic” solution.

[Figure: Performance chart]
[Figure: Performance chart 2]
[Figure: Queue model illustration]
[Figure: Frequency limiting flowchart]
[Figure: Service importance matrix]
[Figure: Single vs. global limiting comparison]
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
