Why Your System Fails Under Load: Understanding Concurrency, Latency, and QPS
The article explains how confusion between concurrency, latency, and QPS leads to systems that crash under load, using a bank‑counter analogy and a flash‑sale case study to derive a performance formula and practical optimization techniques.
When a friend asked why a system advertised to support 1,000 QPS stalls with only 500 simultaneous users, the root cause turned out to be a mix-up of three core concepts: concurrency, latency, and QPS. The article starts with a simple bank-counter analogy to illustrate how these metrics relate: three service windows (concurrency = 3), each taking 2 minutes per transaction (latency), can serve at most 1.5 customers per minute (throughput).
Precise Definitions
1. Concurrency
Concurrency is the maximum number of requests the system can process at the same time, not the number of arriving requests. It is limited by thread‑pool size, DB connection pool, CPU cores, memory, and third‑party service limits.
2. Latency
Latency is the total time from request entry to response, including network transmission, queue wait, business processing, DB query, and third‑party calls. Key latency metrics are average latency, P95 latency, and P99 latency.
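To see why the average alone can mislead, the following is a minimal sketch that computes average, P95, and P99 from raw samples using the nearest-rank method. The class name `LatencyStats` and the sample values are illustrative assumptions, not taken from the article.

```java
import java.util.Arrays;

final class LatencyStats {
    /** Arithmetic mean of the samples, in milliseconds. */
    static double average(long[] samplesMs) {
        return Arrays.stream(samplesMs).average().orElse(0.0);
    }

    /** Nearest-rank percentile: the smallest sample with at least p% of samples at or below it. */
    static long percentile(long[] samplesMs, int p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Hypothetical samples: 98 fast requests (50 ms) and 2 slow ones.
        long[] samples = new long[100];
        Arrays.fill(samples, 50);
        samples[98] = 2000;
        samples[99] = 3000;
        System.out.println("avg = " + average(samples) + " ms");
        System.out.println("p95 = " + percentile(samples, 95) + " ms");
        System.out.println("p99 = " + percentile(samples, 99) + " ms");
    }
}
```

With these made-up samples the average is 99 ms and P95 is 50 ms, while P99 is 2,000 ms: two slow requests in a hundred are invisible in P95 and only mildly visible in the average.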
3. QPS (Queries Per Second)
QPS measures throughput: how many requests are completed per second. It is often misunderstood. A QPS rating describes sustained throughput, so a high QPS claim does not guarantee the system can handle burst traffic.
Golden Formula
QPS = Concurrency ÷ AverageLatency(seconds)
This reveals two optimization directions: increase concurrency (horizontal scaling) or reduce latency (vertical optimization).
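As a quick check of the formula, here is a small sketch that applies it to the bank-counter analogy from the introduction. The class and method names are hypothetical.

```java
final class ThroughputFormula {
    /** QPS = concurrency / average latency in seconds. */
    static double qps(int concurrency, double avgLatencySeconds) {
        if (avgLatencySeconds <= 0) {
            throw new IllegalArgumentException("latency must be positive");
        }
        return concurrency / avgLatencySeconds;
    }

    public static void main(String[] args) {
        // Bank counter: 3 windows, 2 minutes (120 s) per transaction.
        System.out.println("baseline per minute:  " + qps(3, 120.0) * 60);
        // Direction 1: more concurrency (open 3 more windows).
        System.out.println("6 windows per minute: " + qps(6, 120.0) * 60);
        // Direction 2: less latency (serve each customer in 1 minute).
        System.out.println("1-minute service:     " + qps(3, 60.0) * 60);
    }
}
```

Doubling the windows and halving the service time both double throughput, which is why the two directions are interchangeable on paper even though their costs differ in practice.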
Real‑World Case Study
An e-commerce order API with concurrency = 200 and average latency = 0.8 s yields a theoretical QPS of 250 (200 ÷ 0.8), but actual tests show only 180 QPS. Reasons include queue wait time not counted in the measured latency, hidden bottlenecks such as lock contention or GC pauses, and uneven request distribution during spikes.
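One way to quantify the gap is to rearrange the formula and ask what latency the measured QPS implies. The sketch below does that for the numbers in this case study. It assumes all 200 slots stay busy for the whole test, which the article does not state, and the names are hypothetical.

```java
final class ImpliedLatency {
    /** Rearranged formula: effective latency = concurrency / measured QPS. */
    static double effectiveLatencySeconds(int concurrency, double measuredQps) {
        if (measuredQps <= 0) {
            throw new IllegalArgumentException("QPS must be positive");
        }
        return concurrency / measuredQps;
    }

    public static void main(String[] args) {
        int concurrency = 200;
        double reportedLatency = 0.8; // seconds, as measured inside the service
        double measuredQps = 180;     // from the load test
        double effective = effectiveLatencySeconds(concurrency, measuredQps);
        System.out.printf("effective latency = %.2f s%n", effective);
        System.out.printf("unaccounted time  = %.2f s%n", effective - reportedLatency);
    }
}
```

Under that assumption each request occupies a slot for about 1.11 s rather than 0.8 s, so roughly 0.31 s per request is spent somewhere the latency metric does not see, such as queueing, lock waits, or GC pauses.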
Flash‑Sale Collapse Scenario
Configuration: max concurrency = 100, per-request latency = 500 ms, theoretical QPS = 200 (100 ÷ 0.5). In the ten seconds before the sale, normal traffic runs at 50 QPS. At launch, 5,000 requests flood in; the first 100 take every processing slot immediately and the rest queue up. Draining 5,000 requests at 200 QPS takes 25 seconds even if nothing else arrives, so within seconds the queue holds thousands of requests. As the backlog grows, users experience >30 s latency, repeat clicks generate additional requests, and the system crashes.
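The backlog arithmetic for this scenario can be sketched as follows. This is a simplified model, assuming strict arrival-order processing in waves of 100 and no retries or new arrivals; the class name is hypothetical.

```java
final class FlashSaleBacklog {
    /**
     * Completion time in seconds for the request at 1-based queue position `position`
     * when `slots` requests run in parallel and each takes `latencySeconds`.
     */
    static double completionSeconds(int position, int slots, double latencySeconds) {
        int wave = (position + slots - 1) / slots; // ceil(position / slots)
        return wave * latencySeconds;
    }

    public static void main(String[] args) {
        int slots = 100;
        double latency = 0.5;
        for (int position : new int[] {100, 1000, 5000}) {
            System.out.println("request #" + position + " completes after "
                + completionSeconds(position, slots, latency) + " s");
        }
        // prints:
        // request #100 completes after 0.5 s
        // request #1000 completes after 5.0 s
        // request #5000 completes after 25.0 s
    }
}
```

Even in this best case the last user waits 25 seconds. Every repeated click adds a new entry behind the backlog, which is how observed latency climbs past 30 seconds.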
Root‑Cause Analysis
Burst traffic far exceeds system capacity.
User behavior: latency triggers repeated clicks, worsening the load.
Design flaws: lack of rate‑limiting and degradation mechanisms.
Practical Optimization Strategies
Increase Concurrency
Technical: adjust thread‑pool settings, e.g.
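// Adjust thread pool
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

ThreadPoolExecutor executor = new ThreadPoolExecutor(
    50, // core threads
    200, // max threads; extra threads start only once the queue is full
    60L, TimeUnit.SECONDS, // idle time before threads above core are reclaimed
    new ArrayBlockingQueue<>(1000)); // bounded queue; overflow is rejected
Architecture: micro-service decomposition, asynchronous processing, read/write DB separation.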
Infrastructure: containerized deployment for rapid scaling, load balancers, CDN acceleration.
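A detail that often surprises people when tuning a `ThreadPoolExecutor` is the order in which it uses its resources: core threads first, then the queue, and only then threads above the core size. The sketch below makes that visible with deliberately tiny numbers (2 core, 4 max, queue of 3) so the result is easy to follow; the class name and sizes are illustrative assumptions.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPoolDemo {
    /** Submits `tasks` blocking tasks and returns {poolSize, queued, rejected}. */
    static int[] saturate(int core, int max, int queueCapacity, int tasks) {
        ThreadPoolExecutor executor = new ThreadPoolExecutor(
            core, max, 60L, TimeUnit.SECONDS,
            new ArrayBlockingQueue<>(queueCapacity),
            new ThreadPoolExecutor.AbortPolicy()); // reject when threads and queue are full
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                executor.execute(() -> {
                    try {
                        release.await(); // hold the thread until the snapshot is taken
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++;
            }
        }
        int[] snapshot = {executor.getPoolSize(), executor.getQueue().size(), rejected};
        release.countDown();
        executor.shutdown();
        return snapshot;
    }

    public static void main(String[] args) {
        int[] s = saturate(2, 4, 3, 10);
        System.out.println("threads=" + s[0] + " queued=" + s[1] + " rejected=" + s[2]);
        // prints: threads=4 queued=3 rejected=3
    }
}
```

Applied to the settings above, this means the pool stays at 50 threads until 1,000 tasks are waiting, so raising the maximum alone does not raise concurrency under moderate load. Explicit rejection is also what lets the caller shed load instead of queueing without bound.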
Reduce Latency
Cache hot data with Redis:
@Cacheable(value = "product", key = "#id")
public Product getProduct(Long id) {
    return productRepository.findById(id);
}
Database tuning: index optimization, connection-pool tuning, read/write separation.
Code improvements: eliminate N+1 queries, batch operations, remove dead code.
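The caching idea behind the annotation can be illustrated without any framework. The following is a minimal cache-aside sketch in which an in-memory map stands in for Redis; `CacheAside`, its methods, and the loader are hypothetical names, and a production cache would also need expiry and size limits.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

final class CacheAside<K, V> {
    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;

    CacheAside(Function<K, V> loader) {
        this.loader = loader;
    }

    /** Returns the cached value, invoking the loader only on a miss. */
    V get(K key) {
        return cache.computeIfAbsent(key, loader);
    }

    /** Drops a key so the next read reloads it, e.g. after the record changes. */
    void evict(K key) {
        cache.remove(key);
    }

    public static void main(String[] args) {
        AtomicInteger dbCalls = new AtomicInteger();
        // Hypothetical loader standing in for a slow database lookup.
        CacheAside<Long, String> products = new CacheAside<Long, String>(id -> {
            dbCalls.incrementAndGet();
            return "product-" + id;
        });
        products.get(42L);
        products.get(42L);
        System.out.println("db calls = " + dbCalls.get()); // prints: db calls = 1
    }
}
```

Only the first read pays the database latency; later reads of the same hot key return from memory, which is what lowers the average latency term in the QPS formula.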
Protective Mechanisms
Rate limiting (token bucket):
// Token bucket rate limiter
@RateLimiter(limit = 100, window = 1, timeUnit = TimeUnit.SECONDS)
public Result processOrder(Order order) {
    // business logic
}
Circuit breaker:
// Circuit breaker protection
@CircuitBreaker(name = "payment", fallbackMethod = "fallbackPayment")
public PaymentResult processPayment(PaymentRequest request) {
    // payment logic
}
Graceful degradation: disable non-core features, return default values, switch sync paths to async when needed.
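The `@RateLimiter` annotation in the snippet above is not defined in the article, so the following is only a sketch of what a token bucket behind it might look like, using the same limit of 100 requests per second. The class name, the integer refill arithmetic, and passing the clock in as a parameter are all assumptions made for clarity and testability.

```java
final class TokenBucket {
    private static final long NANOS_PER_SECOND = 1_000_000_000L;

    private final long capacity;        // maximum burst size
    private final long refillPerSecond; // sustained rate
    private long tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, long refillPerSecond, long nowNanos) {
        this.capacity = capacity;
        this.refillPerSecond = refillPerSecond;
        this.tokens = capacity; // start full so an initial burst is allowed
        this.lastRefillNanos = nowNanos;
    }

    /** Takes one token if available; returns false when the caller should be rejected. */
    synchronized boolean tryAcquire(long nowNanos) {
        refill(nowNanos);
        if (tokens > 0) {
            tokens--;
            return true;
        }
        return false;
    }

    private void refill(long nowNanos) {
        long elapsed = nowNanos - lastRefillNanos;
        long fullRefillNanos = capacity * NANOS_PER_SECOND / refillPerSecond;
        if (elapsed >= fullRefillNanos) {
            // Idle long enough to fill the bucket completely.
            tokens = capacity;
            lastRefillNanos = nowNanos;
        } else if (elapsed > 0) {
            long newTokens = elapsed * refillPerSecond / NANOS_PER_SECOND;
            if (newTokens > 0) {
                tokens = Math.min(capacity, tokens + newTokens);
                // Advance only by the time that produced whole tokens; keep the remainder.
                lastRefillNanos += newTokens * NANOS_PER_SECOND / refillPerSecond;
            }
        }
    }

    public static void main(String[] args) {
        TokenBucket bucket = new TokenBucket(100, 100, System.nanoTime());
        int accepted = 0;
        for (int i = 0; i < 5000; i++) {
            if (bucket.tryAcquire(System.nanoTime())) {
                accepted++;
            }
        }
        System.out.println("accepted " + accepted + " of 5000 burst requests");
    }
}
```

In the flash-sale scenario, a limiter like this would admit roughly the first 100 requests of the burst and reject the rest immediately, so rejected users get a fast "try again" response instead of a 30-second wait.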
Performance Evaluation
Stress testing with wrk:
# wrk stress test
wrk -t12 -c100 -d30s --script=post.lua http://api.example.com/orders
Monitor RT, TPS, error rate, resource utilization.
Scenario simulation: normal traffic, burst traffic, abnormal traffic.
Capacity Planning
RequiredConcurrency = ExpectedQPS × AverageLatency(seconds)
Example: Expected QPS = 1,000, average latency = 0.2 s → required concurrency = 200. Apply a safety factor of 1.5-2× for spikes, which means provisioning for 300 to 400 concurrent requests.
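The planning formula is easy to wrap in a small helper. The sketch below uses hypothetical names and takes latency in milliseconds so the base calculation stays in integer arithmetic.

```java
final class CapacityPlan {
    /** RequiredConcurrency = ExpectedQPS × AverageLatency, rounded up to a whole slot. */
    static long requiredConcurrency(long expectedQps, long avgLatencyMillis) {
        return (expectedQps * avgLatencyMillis + 999) / 1000;
    }

    /** Applies a safety factor (e.g. 1.5 or 2.0) and rounds up. */
    static long withSafetyFactor(long baseConcurrency, double safetyFactor) {
        return (long) Math.ceil(baseConcurrency * safetyFactor);
    }

    public static void main(String[] args) {
        long base = requiredConcurrency(1000, 200); // 1,000 QPS at 0.2 s
        System.out.println("base concurrency = " + base);                        // 200
        System.out.println("with 1.5x margin = " + withSafetyFactor(base, 1.5)); // 300
        System.out.println("with 2.0x margin = " + withSafetyFactor(base, 2.0)); // 400
    }
}
```

The result is a count of simultaneously busy slots, so it has to be satisfied by every limiting resource at once: worker threads, database connections, and any downstream quota.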
Common Pitfalls & Solutions
Chasing high QPS without considering system limits → set realistic QPS caps and traffic control.
Ignoring long‑tail latency (P99) → focus on P95/P99, optimize slow queries.
Single‑node mindset → prefer horizontal scaling, service decomposition, and consistency handling.
Final Recommendations
Measure: build comprehensive monitoring.
Analyze: locate true bottlenecks.
Optimize: apply targeted improvements.
Validate: prove gains with data.
Incorporate performance thinking early, maintain a testing pipeline, prioritize user experience over raw metrics, and plan capacity with safety margins.
