System Capacity Design and Evaluation: From Event Planning to QPS Estimation
This article explains how to design and evaluate system capacity by using a sports event example, defining key metrics such as QPS, TPS and concurrency, applying the 80/20 rule, performing stress tests, and calculating required instances for reliable operation.
Background
Every year the organization holds a sports meet with a 2000 m run. Approximately 60 participants (40 men, 20 women) register, but only one rubber track is available, allowing ten runners per race, thus requiring at least six races. Each race, including preparation and cleanup, is assumed to take 30 minutes, resulting in a total of three hours for the event.
One year the 4000 m race was cancelled, which added 50 registrations to the 2000 m run. The original three-hour schedule could no longer hold everyone, and about half of the participants had to be postponed to the following weekend. This illustrates what happens when capacity is not re-estimated after demand changes.
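The arithmetic behind the sports-meet example can be sketched as follows. The function and constant names are illustrative, not from the original article; the numbers are the ones given in the text.

```python
import math

RUNNERS_PER_RACE = 10   # one track, ten runners per race (from the text)
MINUTES_PER_RACE = 30   # includes preparation and cleanup (from the text)


def races_needed(participants: int) -> int:
    """Number of races required, rounding up for a partly filled last race."""
    return math.ceil(participants / RUNNERS_PER_RACE)


def event_hours(participants: int) -> float:
    """Total event duration in hours for the given number of participants."""
    return races_needed(participants) * MINUTES_PER_RACE / 60


normal_year = 60
cancelled_4000m_year = 60 + 50

print(races_needed(normal_year), event_hours(normal_year))                    # 6 3.0
print(races_needed(cancelled_4000m_year), event_hours(cancelled_4000m_year))  # 11 5.5

# Runners who do not fit into the originally planned three-hour slot.
planned_capacity = races_needed(normal_year) * RUNNERS_PER_RACE
overflow = cancelled_4000m_year - planned_capacity
print(overflow)  # 50
```

With 110 runners the event needs 11 races and 5.5 hours, so a schedule sized for 60 runners leaves 50 people without a slot.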
Concept
Capacity design is the technical process of estimating the capacity a system needs, using a range of estimation strategies; it is a core skill for architects.
Capacity design requires concrete data such as data volume, concurrency, bandwidth, user counts, message length, image size, storage, CPU, and memory.
This article uses concurrency as the worked example to demonstrate the analysis process.
Analysis Process
Understanding Key Metrics
TPS (Transactions Per Second) – number of transactions processed each second.
QPS (Queries Per Second) – number of requests processed each second, a common throughput metric.
Concurrency – the number of simultaneous requests the system can handle, reflecting load capacity.
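Peak‑QPS calculation: 1) Principle – 80 % of traffic occurs in 20 % of the time (the peak period). 2) Formula – Peak QPS = (Total PV × 80 %) / (Seconds in the daily window × 20 %). The window is the period in which the traffic arrives: a full day is 86 400 seconds, while the case study later in this article uses a 9‑hour window.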
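Related terms: PV (Page View, one page load), UV (Unique Visitor, one distinct user within the counting period), throughput (requests completed per unit of time), and response time (RT, the time from sending a request to receiving its response). These are linked by QPS = Concurrency / Average Response Time, or equivalently Concurrency = QPS × Average Response Time.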
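The two formulas above can be written as small helper functions. This is a minimal sketch; the function names and the 1 000 000 PV input are hypothetical and chosen only to show the calculation.

```python
def peak_qps(total_pv: int, window_seconds: int,
             traffic_percent: int = 80, time_percent: int = 20) -> float:
    """80/20 rule: traffic_percent of requests arrive in time_percent of the window."""
    return (total_pv * traffic_percent) / (window_seconds * time_percent)


def concurrency(qps: float, avg_response_seconds: float) -> float:
    """Concurrency = QPS x average response time."""
    return qps * avg_response_seconds


def qps_from_concurrency(concurrent_requests: float, avg_response_seconds: float) -> float:
    """QPS = concurrency / average response time."""
    return concurrent_requests / avg_response_seconds


# Hypothetical input: 1 000 000 PV spread over a full day (86 400 s).
peak = peak_qps(1_000_000, 86_400)
print(round(peak, 1))                   # 46.3
print(concurrency(200, 0.5))            # 100.0
print(qps_from_concurrency(100, 0.5))   # 200.0
```

Note that with the default 80/20 split the peak is always four times the plain average (total PV divided by the window), since 80 / 20 = 4.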
When to Evaluate System Capacity
1) Temporary traffic spikes (e.g., 618, Double 11, holiday promotions) where traffic may increase severalfold.
2) Initial system capacity assessment before a new system goes live.
3) Changes in capacity baseline when functionality, data volume, or active users grow, requiring re‑evaluation and scaling.
Evaluation Steps
1. Analyze Daily Total Visits
Collect realistic daily PV/UV numbers from the system or estimate them for new services; product and operations teams should provide expected traffic.
Example: an activity pushes 20 million messages in one hour, with a 10 % click‑through rate, yielding 2 million additional visits.
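The visit estimate in this example is a single multiplication. A minimal sketch, with a hypothetical function name and integer percentages to keep the arithmetic exact:

```python
def expected_visits(messages_pushed: int, click_through_percent: int) -> int:
    """Visits generated by a push campaign at the given click-through rate."""
    return messages_pushed * click_through_percent // 100


extra_visits = expected_visits(20_000_000, 10)
print(extra_visits)  # 2000000
```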
2. Estimate Average QPS
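Assume that most daily traffic arrives during roughly 11 active hours (11 × 3600 = 39 600 seconds, rounded to 40 000). Average QPS = Total Visits / 40 000.
Example: the campaign's 2 million visits arrive within one hour, so average QPS = 2 000 000 / 3600 ≈ 555.6. For a mature site with 50 million daily PV (the source uses Baidu as the illustration), average QPS = 50 000 000 / 40 000 = 1250.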
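Both averages in this step can be reproduced with one function. This is a sketch; the names are illustrative, and the 40 000-second default is the article's rounded 11-hour window.

```python
ACTIVE_SECONDS_PER_DAY = 40_000  # about 11 hours, as assumed in the text


def average_qps(total_visits: int, window_seconds: int = ACTIVE_SECONDS_PER_DAY) -> float:
    """Average requests per second over the window in which the visits arrive."""
    return total_visits / window_seconds


campaign_qps = average_qps(2_000_000, window_seconds=3600)  # visits arrive in one hour
daily_qps = average_qps(50_000_000)                         # visits spread over the active day

print(round(campaign_qps, 1))  # 555.6
print(daily_qps)               # 1250.0
```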
3. Estimate Peak‑Period QPS
Consider both traffic‑curve analysis and the 80/20 rule.
Using a sample cloud system with average QPS = 2900, the peak QPS is about 2.58 times the average: 2900 × 2.58 = 7482 QPS.
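The two approaches named in this step can be compared side by side. In the sketch below the sample curve is hypothetical and only shows how a peak-to-average ratio could be read from monitoring data; the 2900 and 2.58 figures come from the text.

```python
from typing import Sequence


def peak_to_average_ratio(qps_samples: Sequence[float]) -> float:
    """Ratio of the highest QPS sample to the mean of all samples."""
    average = sum(qps_samples) / len(qps_samples)
    return max(qps_samples) / average


def peak_from_ratio(average_qps: float, ratio: float) -> float:
    """Peak QPS given an average and a peak-to-average ratio."""
    return average_qps * ratio


# Hypothetical monitoring samples, not taken from the article.
hypothetical_curve = [1000, 2000, 3000, 6000]
print(peak_to_average_ratio(hypothetical_curve))  # 2.0

# Traffic-curve estimate from the text: ratio 2.58 on an average of 2900 QPS.
curve_based_peak = peak_from_ratio(2900, 2.58)
print(round(curve_based_peak))  # 7482

# 80/20 rule estimate: 80 % of traffic in 20 % of the time means a ratio of 4.
rule_based_peak = peak_from_ratio(2900, 4)
print(rule_based_peak)  # 11600
```

The 80/20 rule gives the more conservative figure here, which is one reason to check it against a measured traffic curve when one is available.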
4. Determine Single‑Instance QPS Limit
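Run load tests (e.g., with nGrinder or JMeter) to find the highest QPS a single instance can sustain. The team treats a response time above 2 s as a bottleneck and targets 1 s or less, so the usable limit is the QPS at which response time still meets the target, not the raw maximum.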
In the example, a Tomcat instance supports 2500 QPS, adjusted to 2000 QPS for safety.
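One way to turn load-test output into a single-instance limit is sketched below. The result tuples are hypothetical, and the 80 % safety factor is an assumption inferred from the 2500 to 2000 adjustment in the text; the article does not state how the margin was chosen.

```python
from typing import List, Tuple

# Hypothetical load-test results: (offered QPS, average response time in seconds).
LOAD_TEST_RESULTS: List[Tuple[int, float]] = [
    (1000, 0.2),
    (2000, 0.5),
    (2500, 1.0),
    (3000, 2.4),  # above the 2 s bottleneck threshold
]


def single_instance_limit(results: List[Tuple[int, float]],
                          max_response_seconds: float = 1.0,
                          safety_percent: int = 80) -> int:
    """Highest tested QPS that meets the response-time target, reduced by a safety margin."""
    passing = [qps for qps, response_time in results if response_time <= max_response_seconds]
    if not passing:
        raise ValueError("no test run met the response-time target")
    return max(passing) * safety_percent // 100


print(single_instance_limit(LOAD_TEST_RESULTS))  # 2000
```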
5. Confirm Required Instances Based on Redundancy
With peak QPS ≈ 7500 and each instance handling 2000 QPS, 7500 / 2000 = 3.75, so at least four web instances are needed (rounding up), before adding any spare instances for redundancy.
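The instance count is a ceiling division. In this sketch the `spare_instances` parameter is a hypothetical addition for redundancy; the article does not give a specific spare count.

```python
import math


def instances_needed(peak_qps: float, per_instance_qps: float, spare_instances: int = 0) -> int:
    """Minimum instances to carry the peak, plus optional spares for redundancy."""
    return math.ceil(peak_qps / per_instance_qps) + spare_instances


print(instances_needed(7500, 2000))                     # 4
print(instances_needed(7500, 2000, spare_instances=1))  # 5
```

Four instances provide 8000 QPS against a 7500 QPS peak, so losing a single instance would drop capacity below the peak; that is the case a spare instance covers.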
Other resources (cache, DB) are sized in proportion to the share of requests they serve (e.g., 90 % handled by the cache and 10 % reaching the DB).
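Assuming the 90 % / 10 % figures describe the share of peak requests served by the cache and by the database (an interpretation, since the text does not define them further), the load on each tier could be estimated like this. The function name is hypothetical.

```python
from typing import Dict


def split_load(peak_qps: int, cache_percent: int = 90) -> Dict[str, int]:
    """Split peak QPS between cache and database by the cache's share of requests."""
    cache_qps = peak_qps * cache_percent // 100
    return {"cache": cache_qps, "db": peak_qps - cache_qps}


print(split_load(7500))  # {'cache': 6750, 'db': 750}
```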
Case Study – Book Reservation System
Applying the 80/20 rule over a 9‑hour window (32 400 seconds) with total PV = 1 500 000 gives peak QPS = (1 500 000 × 80 %) / (32 400 × 20 %) ≈ 185. With an average response time of 0.5 s, Concurrency = QPS × Average Response Time = 185 × 0.5 = 92.5, rounded up to 100. This figure is then adjusted to 200 using the 343 estimation method.
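The case-study numbers can be checked step by step. This is a sketch using the figures from the text; the variable names are illustrative, and rounding up to the next hundred is an assumption that matches the article's "rounded to 100".

```python
import math

TOTAL_PV = 1_500_000
WINDOW_SECONDS = 9 * 3600          # 9-hour window = 32 400 s
AVG_RESPONSE_SECONDS = 0.5

# 80/20 rule: 80 % of PV arrives in 20 % of the window.
peak_qps = (TOTAL_PV * 80) / (WINDOW_SECONDS * 20)
print(round(peak_qps))  # 185

# Concurrency = QPS x average response time, using the rounded QPS as the text does.
concurrent_requests = round(peak_qps) * AVG_RESPONSE_SECONDS
print(concurrent_requests)  # 92.5

# Round up to the next hundred for a working figure.
rounded_concurrency = math.ceil(concurrent_requests / 100) * 100
print(rounded_concurrency)  # 100
```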
The 343 method weights three capacity estimates: pessimistic (30 % weight, 80 concurrent users), normal (40 % weight, 100 concurrent users), and optimistic (30 % weight, 300 concurrent users). The final recommendation for performance testing is to support > 200 concurrent users with response times of 50‑100 ms.
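Assuming the 343 method is a weighted average of the three estimates with weights of 30 %, 40 %, and 30 %, the calculation could look like this. The source does not spell out how the final figure of 200 is reached, so the sketch only computes the weighted value.

```python
from typing import List, Tuple

# (weight in percent, estimated concurrent users), from the text.
ESTIMATES: List[Tuple[int, int]] = [
    (30, 80),    # pessimistic
    (40, 100),   # normal
    (30, 300),   # optimistic
]


def weighted_estimate(estimates: List[Tuple[int, int]]) -> float:
    """Weighted average of the estimates; weights are percentages that sum to 100."""
    total_weight = sum(weight for weight, _ in estimates)
    if total_weight != 100:
        raise ValueError("weights must sum to 100 percent")
    return sum(weight * value for weight, value in estimates) / 100


print(weighted_estimate(ESTIMATES))  # 154.0
```

Under this reading the weighted figure is 154 concurrent users, and the article's target of more than 200 sits above it, which would be consistent with rounding up for headroom.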
Summary
System capacity should be evaluated in three scenarios: temporary traffic spikes, initial system launch, and baseline growth.
The evaluation steps are: 1) Analyze daily total visits, 2) Estimate average QPS, 3) Estimate peak‑period QPS (traffic curve or 80/20 rule), 4) Perform performance stress testing, 5) Adjust based on redundancy and actual limits.
The initial sports‑meet example demonstrates that early capacity re‑assessment could have prevented scheduling conflicts.
For further reading, see the original article by 翁智华 at https://www.cnblogs.com/wzh2010/p/14454954.html.