System Capacity Checklist: Key Metrics Every Architect Should Track
Just as pilots run a checklist before takeoff, architects should verify system capacity before traffic arrives. This checklist covers resource usage across services, databases, and queues, plus the business metrics and state‑machine indicators that reveal bottlenecks, help prioritize optimizations, and guide scaling for traffic spikes.
Resource Usage
Service Instances
Number of instances, worker threads per instance, MQ consumer group threads
Peak QPS
Interface response times: average, 95th percentile, 99th percentile, max
Peak CPU usage
Error count per second
Peak JVM heap usage
GC pause time
Disk usage (if applicable; service instances usually have no local storage)
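The response-time item in the list above (average, 95th percentile, 99th percentile, max) can be computed from raw samples. The following is a minimal sketch, assuming latencies are collected as millisecond values and that the nearest-rank definition of a percentile is acceptable. The class `LatencySummary` and its method names are hypothetical, not part of any library.

```java
import java.util.Arrays;

/** Hypothetical helper that summarizes response-time samples given in milliseconds. */
final class LatencySummary {
    final double average;
    final long p95;
    final long p99;
    final long max;

    private LatencySummary(double average, long p95, long p99, long max) {
        this.average = average;
        this.p95 = p95;
        this.p99 = p99;
        this.max = max;
    }

    /** Nearest-rank percentile over an ascending-sorted, non-empty array. */
    static long percentile(long[] sorted, double pct) {
        if (sorted.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        // rank = ceil(pct * n / 100), clamped to the range [1, n]
        int rank = (int) Math.ceil(pct * sorted.length / 100.0);
        rank = Math.min(Math.max(rank, 1), sorted.length);
        return sorted[rank - 1];
    }

    /** Builds a summary from unsorted samples; the input array is not modified. */
    static LatencySummary of(long[] samplesMs) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        long sum = 0;
        for (long sample : sorted) {
            sum += sample;
        }
        double average = (double) sum / sorted.length;
        return new LatencySummary(
                average,
                percentile(sorted, 95),
                percentile(sorted, 99),
                sorted[sorted.length - 1]);
    }

    public static void main(String[] args) {
        long[] samples = {12, 15, 11, 240, 14, 13, 18, 16, 17, 19};
        LatencySummary s = LatencySummary.of(samples);
        System.out.printf("avg=%.1f p95=%d p99=%d max=%d%n", s.average, s.p95, s.p99, s.max);
    }
}
```

Tracking the tail percentiles next to the average matters because a single slow outlier, like the 240 ms sample in `main`, barely moves the average in a large sample set but dominates what the slowest users experience.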
MySQL
Number of shards, tables, replicas, routing rules
Peak QPS, TPS, read/write ratio
Peak CPU usage
Disk usage
Hotspots or data skew
Total row count
Master‑slave replication lag (ms)
Slow query count per second
Long transaction count per second
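One way to make the "hotspots or data skew" item measurable is to compare the largest shard against the mean. The sketch below assumes per-shard row counts have already been collected by some means. The class `ShardSkew`, its methods, and the shard names are hypothetical, and the skew ratio is one simple indicator among several possible ones.

```java
import java.util.Map;

/** Hypothetical helper that quantifies data skew from per-shard row counts. */
final class ShardSkew {

    /** Ratio of the largest shard's row count to the mean row count; 1.0 means perfectly even. */
    static double skewRatio(Map<String, Long> rowsPerShard) {
        if (rowsPerShard.isEmpty()) {
            throw new IllegalArgumentException("no shards");
        }
        long max = 0;
        long total = 0;
        for (long rows : rowsPerShard.values()) {
            max = Math.max(max, rows);
            total += rows;
        }
        if (total == 0) {
            return 1.0; // all shards empty: treat as even
        }
        double mean = (double) total / rowsPerShard.size();
        return max / mean;
    }

    /** Name of the shard holding the most rows. */
    static String hottestShard(Map<String, Long> rowsPerShard) {
        if (rowsPerShard.isEmpty()) {
            throw new IllegalArgumentException("no shards");
        }
        String hottest = null;
        long max = -1;
        for (Map.Entry<String, Long> entry : rowsPerShard.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                hottest = entry.getKey();
            }
        }
        return hottest;
    }

    public static void main(String[] args) {
        // Illustrative numbers only.
        Map<String, Long> rows = Map.of("shard_0", 100L, "shard_1", 100L, "shard_2", 400L);
        System.out.println(hottestShard(rows) + " skew ratio: " + skewRatio(rows));
    }
}
```

The same calculation applies to per-shard QPS or disk usage. A ratio that keeps growing over time suggests the routing rules deserve a second look.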
Redis
Instance count, cluster mode
Peak QPS, TPS, read/write ratio
Peak CPU usage
Peak memory usage
Total key count
Presence of hotspot instances or keys
HBase
Instance count, region count
Peak CPU usage
Disk usage
Total row count
Compaction time windows
Hotspot instances or data skew
Elasticsearch
Instance count, shard count, routing rules
Document count
Peak CPU usage
Disk usage
Hotspot instances or data skew
Message Queue
Instance count, partition count
Peak message TPS
Consumer backlog size
Peak CPU usage
Message retention period
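Consumer backlog size becomes more actionable when combined with throughput: it indicates how long consumers need to catch up. The sketch below assumes an offset-based, partitioned queue where each partition exposes a latest offset and a committed consumer offset. That model is an assumption, not something the checklist specifies. The class `ConsumerBacklog` and its methods are hypothetical and call no real client library.

```java
import java.util.Map;

/** Hypothetical helper that estimates consumer backlog and the time needed to drain it. */
final class ConsumerBacklog {

    /** Sum over partitions of (latest offset - committed offset), never negative per partition. */
    static long totalBacklog(Map<Integer, Long> latestOffsets, Map<Integer, Long> committedOffsets) {
        long backlog = 0;
        for (Map.Entry<Integer, Long> entry : latestOffsets.entrySet()) {
            // Assumption: a partition with no committed offset has consumed nothing yet.
            long committed = committedOffsets.getOrDefault(entry.getKey(), 0L);
            backlog += Math.max(0L, entry.getValue() - committed);
        }
        return backlog;
    }

    /**
     * Seconds needed to clear the backlog at the given rates (messages per second).
     * Returns positive infinity when consumers are not outpacing producers.
     */
    static double secondsToDrain(long backlog, double consumeRatePerSec, double produceRatePerSec) {
        if (backlog <= 0) {
            return 0.0;
        }
        double netRate = consumeRatePerSec - produceRatePerSec;
        return netRate <= 0 ? Double.POSITIVE_INFINITY : backlog / netRate;
    }

    public static void main(String[] args) {
        // Illustrative numbers only.
        Map<Integer, Long> latest = Map.of(0, 1000L, 1, 500L);
        Map<Integer, Long> committed = Map.of(0, 900L, 1, 500L);
        long backlog = totalBacklog(latest, committed);
        System.out.println("backlog=" + backlog
                + " drainSeconds=" + secondsToDrain(backlog, 60.0, 10.0));
    }
}
```

Comparing the drain time against the message retention period is a useful sanity check: if the backlog cannot be cleared before messages expire, data will be lost regardless of how healthy the CPU numbers look.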
Business Metrics
Core Process Metrics
Define metrics according to your system, such as success rate, failure rate, counts, durations, participants, and monetary values.
State Machine Flow
The diagram shows a state machine with four properties: transitions are restricted (e.g., 1→2 is allowed but 2→1 is not), some states can only be entered from specific predecessor states, items should not linger in intermediate states, and the terminal states are success (5) and failure (6).
Key business indicators to monitor include the number of items in each state, the overall success rate, the total process duration, and the number of items stuck in intermediate states along with how long they have been stuck.
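The state-machine rules and indicators above can be sketched in code. Only the 1→2 rule and the terminal states 5 and 6 come from the text. The rest of the transition table, the class `StateMachineMetrics`, the `Item` type, and the choice to compute success rate over finished items only are assumptions made for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of state-machine checks and indicators. */
final class StateMachineMetrics {
    // Terminal states from the text: 5 = success, 6 = failure.
    static final int SUCCESS = 5;
    static final int FAILURE = 6;

    // Illustrative transition table. Only "1 -> 2 allowed, 2 -> 1 not" is from the text;
    // every other entry is an assumption.
    static final Map<Integer, Set<Integer>> ALLOWED = Map.of(
            1, Set.of(2),
            2, Set.of(3, FAILURE),
            3, Set.of(4, FAILURE),
            4, Set.of(SUCCESS, FAILURE));

    /** A tracked item: its current state and when it entered that state. */
    static final class Item {
        final int state;
        final Instant enteredAt;

        Item(int state, Instant enteredAt) {
            this.state = state;
            this.enteredAt = enteredAt;
        }
    }

    static boolean isAllowed(int from, int to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }

    static boolean isTerminal(int state) {
        return state == SUCCESS || state == FAILURE;
    }

    /** Number of items that have stayed in an intermediate state longer than the limit. */
    static long countStuck(List<Item> items, Instant now, Duration limit) {
        return items.stream()
                .filter(item -> !isTerminal(item.state))
                .filter(item -> Duration.between(item.enteredAt, now).compareTo(limit) > 0)
                .count();
    }

    /** Share of finished items that ended in success; 0.0 when nothing has finished yet. */
    static double successRate(List<Item> items) {
        long finished = items.stream().filter(item -> isTerminal(item.state)).count();
        long succeeded = items.stream().filter(item -> item.state == SUCCESS).count();
        return finished == 0 ? 0.0 : (double) succeeded / finished;
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        List<Item> items = List.of(
                new Item(SUCCESS, now.minus(Duration.ofMinutes(30))),
                new Item(FAILURE, now.minus(Duration.ofMinutes(20))),
                new Item(3, now.minus(Duration.ofMinutes(10))));
        System.out.println("2->1 allowed: " + isAllowed(2, 1));
        System.out.println("stuck: " + countStuck(items, now, Duration.ofMinutes(5)));
        System.out.println("success rate: " + successRate(items));
    }
}
```

Rejecting disallowed transitions at write time and alerting on the stuck count are complementary: the first keeps the data consistent, and the second surfaces downstream services that have stopped making progress.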
Java Baker
Java architect and Raspberry Pi enthusiast, dedicated to writing high-quality technical articles; the same name is used across major platforms.
