
System Capacity Checklist: Key Metrics Every Architect Should Track

Architects should treat system capacity like a pre‑flight checklist. This guide lists the resource metrics to monitor across services, databases, and queues, along with the business metrics and state‑machine indicators that reveal bottlenecks and guide scaling decisions.

Java Baker

Just as pilots run a checklist before takeoff, architects should review system capacity to identify bottlenecks, prioritize optimizations, and plan scaling for traffic spikes.

Resource Usage

Service Instances

Number of instances, worker threads per instance, MQ consumer group threads

Peak QPS

Interface response times: average, 95th percentile, 99th percentile, max

Peak CPU usage

Error count per second

Peak JVM heap usage

GC pause time

Disk usage (if applicable; service instances usually have no local storage)
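
The response-time item above asks for average, 95th percentile, 99th percentile, and max because the average alone can hide a slow tail. The following is a minimal sketch of that summary, assuming latency samples are already collected in milliseconds; the function names and the sample data are hypothetical, and the nearest-rank method is only one of several percentile definitions.

```python
import math


def percentile(sorted_samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not sorted_samples:
        raise ValueError("no samples")
    # p * n is an exact integer product, so the division stays precise for whole-number p.
    rank = math.ceil(p * len(sorted_samples) / 100)
    return sorted_samples[max(rank, 1) - 1]


def latency_summary(samples_ms):
    """Return the four response-time figures from the checklist."""
    ordered = sorted(samples_ms)
    return {
        "avg": sum(ordered) / len(ordered),
        "p95": percentile(ordered, 95),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
    }


# Hypothetical samples: 98 fast requests and 2 slow ones.
samples = [10] * 98 + [500, 2000]
summary = latency_summary(samples)
print(summary)  # {'avg': 34.8, 'p95': 10, 'p99': 500, 'max': 2000}
```

In this made-up data set the average (34.8 ms) and the 95th percentile (10 ms) both look healthy, while the 99th percentile and max expose the slow requests.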

MySQL

Number of shards, tables, replicas, routing rules

Peak QPS, TPS, read/write ratio

Peak CPU usage

Disk usage

Hotspots or data skew

Total row count

Master‑slave replication/synchronization lag (ms)

Slow query count per second

Long transaction count per second
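
One rough way to quantify the "hotspots or data skew" item above is to compare each shard's load with the mean across shards. The sketch below assumes you already have a per-shard figure (row count, QPS, or disk usage); the shard names, the numbers, and the 1.5x threshold are illustrative assumptions, not values from the article.

```python
def skew_ratio(load_by_shard):
    """Busiest shard's load divided by the mean load; 1.0 means perfectly even."""
    loads = list(load_by_shard.values())
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


def hot_shards(load_by_shard, threshold=1.5):
    """Shards whose load exceeds threshold times the mean (threshold is an assumed cutoff)."""
    mean = sum(load_by_shard.values()) / len(load_by_shard)
    return sorted(name for name, load in load_by_shard.items() if load > threshold * mean)


# Hypothetical row counts per shard.
rows = {
    "shard_0": 1_000_000,
    "shard_1": 1_100_000,
    "shard_2": 900_000,
    "shard_3": 3_000_000,
}
print(skew_ratio(rows))  # 2.0
print(hot_shards(rows))  # ['shard_3']
```

The same calculation applies to any of the sharded stores in this checklist, as long as the load figure is comparable across shards.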

Redis

Instance count, cluster mode

Peak QPS, TPS, read/write ratio

Peak CPU usage

Peak memory usage

Total key count

Presence of hotspot instances or keys
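
For hotspot keys, a simple starting point is to sample the keys your application accesses and look at each key's share of the traffic. This is a minimal sketch under that assumption; how the sample is gathered is out of scope here, and the key names and the 20% threshold are hypothetical.

```python
from collections import Counter


def hot_keys(sampled_keys, share_threshold=0.2):
    """Return (key, share) pairs for keys at or above the assumed share of sampled accesses."""
    counts = Counter(sampled_keys)
    total = sum(counts.values())
    return [
        (key, n / total)
        for key, n in counts.most_common()  # most frequently accessed keys first
        if n / total >= share_threshold
    ]


# Hypothetical access sample of 10 requests.
sample = ["user:1"] * 6 + ["user:2"] * 3 + ["cart:9"]
result = hot_keys(sample)
print([key for key, _ in result])  # ['user:1', 'user:2']
```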

HBase

Instance count, region count

Peak CPU usage

Disk usage

Total row count

Compaction time windows

Hotspot instances or data skew

Elasticsearch

Instance count, shard count, routing rules

Document count

Peak CPU usage

Disk usage

Hotspot instances or data skew

Message Queue

Instance count, partition count

Peak message TPS

Consumer backlog size

Peak CPU usage

Message retention period
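
Consumer backlog size is easier to act on when it is converted into time. The sketch below estimates how long consumers need to clear a backlog, assuming roughly constant produce and consume rates; the function name and the example numbers are hypothetical.

```python
def drain_seconds(backlog, produce_rate, consume_rate):
    """Seconds until the backlog reaches zero, or None if consumers cannot catch up.

    backlog is a message count; both rates are messages per second.
    """
    if backlog <= 0:
        return 0.0
    net_rate = consume_rate - produce_rate
    if net_rate <= 0:
        return None  # the backlog keeps growing or never shrinks
    return backlog / net_rate


# Hypothetical figures: 120,000 messages behind, consumers slightly faster than producers.
print(drain_seconds(120_000, produce_rate=1_500, consume_rate=2_000))  # 240.0
# Producers outpace consumers, so the backlog never drains.
print(drain_seconds(120_000, produce_rate=2_000, consume_rate=1_500))  # None
```

Comparing the result with the message retention period gives a rough sense of whether messages could expire before they are consumed.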

Business Metrics

Core Process Metrics

Define metrics according to your system, such as success rate, failure rate, counts, durations, participants, and monetary values.

State Machine Flow

State machine flow diagram

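The diagram shows a state machine with these constraints: transitions run in one direction only (e.g., 1→2 is allowed, 2→1 is not); some states can be entered only from specific preceding states; items should not linger in intermediate states; and every item ends in one of two final states, success (5) or failure (6).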

Key business indicators to monitor include the count of each state, overall success rate, total process duration, and the number and duration of items stuck in intermediate states.
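
The indicators above can be derived from a list of items and their current states. The sketch below is illustrative only: the diagram is not reproduced in the text, so the transition table is an assumption (only 1→2 being allowed, 2→1 being disallowed, and 5 and 6 being final come from the article), and the item format and stuck threshold are hypothetical.

```python
from collections import Counter

# ASSUMPTION: illustrative transition table, not the article's actual diagram.
ALLOWED = {1: {2, 6}, 2: {3, 6}, 3: {4, 6}, 4: {5, 6}, 5: set(), 6: set()}
SUCCESS, FAILURE = 5, 6


def can_transition(src, dst):
    """True if the assumed table allows moving from src to dst."""
    return dst in ALLOWED.get(src, set())


def summarize(items, now, stuck_after_seconds):
    """items is a list of (state, entered_state_at) pairs, with times in epoch seconds."""
    counts = Counter(state for state, _ in items)
    finished = counts[SUCCESS] + counts[FAILURE]
    success_rate = counts[SUCCESS] / finished if finished else None
    # Items in a non-final state for longer than the threshold, with how long they have waited.
    stuck = [
        (state, now - entered)
        for state, entered in items
        if state not in (SUCCESS, FAILURE) and now - entered > stuck_after_seconds
    ]
    return counts, success_rate, stuck


# Hypothetical snapshot: three successes, one failure, two items still in flight.
items = [(5, 0), (5, 10), (5, 20), (6, 30), (3, 100), (2, 990)]
counts, rate, stuck = summarize(items, now=1000, stuck_after_seconds=600)
print(rate)   # 0.75
print(stuck)  # [(3, 900)]
print(can_transition(1, 2), can_transition(2, 1))  # True False
```

Here the success rate is computed over finished items only; whether in-flight items belong in the denominator is a definition each team should settle for its own process.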

