Building a Full Performance Engineering Loop with Spring Boot, SkyWalking, and Prometheus
This guide walks through constructing a sustainable performance‑engineering pipeline—from monitoring and metrics collection with SkyWalking, Prometheus, and Grafana, through targeted load testing and bottleneck analysis, to capacity modeling and alert solidification—for Spring Boot services.
Performance‑Engineering Loop Overview
True performance engineering goes beyond a single load‑test run; it establishes a continuous loop that collects monitoring data, discovers bottlenecks, runs targeted stress tests, validates optimizations, evaluates capacity, solidifies alerts, and repeats.
Monitoring → Bottleneck Insight → Targeted Load Test → Optimization Verification → Capacity Evaluation → Alert Solidification → Continuous Regression
Overall Architecture
The loop combines four core tools:
SkyWalking – provides distributed tracing (“view the call chain”).
Prometheus – gathers resource and business metrics (“view resources & business indicators”).
Grafana – visualizes all facts in a unified dashboard.
Load‑testing tool – creates the problems to be investigated.
Traces flow from the application through the SkyWalking agent to the OAP (backed by Elasticsearch), while Prometheus scrapes Micrometer metrics that Grafana visualizes alongside the SkyWalking data.
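This flow can be sketched as follows (assembled from the component list; ports match the Docker Compose file in section 1):

```
                 traces (gRPC, 11800)
Spring Boot app ----------------------> SkyWalking OAP ----> Elasticsearch
      |                                       ^
      | /actuator/prometheus scrape           | queries (12800)
      v                                       |
  Prometheus ----> Grafana              SkyWalking UI

  Load-testing tool ----> Spring Boot app  (drives the traffic under study)
```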
1. Monitoring Stack Deployment
Deploy Prometheus, Grafana, SkyWalking OAP, SkyWalking UI, and Elasticsearch with Docker Compose:
```yaml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
  skywalking-oap:
    image: apache/skywalking-oap-server
    depends_on:
      - elasticsearch
    environment:
      - SW_STORAGE=elasticsearch
      # Point the storage module at the elasticsearch service below.
      - SW_STORAGE_ES_CLUSTER_NODES=elasticsearch:9200
    ports:
      - "11800:11800"   # gRPC agent reporting
      - "12800:12800"   # REST queries from the UI
  skywalking-ui:
    image: apache/skywalking-ui
    environment:
      - SW_OAP_ADDRESS=http://skywalking-oap:12800
    ports:
      # Host port 8081 so the UI does not clash with the Spring Boot app on 8080.
      - "8081:8080"
  elasticsearch:
    image: elasticsearch:7.10.0
    environment:
      - discovery.type=single-node
```

2. Spring Boot Integration
SkyWalking Agent
```bash
# Name the service and point the agent at the OAP gRPC port (11800).
java -javaagent:/opt/skywalking/agent/skywalking-agent.jar \
  -Dskywalking.agent.service_name=order-service \
  -Dskywalking.collector.backend_service=localhost:11800 \
  -jar app.jar
```

Expose Prometheus Metrics via Micrometer
```xml
<dependency>
    <groupId>io.micrometer</groupId>
    <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
```

```yaml
management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,info
  metrics:
    export:
      prometheus:
        enabled: true
```

Metrics are available at http://localhost:8080/actuator/prometheus.
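One detail worth noting: the P99 query used later reads from http_server_requests_seconds_bucket series, which Micrometer only publishes when percentile histograms are enabled for that metric. A sketch, assuming Spring Boot 2.x property names:

```yaml
management:
  metrics:
    distribution:
      percentiles-histogram:
        "[http.server.requests]": true
```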
3. Prometheus Scrape Configuration
```yaml
global:
  scrape_interval: 5s
scrape_configs:
  - job_name: 'springboot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      - targets: ['host.docker.internal:8080']
  - job_name: 'skywalking'
    static_configs:
      - targets: ['skywalking-oap:1234']
```

4. Grafana Core Metrics
Key business and system indicators:
QPS: rate(http_server_requests_seconds_count[1m])
P99 latency: histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le))
Error rate: sum(rate(http_server_requests_seconds_count{status=~"5.."}[1m])) / sum(rate(http_server_requests_seconds_count[1m]))
CPU usage: process_cpu_usage
JVM heap: jvm_memory_used_bytes{area="heap"}
GC pauses: rate(jvm_gc_pause_seconds_count[1m])

5. Monitoring‑Driven Load Testing
The goal of a load test is to verify a bottleneck hypothesis rather than simply “run as many requests as possible.” Example hypotheses derived from observed metrics:
P99 is high → suspect code hotspot.
CPU low but QPS stalls → thread‑pool saturation.
DB connection pool full → database bottleneck.
GC spikes → memory pressure.
k6 Load‑Test Script Example
```javascript
import http from 'k6/http';
import { sleep } from 'k6';

export let options = {
  stages: [
    { duration: '1m', target: 50 },
    { duration: '2m', target: 200 },
    { duration: '2m', target: 500 },
  ],
};

export default function () {
  http.get('http://localhost:8080/api/order/create');
  sleep(1);
}
```

6. Standardized Bottleneck Attribution Matrix
Typical observations, SkyWalking clues, Prometheus signals, and conclusions:
| Observation | SkyWalking clue | Prometheus signal | Conclusion |
| --- | --- | --- | --- |
| QPS not increasing | Trace normal | Thread pool active = max | Concurrency model bottleneck |
| Latency spikes | One span node slow | CPU low | Code hotspot |
| Error rate rises | Downstream call failures | DB connection pool full | Database bottleneck |
| Latency jitter | Trace durations dispersed | GC count high | JVM memory issue |
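The first row of the matrix, a thread pool whose active count equals its maximum, can be reproduced in miniature with a plain JDK executor (pool and request counts are illustrative):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class PoolSaturationDemo {

    // Submit `requests` quick-fire tasks into a small bounded pool and
    // count how many are rejected once workers and queue are both full.
    static int rejectedAfterBurst(int poolSize, int queueSize, int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize));
        int rejected = 0;
        for (int i = 0; i < requests; i++) {
            try {
                // Each "request" just waits, like a slow downstream call.
                pool.execute(() -> {
                    try { Thread.sleep(500); } catch (InterruptedException ignored) { }
                });
            } catch (RejectedExecutionException e) {
                rejected++;
            }
        }
        pool.shutdownNow();
        return rejected;
    }

    public static void main(String[] args) {
        // 2 workers + queue of 2: of 10 instant submits, 6 are rejected,
        // so throughput stalls even though the CPU is idle (tasks only sleep).
        System.out.println("rejected=" + rejectedAfterBurst(2, 2, 10));
    }
}
```

In production the same symptom shows up as the server's worker pool pinned at its maximum while process_cpu_usage stays low.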
7. Capacity Evaluation Model
Define “single‑node safe capacity” as the maximum QPS that simultaneously satisfies:
P99 < SLA (e.g., 500 ms)
Error rate < 0.1 %
CPU < 70 %
DB connection pool < 70 %
Thread‑pool queue empty
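The same thresholds can later be solidified as Prometheus alerting rules; a sketch reusing the PromQL from section 4 (group and alert names are illustrative, and the values would be tuned in step 9 of the SOP):

```yaml
groups:
  - name: capacity-slo
    rules:
      - alert: P99LatencyAboveSLA
        expr: histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le)) > 0.5
        for: 5m
      - alert: ErrorRateHigh
        expr: >
          sum(rate(http_server_requests_seconds_count{status=~"5.."}[1m]))
          / sum(rate(http_server_requests_seconds_count[1m])) > 0.001
        for: 5m
      - alert: CpuSaturation
        expr: process_cpu_usage > 0.7
        for: 10m
```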
Example numbers: safe QPS per node = 1800, replicas = 4.
Total capacity ≈ 1800 × 4 × 0.8 ≈ 5760 QPS, where 0.8 is a redundancy/failure‑buffer factor.
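The arithmetic above can be wrapped in a tiny helper so the capacity model stays reproducible as the inputs change (a sketch; names are illustrative):

```java
public class CapacityModel {

    // Safe cluster capacity = per-node safe QPS x replicas x buffer factor.
    // The buffer (< 1.0) reserves headroom for node failure and bursts.
    static double totalCapacity(double safeQpsPerNode, int replicas, double bufferFactor) {
        return safeQpsPerNode * replicas * bufferFactor;
    }

    public static void main(String[] args) {
        // Numbers from the example: 1800 QPS/node, 4 replicas, 0.8 buffer.
        System.out.printf("total capacity = %.0f QPS%n", totalCapacity(1800, 4, 0.8));
    }
}
```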
8. Test‑Scenario Asset Library
baseline‑qps – establish performance baseline.
hotspot‑api – stress hotspot endpoints.
order‑link – core transaction path.
peak‑traffic – evaluate peak capacity.
soak‑test – long‑duration stability (1‑2 h).
9. Standard Operating Procedure (SOP)
1. Detect anomaly via monitoring
2. Propose bottleneck hypothesis
3. Design load‑test scenario
4. Execute load test with monitoring linkage
5. Perform attribution analysis
6. Implement optimization
7. Re‑run load test for verification
8. Update capacity model
9. Solidify alert thresholds

10. Final Value Proposition
The real outcome of this engineering chain is not merely “how many QPS the system can achieve,” but rather “the safe, predictable, and scalable capacity the system can sustain under the current resource envelope.” This shift turns ad‑hoc load testing into a reusable, verifiable, and inheritable performance‑engineering capability.
Ray's Galactic Tech