Building a Full Performance Engineering Loop with Spring Boot, SkyWalking, and Prometheus

This guide walks through constructing a sustainable performance‑engineering pipeline—from monitoring and metrics collection with SkyWalking, Prometheus, and Grafana, through targeted load testing and bottleneck analysis, to capacity modeling and alert solidification—for Spring Boot services.

Ray's Galactic Tech
Performance‑Engineering Loop Overview

True performance engineering goes beyond a single load‑test run; it establishes a continuous loop that collects monitoring data, discovers bottlenecks, runs targeted stress tests, validates optimizations, evaluates capacity, solidifies alerts, and repeats.

Monitoring → Bottleneck Insight → Targeted Load Test → Optimization Verification → Capacity Evaluation → Alert Solidification → Continuous Regression

Overall Architecture

The loop combines four core tools:

SkyWalking – provides distributed tracing (“view the call chain”).

Prometheus – gathers resource and business metrics (“view resources & business indicators”).

Grafana – visualizes metrics and traces in a unified dashboard.

Load‑testing tool – generates the controlled load that exposes the problems to be investigated.

[Figure: architecture of the complete system]

1. Monitoring Stack Deployment

Deploy Prometheus, Grafana, SkyWalking OAP, SkyWalking UI, and Elasticsearch with Docker Compose:

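version: '3.8'
services:
  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    extra_hosts:
      # Lets the container resolve host.docker.internal on a Linux host
      - "host.docker.internal:host-gateway"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin   # demo only; change outside local testing

  skywalking-oap:
    image: apache/skywalking-oap-server
    depends_on:
      - elasticsearch
    environment:
      - SW_STORAGE=elasticsearch
      - SW_STORAGE_ES_CLUSTER_NODES=elasticsearch:9200   # default is localhost:9200
      - SW_TELEMETRY=prometheus   # expose OAP self-metrics on port 1234
    ports:
      - "11800:11800"   # gRPC, used by agents
      - "12800:12800"   # HTTP, used by the UI

  skywalking-ui:
    image: apache/skywalking-ui
    depends_on:
      - skywalking-oap
    environment:
      - SW_OAP_ADDRESS=http://skywalking-oap:12800
    ports:
      - "8088:8080"   # host port 8080 is left free for the Spring Boot app

  elasticsearch:
    image: elasticsearch:7.10.0
    environment:
      - discovery.type=single-node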

2. Spring Boot Integration

SkyWalking Agent

java -javaagent:/opt/skywalking/agent/skywalking-agent.jar \
  -Dskywalking.agent.service_name=order-service \
  -jar app.jar

Expose Prometheus Metrics via Micrometer

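<dependency>
  <groupId>org.springframework.boot</groupId>
  <artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-registry-prometheus</artifactId>
</dependency>

Then enable the endpoint in application.yml:

management:
  endpoints:
    web:
      exposure:
        include: prometheus,health,info
  metrics:
    export:
      prometheus:
        enabled: true   # Spring Boot 2.x key; on 3.x use management.prometheus.metrics.export.enabled
    distribution:
      percentiles-histogram:
        http.server.requests: true   # emit _bucket series so histogram_quantile() has data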

Metrics are available at http://localhost:8080/actuator/prometheus.

3. Prometheus Scrape Configuration

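global:
  scrape_interval: 5s   # short interval for load-test resolution

scrape_configs:
  - job_name: 'springboot'
    metrics_path: '/actuator/prometheus'
    static_configs:
      # Spring Boot app running on the Docker host
      - targets: ['host.docker.internal:8080']

  - job_name: 'skywalking'
    static_configs:
      # OAP self-observability metrics; requires SW_TELEMETRY=prometheus on the OAP container
      - targets: ['skywalking-oap:1234']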

4. Grafana Core Metrics

Key business and system indicators:

QPS:

sum(rate(http_server_requests_seconds_count[1m]))

P99 latency:

histogram_quantile(0.99, sum(rate(http_server_requests_seconds_bucket[5m])) by (le))

Error rate:

sum(rate(http_server_requests_seconds_count{status=~"5.."}[1m])) / sum(rate(http_server_requests_seconds_count[1m]))

CPU usage (fraction of available CPU, 0 to 1):

process_cpu_usage

JVM heap:

sum(jvm_memory_used_bytes{area="heap"})

GC pauses (pauses per second, and seconds spent paused per second):

rate(jvm_gc_pause_seconds_count[1m])
rate(jvm_gc_pause_seconds_sum[1m])
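
The P99 expression above is an estimate derived from cumulative histogram buckets, not an exact percentile. The following is a simplified Python sketch of the linear interpolation that histogram_quantile performs, included only to show why bucket boundaries limit precision. The function name, the input shape, and the sample bucket counts are illustrative assumptions; edge cases that Prometheus handles (such as histograms with fewer than two buckets) are not reproduced.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, including one
    entry whose upper bound is math.inf, mirroring the "le" label of the
    *_bucket series.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile lies beyond the last finite bucket:
                # report the highest finite upper bound.
                return prev_bound
            if count == prev_count:
                return bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical request-latency buckets in seconds.
sample = [(0.1, 50), (0.5, 90), (1.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, sample)
print(round(p99, 3))  # 0.95
```

Because the result is interpolated inside the 0.5 s to 1.0 s bucket, the reported P99 can be no more precise than that bucket's width; SLA thresholds are best placed on or near a bucket boundary.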

5. Monitoring‑Driven Load Testing

The goal of a load test is to verify a bottleneck hypothesis rather than simply “run as many requests as possible.” Example hypotheses derived from observed metrics:

P99 is high → suspect code hotspot.

CPU low but QPS stalls → thread‑pool saturation.

DB connection pool full → database bottleneck.

GC spikes → memory pressure.

k6 Load‑Test Script Example

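import http from 'k6/http';
import { check, sleep } from 'k6';

// Ramp virtual users in three stages to find where throughput stops scaling.
export const options = {
  stages: [
    { duration: '1m', target: 50 },
    { duration: '2m', target: 200 },
    { duration: '2m', target: 500 },
  ],
};

export default function () {
  const res = http.get('http://localhost:8080/api/order/create');
  // Record non-200 responses so failures show up in the k6 summary.
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1); // think time between iterations for each virtual user
}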

6. Standardized Bottleneck Attribution Matrix

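Each row pairs a typical observation with its SkyWalking clue, its Prometheus signal, and the resulting conclusion:

QPS not increasing – SkyWalking: traces look normal; Prometheus: active threads at the pool maximum → concurrency‑model bottleneck.

Latency spikes – SkyWalking: one span accounts for most of the trace time; Prometheus: CPU low → code hotspot.

Error rate rising – SkyWalking: downstream call failures; Prometheus: DB connection pool full → database bottleneck.

Latency jitter – SkyWalking: trace durations widely dispersed; Prometheus: GC count high → JVM memory issue.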
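
The matrix above can also be encoded as a small lookup so that attribution is applied the same way after every test run. This is a minimal Python sketch under stated assumptions: the Signals fields, the observation keys, and the attribute function are hypothetical names, and each boolean would have to be derived from SkyWalking and Prometheus data using thresholds that the article does not specify.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    """Boolean readings taken during a load test (all names are illustrative)."""
    trace_normal: bool = False         # SkyWalking: no abnormal span in the trace
    slow_span: bool = False            # SkyWalking: one span dominates trace time
    downstream_failures: bool = False  # SkyWalking: failing downstream calls
    trace_dispersion: bool = False     # SkyWalking: trace durations vary widely
    thread_pool_full: bool = False     # Prometheus: active threads at pool maximum
    cpu_low: bool = False              # Prometheus: CPU well below saturation
    db_pool_full: bool = False         # Prometheus: DB connection pool exhausted
    gc_count_high: bool = False        # Prometheus: elevated GC count


def attribute(observation: str, s: Signals) -> str:
    """Return the matrix conclusion when both clues agree, else 'inconclusive'."""
    rules = {
        "qps_not_increasing": (s.trace_normal and s.thread_pool_full,
                               "concurrency model bottleneck"),
        "latency_spikes": (s.slow_span and s.cpu_low, "code hotspot"),
        "error_rate_rise": (s.downstream_failures and s.db_pool_full,
                            "database bottleneck"),
        "latency_jitter": (s.trace_dispersion and s.gc_count_high,
                           "JVM memory issue"),
    }
    matched, conclusion = rules.get(observation, (False, ""))
    return conclusion if matched else "inconclusive"


print(attribute("qps_not_increasing",
                Signals(trace_normal=True, thread_pool_full=True)))
# prints "concurrency model bottleneck"
```

Returning "inconclusive" when only one of the two clues is present keeps the sketch from over-attributing; in that case the hypothesis needs another targeted test.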

7. Capacity Evaluation Model

Define “single‑node safe capacity” as the maximum QPS that simultaneously satisfies:

P99 < SLA (e.g., 500 ms)

Error rate < 0.1 %

CPU < 70 %

DB connection pool < 70 %

Thread‑pool queue empty

Example numbers: safe QPS per node = 1800, replicas = 4.

Total capacity ≈ 1800 × 4 × 0.8 ≈ 5760 QPS

0.8 is a redundancy/failure‑buffer factor.
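
The capacity model can be written as a short calculation. The sketch below, in Python, picks the highest tested QPS step that meets all five conditions and then applies the replica count and buffer factor. StepResult, is_safe, safe_capacity, and cluster_capacity are hypothetical names, and the sample step data is invented for illustration; only the thresholds and the 1800 × 4 × 0.8 example come from the text above.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    """Measurements for one load step on a single node."""
    qps: float
    p99_ms: float
    error_rate: float     # fraction: 0.001 means 0.1 %
    cpu: float            # fraction of available CPU, 0 to 1
    db_pool_usage: float  # fraction of DB connections in use, 0 to 1
    queue_size: int       # tasks waiting in the thread-pool queue


def is_safe(r: StepResult, sla_ms: float = 500.0) -> bool:
    """True when the step meets every safe-capacity condition."""
    return (r.p99_ms < sla_ms
            and r.error_rate < 0.001
            and r.cpu < 0.70
            and r.db_pool_usage < 0.70
            and r.queue_size == 0)


def safe_capacity(results, sla_ms: float = 500.0) -> float:
    """Highest tested QPS whose step is safe, or 0.0 if none is."""
    return max((r.qps for r in results if is_safe(r, sla_ms)), default=0.0)


def cluster_capacity(safe_qps: float, replicas: int, buffer: float = 0.8) -> float:
    """Total capacity after applying the redundancy/failure-buffer factor."""
    return safe_qps * replicas * buffer


# Invented step results for one node.
steps = [
    StepResult(1200, 180, 0.0000, 0.45, 0.40, 0),
    StepResult(1800, 420, 0.0005, 0.68, 0.65, 0),
    StepResult(2400, 910, 0.0120, 0.88, 0.95, 37),  # violates every limit
]
per_node = safe_capacity(steps)
print(per_node)                              # 1800
print(round(cluster_capacity(per_node, 4)))  # 5760
```

Capacity is only known at the QPS levels actually tested, so finer load steps near the first failing step give a tighter estimate.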

8. Test‑Scenario Asset Library

baseline‑qps – establish performance baseline.

hotspot‑api – stress hotspot endpoints.

order‑link – core transaction path.

peak‑traffic – evaluate peak capacity.

soak‑test – long‑duration stability (1‑2 h).

9. Standard Operating Procedure (SOP)

1. Detect anomaly via monitoring
2. Propose bottleneck hypothesis
3. Design load‑test scenario
4. Execute load test with monitoring linkage
5. Perform attribution analysis
6. Implement optimization
7. Re‑run load test for verification
8. Update capacity model
9. Solidify alert thresholds

10. Final Value Proposition

The real outcome of this engineering chain is not merely “how many QPS the system can achieve,” but rather “the safe, predictable, and scalable capacity the system can sustain under the current resource envelope.” This shift turns ad‑hoc load testing into a reusable, verifiable, and inheritable performance‑engineering capability.
