How to Diagnose Sudden Service Latency Spikes: A Step‑by‑Step Guide

This guide explains how to identify and resolve sudden latency increases in running services by collecting key metrics, focusing on GC behavior, thread and lock analysis, CPU and system resources, and using diagnostic tools such as jstat, jstack, and distributed tracing.

Architect Chen

Initial Confirmation and Data Collection

Start by collecting the following indicators: response time (P95/P99), throughput, concurrency, error rate, CPU usage, memory usage, and GC metrics (collection count and pause time). Then align four critical curves on a shared time axis (interface response time, GC pause time, CPU utilization, and active thread count) and compare their trends before and after the anomaly to see which of them changes together with the latency.
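As a rough illustration of what the P95/P99 figures mean, the sketch below computes nearest-rank percentiles over a window of response times. The class name LatencyPercentiles and the sample values are made up for this example; in practice these numbers usually come from the monitoring system rather than hand-written code.

```java
import java.util.Arrays;

final class LatencyPercentiles {

    /**
     * Nearest-rank percentile: the smallest sample such that at least
     * p percent of all samples are less than or equal to it.
     */
    static long percentile(long[] samplesMillis, double p) {
        if (samplesMillis.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMillis.clone(); // leave the caller's array untouched
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length); // 1-based rank
        return sorted[Math.max(rank, 1) - 1];
    }

    public static void main(String[] args) {
        // Made-up response times in milliseconds, before and during a spike.
        long[] before = {12, 14, 11, 13, 15, 12, 16, 14, 13, 12};
        long[] during = {12, 14, 11, 13, 480, 12, 16, 14, 950, 12};
        System.out.printf("before: P50=%d P95=%d P99=%d%n",
                percentile(before, 50), percentile(before, 95), percentile(before, 99));
        System.out.printf("during: P50=%d P95=%d P99=%d%n",
                percentile(during, 50), percentile(during, 95), percentile(during, 99));
    }
}
```

With these sample values the median stays at 13 ms while P99 jumps from 16 ms to 950 ms, which is why tail percentiles rather than averages are the curves worth aligning.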

Key Investigation Directions

Garbage Collection (GC): Examine GC logs and monitoring data. Frequent Full GCs or long GC pauses stall application threads and translate directly into latency spikes. Pay attention to old-generation occupancy, promotion failures, memory fragmentation, and whether the collector in use (CMS, G1, etc.) suits the workload. If GC is the root cause, adjust the heap size, the collector, or the Metaspace and direct-memory settings, or reduce allocation pressure by optimizing object allocation and caching.
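The GC count, GC time, and generation occupancy mentioned above can also be read from inside the JVM through the standard java.lang.management beans. The following is a minimal sketch, assuming a hypothetical helper class named GcSnapshot; pool names such as "G1 Old Gen" or "PS Old Gen" depend on the collector in use, and the reported collection time only approximates pause time when the collector does part of its work concurrently.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

final class GcSnapshot {

    /** Total collections so far across all collectors; undefined (-1) values count as 0. */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(gc.getCollectionCount(), 0);
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds across all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(gc.getCollectionTime(), 0);
        }
        return total;
    }

    /** One line per heap memory pool (eden, survivor, old generation, ...). */
    static String describeHeapPools() {
        StringBuilder out = new StringBuilder();
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage();
            if (pool.getType() != MemoryType.HEAP || usage == null) {
                continue;
            }
            long max = usage.getMax(); // -1 when the pool has no defined maximum
            String pct = max > 0
                    ? String.format("%.1f%%", 100.0 * usage.getUsed() / max)
                    : "n/a";
            out.append(String.format("%-24s used=%d KB (%s of max)%n",
                    pool.getName(), usage.getUsed() / 1024, pct));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println("GC count: " + totalGcCount()
                + ", GC time: " + totalGcTimeMillis() + " ms");
        System.out.print(describeHeapPools());
    }
}
```

Sampling these values periodically and plotting the deltas gives the GC curve to line up against response time.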

Threads and Locks: Capture thread stacks with jstack to detect blocked threads, deadlocks, long lock waits, and contended synchronized methods. Also check thread-pool queues: a saturated pool or a growing backlog makes requests queue up or time out, which calls for resizing the pool or optimizing the business logic.
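The same checks can be sketched in-process with ThreadMXBean, which exposes much of what a jstack dump shows. The class ThreadSnapshot and the saturation rule in looksSaturated are assumptions made for illustration, not a standard definition; deadlock detection through findDeadlockedThreads is also assumed to be supported by the JVM, as it is on HotSpot.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ThreadPoolExecutor;

final class ThreadSnapshot {

    /** Number of live threads in each state (RUNNABLE, BLOCKED, WAITING, ...). */
    static Map<Thread.State, Integer> countByState() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info != null) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Threads deadlocked on monitors or ownable synchronizers; 0 when there are none. */
    static int deadlockedThreadCount() {
        long[] ids = ManagementFactory.getThreadMXBean().findDeadlockedThreads();
        return ids == null ? 0 : ids.length;
    }

    /** One line per BLOCKED thread: which lock it waits for and which thread holds it. */
    static String describeBlockedThreads() {
        StringBuilder out = new StringBuilder();
        for (ThreadInfo info : ManagementFactory.getThreadMXBean().dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                out.append(String.format("%s waits for %s held by %s%n",
                        info.getThreadName(), info.getLockName(), info.getLockOwnerName()));
            }
        }
        return out.toString();
    }

    /** Illustrative rule: every worker is busy and tasks are already queuing. */
    static boolean looksSaturated(ThreadPoolExecutor pool) {
        return pool.getActiveCount() >= pool.getMaximumPoolSize()
                && !pool.getQueue().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println("threads by state: " + countByState());
        System.out.println("deadlocked threads: " + deadlockedThreadCount());
        System.out.print(describeBlockedThreads());
    }
}
```

A sudden rise in the BLOCKED or WAITING counts at the moment latency climbs points toward lock contention or an exhausted pool rather than GC.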

CPU and System Resources: Check CPU usage and load with tools such as top and iostat. High CPU usage may indicate hot code paths or excessive GC activity. Also examine I/O saturation, context-switch rates, open file handles, socket counts, disk latency, and network packet loss.
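To narrow high CPU usage down to specific threads, one option is to sample per-thread CPU time twice and compare. The sketch below assumes a hypothetical class CpuSnapshot and a JVM that supports thread CPU time measurement; when it does not, the map simply comes back empty, and the load average is reported as -1 on platforms that do not provide it.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

final class CpuSnapshot {

    /** One-minute system load average divided by core count, or -1 if unavailable. */
    static double loadPerCore() {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // negative when not available
        return load < 0 ? -1.0 : load / os.getAvailableProcessors();
    }

    /** CPU nanoseconds consumed by each live thread during the sampling window. */
    static Map<String, Long> cpuNanosByThread(long windowMillis) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        Map<String, Long> result = new HashMap<>();
        if (!threads.isThreadCpuTimeSupported() || !threads.isThreadCpuTimeEnabled()) {
            return result;
        }
        Map<Long, Long> start = new HashMap<>();
        for (long id : threads.getAllThreadIds()) {
            start.put(id, threads.getThreadCpuTime(id));
        }
        try {
            Thread.sleep(windowMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return result;
        }
        for (long id : threads.getAllThreadIds()) {
            Long before = start.get(id);
            long now = threads.getThreadCpuTime(id); // -1 if the thread has died
            ThreadInfo info = threads.getThreadInfo(id);
            if (before == null || before < 0 || now < 0 || info == null) {
                continue;
            }
            result.put(info.getThreadName() + "#" + id, now - before);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.printf("load per core: %.2f%n", loadPerCore());
        cpuNanosByThread(1000).entrySet().stream()
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .limit(5)
                .forEach(e -> System.out.printf("%-40s %d ms CPU%n",
                        e.getKey(), e.getValue() / 1_000_000));
    }
}
```

If the busiest threads turn out to be GC workers, the CPU problem is a GC problem; if they are request handlers, their stacks show the hot code.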

Application Layer: Review recent code releases, dependency upgrades, and configuration changes. Look for slow queries, hot data, cache misses, or cache penetration, any of which can suddenly increase the pressure on backend systems.

Diagnostic Tools and Methods

Use the JDK's command-line utilities to inspect a running JVM: jstat for heap usage and GC statistics, jmap for heap dumps, jstack for thread snapshots, and jcmd, which covers most of the same operations through a single command.
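When attaching a command-line tool is inconvenient, a heap dump can also be triggered from inside the process. This is a minimal sketch, assuming a HotSpot-based JVM (HotSpotDiagnosticMXBean lives in the com.sun.management package and is not part of the Java SE standard API); the class name HeapDumper and the output location are made up for the example. Writing a dump pauses the application, and liveOnly=true forces a full GC first, so use it sparingly on a production instance.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.lang.management.ManagementFactory;

final class HeapDumper {

    /**
     * Writes a heap dump in hprof format to the given file, which must not exist yet.
     * With liveOnly=true only reachable objects are dumped.
     */
    static File dump(File target, boolean liveOnly) {
        HotSpotDiagnosticMXBean diagnostics =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        try {
            diagnostics.dumpHeap(target.getAbsolutePath(), liveOnly);
        } catch (IOException e) {
            throw new UncheckedIOException("heap dump failed: " + target, e);
        }
        return target;
    }

    public static void main(String[] args) {
        // Recent JDKs require the file name to end in ".hprof".
        File target = new File(System.getProperty("java.io.tmpdir"),
                "heap-" + System.currentTimeMillis() + ".hprof");
        File written = dump(target, true);
        System.out.println("wrote " + written
                + " (" + written.length() / (1024 * 1024) + " MB)");
    }
}
```

The resulting file can be opened in a heap analyzer such as VisualVM to look for the objects filling the old generation.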

GC log analysis tools (e.g., GCViewer) and profilers (YourKit, VisualVM, async-profiler) help locate hot methods and allocation hotspots. Distributed tracing systems such as Zipkin or Jaeger, APM solutions such as SkyWalking, and metrics stacks such as Prometheus + Grafana are useful for pinpointing where latency arises between services. Thread snapshots and sampling profilers are usually light enough to run against a live service, but heap dumps pause the JVM while they are written, so take them with care in production.

Written by

Architect Chen

Sharing over a decade of architecture experience from Baidu, Alibaba, and Tencent.
