
How to Diagnose and Fix the 8 Most Common Production Issues

This article outlines practical troubleshooting steps for eight frequent production problems (OOM, CPU spikes, interface timeouts, index failures, deadlocks, disk issues, MQ backlogs, and API errors), with guidance and snippets to help engineers identify root causes and resolve them quickly.

Su San Talks Tech

1 OOM Issues

OOM problems in production are serious and can cause services to crash. Different types have different causes.

1.1 Java Heap OOM

Typical log entry:

java.lang.OutOfMemoryError: Java heap space

Add JVM options to dump the heap on OOM:

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=heapdump.hprof

Analyze the dump with MAT or VisualVM to locate the offending code.
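A heap dump is taken after the failure. As a complement, the sketch below reads heap usage from the standard MemoryMXBean so a service can warn before it runs out of memory. The class name HeapUsageCheck, the shouldWarn rule, and the 85% threshold are illustrative assumptions, not part of the original article.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: reports how full the Java heap is.
final class HeapUsageCheck {

    /** Returns used/max heap as a fraction, or -1 if the JVM reports no maximum. */
    static double usedFraction() {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long max = heap.getMax(); // -1 when the maximum is undefined
        if (max <= 0) {
            return -1.0;
        }
        return (double) heap.getUsed() / max;
    }

    /** Assumed alert rule: warn once usage is known and above the threshold. */
    static boolean shouldWarn(double usedFraction, double threshold) {
        return usedFraction >= 0 && usedFraction > threshold;
    }

    public static void main(String[] args) {
        double fraction = usedFraction();
        System.out.printf("heap used fraction: %.3f%n", fraction);
        if (shouldWarn(fraction, 0.85)) {
            System.out.println("WARN: heap usage above 85%, consider capturing a dump");
        }
    }
}
```

A single reading can be misleading right before a collection, so in practice the value would be sampled over time.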

1.2 Native Thread OOM

Log entry:

java.lang.OutOfMemoryError: unable to create native thread

Usually caused by too many threads or oversized thread stacks. Reduce thread count by using a thread pool.
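One way to cap the thread count is a bounded ThreadPoolExecutor, sketched below. The pool size, queue capacity, and the choice of CallerRunsPolicy are assumptions for illustration; the right values depend on the workload.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical example: a pool that never creates more than maxThreads threads.
final class BoundedPoolExample {

    static ThreadPoolExecutor newBoundedPool(int maxThreads, int queueCapacity) {
        return new ThreadPoolExecutor(
                maxThreads, maxThreads,                     // fixed upper bound on threads
                60L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),    // bounded queue, no unbounded growth
                new ThreadPoolExecutor.CallerRunsPolicy()); // when full, the submitter runs the task
    }

    /** Submits taskCount trivial tasks, waits for them, and returns how many ran. */
    static int runTasks(ThreadPoolExecutor pool, int taskCount) {
        AtomicInteger done = new AtomicInteger();
        for (int i = 0; i < taskCount; i++) {
            pool.execute(done::incrementAndGet);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return done.get();
    }

    public static void main(String[] args) {
        ThreadPoolExecutor pool = newBoundedPool(4, 100);
        int completed = runTasks(pool, 1_000);
        System.out.println("completed=" + completed
                + ", largest pool size=" + pool.getLargestPoolSize());
    }
}
```

CallerRunsPolicy applies back-pressure instead of dropping work: when the queue is full, the submitting thread executes the task itself, which slows down submission.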

1.3 StackOverflowError

Log entry:

java.lang.StackOverflowError

Often caused by deep or infinite recursion. Check recursive calls for correctness.
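When the recursion is correct but simply too deep, rewriting it as a loop removes the dependence on stack size. The sketch below contrasts the two forms; the class and method names are made up for illustration.

```java
// Hypothetical example: the same computation written recursively and iteratively.
final class RecursionDepthExample {

    /** Recursive sum of 1..n: one stack frame per level, so a large n can overflow the stack. */
    static long sumRecursive(long n) {
        return n <= 0 ? 0 : n + sumRecursive(n - 1);
    }

    /** Iterative sum of 1..n: constant stack depth regardless of n. */
    static long sumIterative(long n) {
        long total = 0;
        for (long i = 1; i <= n; i++) {
            total += i;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println("iterative: " + sumIterative(10_000_000L));
        try {
            System.out.println("recursive: " + sumRecursive(10_000_000L));
        } catch (StackOverflowError e) {
            // Expected with typical default stack sizes; the exact limit depends on -Xss.
            System.out.println("recursive version overflowed the stack");
        }
    }
}
```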

1.4 GC Overhead Limit Exceeded

Log entry:

java.lang.OutOfMemoryError: GC overhead limit exceeded

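Occurs when the JVM spends almost all of its time in garbage collection yet reclaims very little heap, typically because too many objects stay live. Adjust the GC strategy and the survivor/new generation ratios, and look for code that retains objects unnecessarily.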
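This error is about time spent collecting, so it helps to watch what share of JVM uptime goes to GC. The sketch below estimates that share from the standard GarbageCollectorMXBean counters. The class name GcTimeCheck is an assumption, and with concurrent collectors the reported time may include concurrent phases, so treat the figure as approximate.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper: approximates the share of uptime spent in GC.
final class GcTimeCheck {

    /** Accumulated collection time in milliseconds, summed over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 if the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** GC time divided by uptime; returns 0 when uptime is not positive. */
    static double gcFraction(long gcMillis, long uptimeMillis) {
        return uptimeMillis <= 0 ? 0.0 : (double) gcMillis / uptimeMillis;
    }

    public static void main(String[] args) {
        long uptime = ManagementFactory.getRuntimeMXBean().getUptime();
        long gcMillis = totalGcMillis();
        System.out.printf("gc=%dms uptime=%dms fraction=%.4f%n",
                gcMillis, uptime, gcFraction(gcMillis, uptime));
    }
}
```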

1.5 Metaspace OOM

Log entry:

java.lang.OutOfMemoryError: Metaspace

Since JDK 8, Metaspace replaces PermGen. It can overflow when many classes are loaded. Increase the Metaspace size (the 10m below is only a placeholder; real applications usually need far more):

-XX:MetaspaceSize=10m -XX:MaxMetaspaceSize=10m

2 CPU 100% Issues

CPU saturation often results from long‑running tasks or inefficient code.

Common causes are illustrated below:

Use jstack or Alibaba's Arthas to inspect thread activity.
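When attaching jstack or Arthas is not convenient, similar information is available in-process through the standard ThreadMXBean. The sketch below lists the threads with the highest accumulated CPU time. The class name BusyThreadReport and the output format are assumptions, and the figures are totals since each thread started, not a current rate.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper: reports the threads that have consumed the most CPU time.
final class BusyThreadReport {

    /** Returns up to `limit` lines, highest CPU time first. */
    static List<String> topThreads(int limit) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        List<long[]> usage = new ArrayList<>(); // each entry: {threadId, cpuNanos}
        if (threads.isThreadCpuTimeSupported() && threads.isThreadCpuTimeEnabled()) {
            for (long id : threads.getAllThreadIds()) {
                long cpuNanos = threads.getThreadCpuTime(id); // -1 if the thread is gone
                if (cpuNanos >= 0) {
                    usage.add(new long[] {id, cpuNanos});
                }
            }
        }
        usage.sort(Comparator.comparingLong((long[] e) -> e[1]).reversed());

        List<String> lines = new ArrayList<>();
        int count = Math.max(0, Math.min(limit, usage.size()));
        for (long[] entry : usage.subList(0, count)) {
            ThreadInfo info = threads.getThreadInfo(entry[0]); // null if the thread has exited
            if (info != null) {
                lines.add(String.format("id=%d name=%s state=%s cpuMs=%d",
                        entry[0], info.getThreadName(), info.getThreadState(),
                        entry[1] / 1_000_000));
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        topThreads(5).forEach(System.out::println);
    }
}
```

To see which threads are busy right now, take two samples a few seconds apart and compare the differences.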

3 Interface Timeout Issues

Sudden API timeouts can stem from many factors. Common reasons are shown in the diagram:

4 Index Invalid Issues

A query may stop using an index it was expected to use, which makes it slow. Use EXPLAIN to view the execution plan and confirm which index, if any, is chosen.

Typical causes are illustrated below:

5 Deadlock Issues

MySQL deadlocks occur when multiple transactions each hold locks the others need, so none of them can proceed. Common causes:

Resource contention

Circular wait

Poor transaction design

Concurrent operation conflicts

Improper index usage

Mitigation steps:

Set appropriate transaction isolation levels.

Avoid large transactions.

Optimize SQL performance.

Configure lock wait timeouts.

Increase monitoring and analysis.
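Beyond the steps above, applications commonly retry a transaction that MySQL chose as the deadlock victim. The sketch below shows one way to do that, assuming the driver reports MySQL error code 1213 or SQLSTATE 40001 for a deadlock. The names DeadlockRetry, TxWork, and withRetry are hypothetical, and the work passed in is expected to run the whole transaction again from the start.

```java
import java.sql.SQLException;

// Hypothetical helper: re-runs a transaction when it is rolled back as a deadlock victim.
final class DeadlockRetry {

    /** One complete transaction; it must begin, do its work, and commit on every call. */
    @FunctionalInterface
    interface TxWork<T> {
        T run() throws SQLException;
    }

    static final int MYSQL_DEADLOCK_ERROR = 1213;      // ER_LOCK_DEADLOCK
    static final String DEADLOCK_SQL_STATE = "40001";

    static boolean isDeadlock(SQLException e) {
        return e.getErrorCode() == MYSQL_DEADLOCK_ERROR
                || DEADLOCK_SQL_STATE.equals(e.getSQLState());
    }

    /** Runs the work, retrying only on deadlock, up to maxAttempts attempts in total. */
    static <T> T withRetry(TxWork<T> work, int maxAttempts) throws SQLException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        SQLException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return work.run();
            } catch (SQLException e) {
                if (!isDeadlock(e)) {
                    throw e; // other failures are not retried
                }
                last = e;
                if (attempt < maxAttempts) {
                    pause(10L * attempt); // short, growing pause before the next attempt
                }
            }
        }
        throw last;
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) throws SQLException {
        int[] calls = {0};
        // Simulated transaction: fails twice with a deadlock, then succeeds.
        String result = withRetry(() -> {
            if (++calls[0] < 3) {
                throw new SQLException("Deadlock found", DEADLOCK_SQL_STATE, MYSQL_DEADLOCK_ERROR);
            }
            return "committed";
        }, 5);
        System.out.println(result + " after " + calls[0] + " attempts");
    }
}
```

Retrying hides occasional deadlocks but does not remove their cause, so frequent retries are still worth investigating.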

6 Disk Issues

Disk failures or insufficient space are common. Check usage with:

df -Hl

Monitor total size, used space, and available space. Clean /tmp and old logs to free space.
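The same figures can be read from inside a Java service, for example to raise an alert before the disk fills up. The sketch below uses java.io.File for this; the class name DiskSpaceCheck and the 90% threshold are illustrative assumptions.

```java
import java.io.File;

// Hypothetical helper: reports how full each filesystem root is.
final class DiskSpaceCheck {

    /** Fraction of the volume in use, based on total and usable bytes. */
    static double usedFraction(long totalBytes, long usableBytes) {
        if (totalBytes <= 0) {
            return 0.0; // unknown or empty volume
        }
        return 1.0 - (double) usableBytes / totalBytes;
    }

    public static void main(String[] args) {
        double threshold = 0.90; // assumed alert threshold
        for (File root : File.listRoots()) {
            long total = root.getTotalSpace();
            long usable = root.getUsableSpace();
            double used = usedFraction(total, usable);
            System.out.printf("%s total=%dMB usable=%dMB used=%.1f%%%s%n",
                    root.getPath(), total / 1_000_000, usable / 1_000_000, used * 100,
                    used > threshold ? "  <-- above threshold" : "");
        }
    }
}
```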

7 MQ Message Backlog

Backlogs happen when consumers process messages slower than producers generate them.

Possible causes:

Producers send messages in large batches, creating bursts that outpace consumers.

Consumers suffer from slow business logic (e.g., MySQL index issues).

Solutions include increasing consumer thread pool size or scaling out consumer instances, and optimizing indexing and SQL.
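The effect of a larger consumer thread pool can be sketched without a real broker. The example below uses an in-memory BlockingQueue as a stand-in for the MQ and drains it with a configurable number of workers. Every name in it is hypothetical, and the 1 ms sleep only simulates slow business logic.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical simulation: an in-memory queue stands in for the message broker.
final class ConsumerScalingSketch {

    /** Builds a backlog of `size` pending messages. */
    static BlockingQueue<String> backlog(int size) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        for (int i = 0; i < size; i++) {
            queue.add("msg-" + i);
        }
        return queue;
    }

    /** Drains the queue with `workers` consumer threads; returns the number processed. */
    static int drain(BlockingQueue<String> messages, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        AtomicInteger processed = new AtomicInteger();
        for (int w = 0; w < workers; w++) {
            pool.execute(() -> {
                String msg;
                while ((msg = messages.poll()) != null) { // stop once the backlog is empty
                    handle(msg);
                    processed.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(30, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return processed.get();
    }

    /** Simulated slow business logic (about 1 ms per message). */
    static void handle(String msg) {
        try {
            Thread.sleep(1);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        for (int workers : new int[] {1, 8}) {
            long start = System.nanoTime();
            int processed = drain(backlog(500), workers);
            long elapsedMs = (System.nanoTime() - start) / 1_000_000;
            System.out.println("workers=" + workers + " processed=" + processed
                    + " elapsedMs=" + elapsedMs);
        }
    }
}
```

More workers only help while the downstream resource (for example the database) can absorb the extra load; if the bottleneck is a slow query, fixing the query comes first.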

8 API Error Codes

Various HTTP status codes indicate different problems:

8.1 401 Unauthorized

Occurs when authentication credentials are missing or invalid.

8.2 403 Forbidden

Authentication succeeded but the user lacks permission.

8.3 404 Not Found

Requested endpoint does not exist, often due to version changes or gateway misconfiguration.

8.4 405 Method Not Allowed

Wrong HTTP method used (e.g., GET instead of POST).

8.5 500 Internal Server Error

Server-side exception; check the error logs and log the request parameters to help reproduce the failure.

8.6 502 Bad Gateway

The gateway received an invalid response from the upstream service, often because the service is restarting or has crashed; restarting the service usually helps.

8.7 504 Gateway Timeout

Gateway timed out waiting for the upstream service; optimize the service code to reduce latency.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.

Written by

Su San Talks Tech

Su San, former staff at several leading tech companies, is a top creator on Juejin and a premium creator on CSDN, and runs the free coding practice site www.susan.net.cn.
