Mastering Arthas: Fast, Non‑Intrusive Debugging of Java Production Issues
This guide demonstrates how Arthas enables rapid, non‑intrusive diagnosis of Java production problems by addressing common online challenges such as slow interfaces, thread blocking, memory leaks, hot‑fixes, and data inconsistencies, offering concrete commands, code examples, and best‑practice tips for reliable operations.
Preface
I have been woken up at 3 am by alarms and experienced the pain of locating production problems.
90% of online issues stem from the "three unknowns": not knowing which part is slow, who is blocked, and why it fails.
This article shows how to use Arthas to quickly locate online problems.
1. Why Conventional Tools Fail Online
Three Special Characteristics of Online Environments
Traditional Tool Limitations
Log loss : critical parameters not printed, cannot reproduce later.
Monitoring lag : 1‑minute granularity misses instantaneous anomalies.
JProfiler paralysis : cannot open when CPU spikes.
Arthas Advantages
# 1 second to attach to production environment
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar
# Auto‑detect Java process
2. Five Typical Problem‑Location Scenarios
Scenario 1: Slow Interface
Symptom : 99% of order‑query requests return within 200 ms, but 1% spike to 5 s.
Traditional approach :
// Blindly add log
log.info("Query start: {}", System.currentTimeMillis()); // pollutes the log and is inefficient
Arthas precise hit :
# 1. Trace internal call path
trace com.example.OrderService getOrderById '#cost>1000' -n 5
Flame graph output:
Root cause : occasional TCP timeout in risk‑control service.
Solution :
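The configuration keys in this solution are application-specific. As a minimal sketch of what the two limits mean in code, assuming the risk‑control service is called over HTTP with the JDK's java.net.http client (the source does not say which client is used; the class name RiskClientTimeouts and the URL in the usage note are hypothetical):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

public class RiskClientTimeouts {
    // Values mirror the 500 / 1000 settings used in this scenario (assumed to be milliseconds).
    public static final Duration CONNECT_TIMEOUT = Duration.ofMillis(500);
    public static final Duration RESPONSE_TIMEOUT = Duration.ofMillis(1000);

    /** Client that gives up on establishing a connection after CONNECT_TIMEOUT. */
    public static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)
                .build();
    }

    /** Request that fails with HttpTimeoutException if no response arrives within RESPONSE_TIMEOUT. */
    public static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(RESPONSE_TIMEOUT)
                .GET()
                .build();
    }
}
```

A caller would pair the two, for example newClient().send(newRequest("http://risk.example.internal/check"), ...), so that a stalled connection fails fast instead of holding the order query for seconds.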
# Adjust connection timeout
risk:
  client:
    connection-timeout: 500
    read-timeout: 1000
Scenario 2: Thread Blocking Mystery
Symptom : Payment callback interface hangs at midnight.
Traditional approach :
jstack pid > thread.log # by then the blocking has already ended
Arthas breakthrough :
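# 1. Find the thread that is blocking others
thread -b # prints the thread holding the lock that other threads wait on
# 2. Monitor lock contention
options unsafe true # needed before enhancing java.* classes
watch java.util.concurrent.locks.ReentrantLock getQueueLength
Diagnostic report: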
Root cause : Logback synchronous logging blocks business threads.
Solution :
<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
  <queueSize>1024</queueSize>
  <appender-ref ref="FILE"/>
</appender>
Scenario 3: Memory Leak Capture
Symptom : Container restarts daily.
Traditional approach :
jmap -histo:live pid # triggers Full GC and destroys snapshot
Arthas magic :
# 1. Monitor heap usage
dashboard -i 5000 # refresh every 5 s
# 2. Inspect live instances of the suspect class
vmtool --action getInstances --className LoginDTO --limit 10
Abnormal finding: [LoginDTO] instances: 245,680 (growth 0.5%/min)
Root cause : ThreadLocal not cleared.
public class UserHolder {
    private static ThreadLocal<LoginDTO> cache = new ThreadLocal<>();
    public static void set(LoginDTO dto) { cache.set(dto); }
}
Solution :
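The finally‑block cleanup in this solution calls UserHolder.remove(), which the holder shown above does not define. A minimal sketch of a completed holder, plus a small demonstration of why the value survives on a pooled thread; the LoginDTO field, the get() accessor, and the wrapper class UserHolderSketch are assumptions for illustration:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class UserHolderSketch {
    /** Stand-in for the article's LoginDTO; its single field is hypothetical. */
    static final class LoginDTO {
        final String user;
        LoginDTO(String user) { this.user = user; }
    }

    /** Holder with the missing remove() added. */
    static final class UserHolder {
        private static final ThreadLocal<LoginDTO> CACHE = new ThreadLocal<>();
        static void set(LoginDTO dto) { CACHE.set(dto); }
        static LoginDTO get() { return CACHE.get(); }
        static void remove() { CACHE.remove(); }
    }

    /**
     * Runs two consecutive "requests" on the same pooled thread and reports
     * whether the second one can still see the first one's LoginDTO.
     */
    public static boolean staleValueVisible(boolean cleanup) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable request = () -> {
                try {
                    UserHolder.set(new LoginDTO("alice"));
                    // business code would run here
                } finally {
                    if (cleanup) {
                        UserHolder.remove();
                    }
                }
            };
            pool.submit(request).get();
            Callable<LoginDTO> nextRequest = UserHolder::get;
            return pool.submit(nextRequest).get() != null;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Without cleanup the second task still sees the first task's object, because the worker thread and its ThreadLocal map outlive the request; with cleanup it sees nothing.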
try {
    // business code
} finally {
    UserHolder.remove(); // force cleanup
}
Scenario 4: Hot‑Fix Code Saves Crash
Symptom : New pagination query causes OOM; rollback takes an hour.
Traditional approach :
Approval process
Merge code
Compile & package
Redeploy → heavy business loss
Arthas rescue :
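# 1. Decompile the problematic class to a source file
jad --source-only com.example.UserService > /tmp/UserService.java
# 2. Edit the local file (outside the Arthas console)
vi /tmp/UserService.java # fix memory leak
# 3. Compile with the application's class loader (hash shown by sc -d com.example.UserService)
mc -c 327a3b4 /tmp/UserService.java -d /tmp
# 4. Hot‑update class
redefine -c 327a3b4 /tmp/com/example/UserService.class
Hot‑update principle diagram: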
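In outline, a hot update of this kind rests on the JVM's java.lang.instrument API: an agent holding an Instrumentation handle passes the new bytecode for an already loaded class to redefineClasses. The following is a sketch of that mechanism only, not Arthas's own code; the class name HotSwapSketch and its method names are hypothetical:

```java
import java.io.IOException;
import java.lang.instrument.ClassDefinition;
import java.lang.instrument.Instrumentation;
import java.lang.instrument.UnmodifiableClassException;
import java.nio.file.Files;
import java.nio.file.Path;

public class HotSwapSketch {
    /** Pairs an already loaded class with its replacement bytecode. */
    public static ClassDefinition definitionFor(Class<?> loaded, byte[] newBytecode) {
        return new ClassDefinition(loaded, newBytecode);
    }

    /**
     * Replaces the method bodies of a loaded class with those in classFile.
     * The Instrumentation handle is the one the JVM passes to an agent's
     * premain/agentmain method.
     */
    public static void hotSwap(Instrumentation inst, Class<?> loaded, Path classFile)
            throws IOException, ClassNotFoundException, UnmodifiableClassException {
        if (!inst.isRedefineClassesSupported()) {
            throw new IllegalStateException("agent does not support class redefinition");
        }
        byte[] bytes = Files.readAllBytes(classFile);
        inst.redefineClasses(definitionFor(loaded, bytes));
    }
}
```

The JVM only accepts redefinitions that change method bodies; adding or removing fields or methods, or changing signatures, is rejected, which is why a hot fix has to stay within the existing class shape.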
Scenario 5: Data Inconsistency Mystery
Symptom : Order status shows paid but database not updated.
Arthas investigation :
# 1. Monitor method params/return
watch com.service.OrderService updateStatus "{params,returnObj}" -x 3
# 2. Observe call chain
stack com.service.OrderService updateStatus
Captured abnormal call chain:
updateStatus(OrderStatus.PAID) // correct call
  |- Thread1: payment callback
updateStatus(OrderStatus.CREATED) // abnormal call
  |- Thread2: order query compensation task
Root cause : Compensation task incorrectly overwrites status.
Solution :
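One way to make the guard explicit is a transition table that lists which status changes are legal and rejects everything else. This is a sketch under assumptions: only CREATED and PAID appear in the source, while CANCELLED, the class name OrderStateMachine, and the transition method are hypothetical.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class OrderStateMachine {
    /** CREATED and PAID come from the article; CANCELLED is a hypothetical extra state. */
    public enum OrderStatus { CREATED, PAID, CANCELLED }

    // For each current status, the set of statuses it may move to.
    private static final Map<OrderStatus, Set<OrderStatus>> ALLOWED =
            new EnumMap<>(OrderStatus.class);

    static {
        ALLOWED.put(OrderStatus.CREATED, EnumSet.of(OrderStatus.PAID, OrderStatus.CANCELLED));
        ALLOWED.put(OrderStatus.PAID, EnumSet.noneOf(OrderStatus.class));
        ALLOWED.put(OrderStatus.CANCELLED, EnumSet.noneOf(OrderStatus.class));
    }

    /** Returns the new status, or throws if the move is not in the table. */
    public static OrderStatus transition(OrderStatus current, OrderStatus next) {
        if (!ALLOWED.get(current).contains(next)) {
            throw new IllegalStateException(
                    "State rollback prohibited: " + current + " -> " + next);
        }
        return next;
    }
}
```

With this table the compensation task's PAID to CREATED write fails loudly instead of silently overwriting the payment callback. In a real service the check would also need to be atomic with the database write, for example by making the update conditional on the current status.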
// Add state‑machine validation
if (currentStatus != CREATED) {
    throw new IllegalStateException("State rollback prohibited");
}
3. Arthas Underlying Principles
Why Non‑Intrusive Diagnosis Works?
Arthas uses the Attach mechanism to inject an agent via VirtualMachine.attach, weaves bytecode with ASM, and isolates classes with a custom ClassLoader to avoid polluting business code.
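The attach step can be sketched with the JDK's own Attach API (module jdk.attach). This is an illustration of the mechanism, not Arthas's source; the class name AttachSketch and the agent jar path and arguments passed by a caller are hypothetical:

```java
import com.sun.tools.attach.VirtualMachine;
import com.sun.tools.attach.VirtualMachineDescriptor;
import java.util.ArrayList;
import java.util.List;

public class AttachSketch {
    /** Lists the JVMs visible to the current user as "pid displayName". */
    public static List<String> listJvms() {
        List<String> result = new ArrayList<>();
        for (VirtualMachineDescriptor descriptor : VirtualMachine.list()) {
            result.add(descriptor.id() + " " + descriptor.displayName());
        }
        return result;
    }

    /**
     * Attaches to the target JVM and asks it to load an agent jar.
     * The target then runs the agent's agentmain method inside its own process.
     */
    public static void loadAgent(String pid, String agentJar, String agentArgs) throws Exception {
        VirtualMachine vm = VirtualMachine.attach(pid);
        try {
            vm.loadAgent(agentJar, agentArgs);
        } finally {
            vm.detach(); // detaching leaves the loaded agent running in the target
        }
    }
}
```

Because the agent is loaded into a running process, the application needs no restart and no code change, which is what makes the diagnosis non‑intrusive.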
Diagnostic Command Execution Flow
4. Advanced Arthas Combination Skills
Performance Analysis Golden Combo
# 1. Macro overview
dashboard -i 5000
# 2. Locate CPU hotspots
profiler start
profiler stop --format html
# 3. Trace slow methods
trace *StringUtils substring '#cost>100'
Complex Problem‑Solving Framework
5. Pitfall Avoidance Guide
Three Mandatory Rules
Minimize principle : avoid monitoring everything; target specific packages, e.g., watch com.example.service.* *.
Safety first : never run high‑risk commands in production; when finished, run reset * to restore the enhanced classes and stop to shut down Arthas.
Resource control : limit memory usage with options save-result false and options batch-size 50.
True masters don’t just solve problems; they make problems disappear.
When you wield Arthas like a surgical knife, every online crisis becomes a stage to showcase deep technical expertise.
macrozheng
Dedicated to Java tech sharing and dissecting top open-source projects. Topics include Spring Boot, Spring Cloud, Docker, Kubernetes and more. Author’s GitHub project “mall” has 50K+ stars.
