
Mastering Arthas: Fast, Non‑Intrusive Debugging of Java Production Issues

This guide demonstrates how Arthas enables rapid, non‑intrusive diagnosis of Java production problems by addressing common online challenges such as slow interfaces, thread blocking, memory leaks, hot‑fixes, and data inconsistencies, offering concrete commands, code examples, and best‑practice tips for reliable operations.

macrozheng

Preface

I have been woken by alarms at 3 a.m. and know how painful it is to track down a production problem under pressure.

90% of online issues stem from the "three unknowns": not knowing which part is slow, who is blocked, and why it fails.

This article shows how to use Arthas to quickly locate online problems.

1. Why Conventional Tools Fail Online

Three Special Characteristics of Online Environments

Traditional Tool Limitations

Log loss : critical parameters not printed, cannot reproduce later.

Monitoring lag : 1‑minute granularity misses instantaneous anomalies.

JProfiler paralysis : cannot open when CPU spikes.

Arthas Advantages

# 1 second to attach to production environment
curl -O https://arthas.aliyun.com/arthas-boot.jar
java -jar arthas-boot.jar
# Auto‑detect Java process

2. Five Typical Problem‑Location Scenarios

Scenario 1: Slow Interface

Symptom : 99% of order‑query requests return within 200 ms, but 1% spike to 5 s.

Traditional approach :

// Blindly add log
log.info("Query start: {}", System.currentTimeMillis()); // pollutes log and inefficient

Arthas precise hit :

# 1. Trace internal call path
trace com.example.OrderService getOrderById '#cost>1000' -n 5

Trace output (call tree with the cost of each internal call):

Root cause : occasional TCP timeout in risk‑control service.

Solution :

# Adjust connection timeout
risk:
  client:
    connection-timeout: 500
    read-timeout: 1000
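The YAML keys above are application‑specific. As a rough sketch of what such settings usually translate to in code, assume the risk‑control service is called over HTTP with the JDK's java.net.http.HttpClient. The article does not say which client is actually used, and the class name RiskClientConfig is hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

// Hypothetical wrapper; the article does not show the real risk-control client.
final class RiskClientConfig {
    // Values mirror connection-timeout: 500 and read-timeout: 1000 (milliseconds).
    static final Duration CONNECT_TIMEOUT = Duration.ofMillis(500);
    static final Duration READ_TIMEOUT = Duration.ofMillis(1000);

    // Fail fast if the TCP connection cannot be established in time.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)
                .build();
    }

    // HttpRequest.timeout bounds the wait for the response; it is the closest
    // JDK equivalent to a read timeout in this sketch.
    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(READ_TIMEOUT)
                .GET()
                .build();
    }
}
```

With both limits in place, an occasional TCP stall in the downstream service surfaces as a timeout after roughly a second instead of holding the order query for 5 s.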

Scenario 2: Thread Blocking Mystery

Symptom : Payment callback interface hangs at midnight.

Traditional approach :

jstack <pid> > thread.log # but blocking already ended

Arthas breakthrough :

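# 1. Find the thread that is blocking others
thread -b # reports the holder of a synchronized lock that other threads are waiting on
# 2. Monitor lock contention
watch java.util.concurrent.locks.ReentrantLock getQueueLength returnObj # JDK classes need "options unsafe true" first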

Diagnostic report:

Root cause : Logback synchronous logging blocks business threads.

Solution :

<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
  <queueSize>1024</queueSize>
  <appender-ref ref="FILE"/>
</appender>
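To make the idea behind thread -b more concrete, here is a minimal sketch that lists threads stuck in the BLOCKED state together with the thread that owns the monitor, using the JDK's ThreadMXBean. It is not Arthas's implementation, and the class name BlockedThreadReport is hypothetical. Like thread -b, it only sees contention on synchronized monitors; threads parked on java.util.concurrent locks show up as WAITING instead.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class BlockedThreadReport {
    // Returns one line per thread that is BLOCKED on a monitor,
    // naming the thread that currently owns that monitor.
    static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> report = new ArrayList<>();
        // false, false: skip locked-monitor and synchronizer details to keep the dump cheap.
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info != null && info.getThreadState() == Thread.State.BLOCKED) {
                report.add(info.getThreadName() + " blocked by " + info.getLockOwnerName());
            }
        }
        return report;
    }
}
```

In the logging case described above, a report like this would be expected to show business threads blocked by whichever thread currently holds the appender's lock.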

Scenario 3: Memory Leak Capture

Symptom : Container restarts daily.

Traditional approach :

jmap -histo:live <pid> # triggers Full GC and destroys snapshot

Arthas magic :

# 1. Monitor heap objects
dashboard -i 5000 # refresh every 5 s
# 2. Sample live instances of the suspect class (use the fully qualified class name in practice)
vmtool --action getInstances --className LoginDTO --limit 10

Abnormal finding: [LoginDTO] instances: 245,680 (growth 0.5%/min)

Root cause : ThreadLocal not cleared.

public class UserHolder {
  private static ThreadLocal<LoginDTO> cache = new ThreadLocal<>();
  public static void set(LoginDTO dto) { cache.set(dto); }
  // No remove(): the value stays attached to the pooled thread after the request ends
}

Solution :

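// In UserHolder: expose a cleanup method
public static void remove() { cache.remove(); }

// At the call site
try {
  // business code
} finally {
  UserHolder.remove(); // always clear before the pooled thread is reused
}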
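The leak is easiest to see on a pooled thread. The following sketch runs two tasks on the same worker and returns what the second task finds in the ThreadLocal, with and without the cleanup step. It uses a String in place of LoginDTO, and the class name ThreadLocalLeakDemo is hypothetical.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical demo class; a String stands in for LoginDTO.
final class ThreadLocalLeakDemo {
    private static final ThreadLocal<String> CACHE = new ThreadLocal<>();

    // Runs two tasks on the same pooled thread and returns what the second task sees.
    static String valueSeenByNextTask(boolean cleanUp) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Runnable firstRequest = () -> {
                try {
                    CACHE.set("user-1"); // first request stores its data
                } finally {
                    if (cleanUp) {
                        CACHE.remove(); // the fix: clear before the thread returns to the pool
                    }
                }
            };
            pool.submit(firstRequest).get();

            // The second request reuses the same worker thread.
            Callable<String> secondRequest = CACHE::get;
            return pool.submit(secondRequest).get();
        } finally {
            pool.shutdown();
        }
    }
}
```

Calling valueSeenByNextTask(false) returns the first request's value, which is the stale reference that keeps objects alive; valueSeenByNextTask(true) returns null.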

Scenario 4: Hot‑Fix Code Saves Crash

Symptom : New pagination query causes OOM; rollback takes an hour.

Traditional approach :

Approval process

Merge code

Compile & package

Redeploy → heavy business loss

Arthas rescue :

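# 1. Decompile the problematic class to a local file
jad --source-only com.example.UserService > /tmp/UserService.java
# 2. Edit local file
vi /tmp/UserService.java # fix memory leak
# 3. Compile with the target class loader (hash from: sc -d com.example.UserService)
mc -c 327a3b4 /tmp/UserService.java -d /tmp
# 4. Hot‑update class
redefine -c 327a3b4 /tmp/com/example/UserService.class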

Hot‑update principle diagram:

Scenario 5: Data Inconsistency Mystery

Symptom : Order status shows paid but database not updated.

Arthas investigation :

# 1. Monitor method params/return
watch com.service.OrderService updateStatus "{params,returnObj}" -x 3
# 2. Observe call chain
stack com.service.OrderService updateStatus

Captured abnormal call chain:

updateStatus(OrderStatus.PAID) // correct call
 |- Thread1: payment callback
updateStatus(OrderStatus.CREATED) // abnormal call
 |- Thread2: order query compensation task

Root cause : Compensation task incorrectly overwrites status.

Solution :

// Add state‑machine validation
if (currentStatus != CREATED) {
  throw new IllegalStateException("State rollback prohibited");
}
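A single guard covers this one case. A slightly more general sketch is a transition table that lists which status changes are legal and rejects everything else. Only CREATED and PAID appear in the article; the CLOSED status, the enum layout, and the class name OrderStateMachine are assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical statuses: only CREATED and PAID appear in the article; CLOSED is an assumption.
enum OrderStatus { CREATED, PAID, CLOSED }

final class OrderStateMachine {
    // For each current status, the set of statuses it may move to.
    private static final Map<OrderStatus, Set<OrderStatus>> ALLOWED = new EnumMap<>(OrderStatus.class);

    static {
        ALLOWED.put(OrderStatus.CREATED, EnumSet.of(OrderStatus.PAID, OrderStatus.CLOSED));
        ALLOWED.put(OrderStatus.PAID, EnumSet.of(OrderStatus.CLOSED));
        ALLOWED.put(OrderStatus.CLOSED, EnumSet.noneOf(OrderStatus.class));
    }

    // Returns the new status, or throws if the move is not in the table (e.g. PAID -> CREATED).
    static OrderStatus transition(OrderStatus current, OrderStatus target) {
        if (!ALLOWED.get(current).contains(target)) {
            throw new IllegalStateException("Illegal transition: " + current + " -> " + target);
        }
        return target;
    }
}
```

With this table, the compensation task's attempt to move a PAID order back to CREATED fails loudly, while the payment callback's CREATED to PAID update passes.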

3. Arthas Underlying Principles

Why Non‑Intrusive Diagnosis Works?

Arthas uses the Attach mechanism to inject an agent via VirtualMachine.attach, weaves bytecode with ASM, and isolates classes with a custom ClassLoader to avoid polluting business code.
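The mechanism can be sketched with the JDK's own APIs. The code below is a simplified illustration under stated assumptions, not Arthas's source: AttachSketch, SketchAgent, and TargetOnlyTransformer are hypothetical names, the agent jar path is supplied by the caller, and the transformer returns the class bytes unchanged where a real tool would rewrite them.

```java
import com.sun.tools.attach.VirtualMachine;
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Tool side: runs in a separate JVM and injects an agent jar into the target process.
final class AttachSketch {
    static void attach(String pid, String agentJarPath) throws Exception {
        VirtualMachine vm = VirtualMachine.attach(pid);
        try {
            vm.loadAgent(agentJarPath); // triggers agentmain(...) inside the target JVM
        } finally {
            vm.detach();
        }
    }
}

// Agent side: named by the Agent-Class manifest entry of the jar.
// In a real agent this would be a public class in its own file.
final class SketchAgent {
    public static void agentmain(String args, Instrumentation inst) {
        // true = the transformer may be applied to already-loaded classes;
        // those would additionally need inst.retransformClasses(...).
        inst.addTransformer(new TargetOnlyTransformer(args), true);
    }
}

// Only touches the one class named in the agent arguments; everything else is left alone.
final class TargetOnlyTransformer implements ClassFileTransformer {
    private final String targetInternalName; // e.g. com/example/OrderService

    TargetOnlyTransformer(String targetInternalName) {
        this.targetInternalName = targetInternalName;
    }

    @Override
    public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                            ProtectionDomain protectionDomain, byte[] classfileBuffer) {
        if (!targetInternalName.equals(className)) {
            return null; // null tells the JVM "no change"
        }
        // A real tool would rewrite the bytecode here; the sketch returns it unmodified.
        return classfileBuffer;
    }
}
```

Because the transformer returns null for every class it is not asked to instrument, the application's other classes are never modified, which is the sense in which this style of diagnosis is non‑intrusive.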

Diagnostic Command Execution Flow

4. Advanced Arthas Combination Skills

Performance Analysis Golden Combo

# 1. Macro overview
dashboard -i 5000
# 2. Locate CPU hotspots
profiler start
profiler stop --format html
# 3. Trace slow methods
trace *StringUtils substring '#cost>100'

Complex Problem‑Solving Framework

5. Pitfall Avoidance Guide

Three Mandatory Rules

Minimal scope : avoid monitoring everything; target specific packages, e.g., watch com.example.service.* *.

Safety first : avoid high‑risk commands in production, and when you are done, run reset to restore the enhanced classes and stop to shut Arthas down.

Resource control : bound the cost of each command, for example cap the number of results with -n 5 on watch and trace, and keep options save-result false.

True masters don’t just solve problems; they make problems disappear.

When you wield Arthas like a scalpel, every online crisis becomes a stage to showcase deep technical expertise.
