
Unraveling Java Agent Crashes: JVM, glibc, and Thread‑Local Pitfalls

Starting from mysterious Java Agent errors in Alibaba Cloud environments, this article traces the failure through the JVM’s Instrumentation.appendToSystemClassLoaderSearch call, examines glibc’s stat and iconv conversions, reveals thread‑local storage issues, and presents concrete fixes using pthread TLS and proxy wrappers.

Alibaba Cloud Native

Background

Alibaba Cloud provides multiple Java Agents for users. When several agents are used together, the total agent loading time increases, leading to higher memory and resource consumption. The one-java-agent project was created to coordinate agents and enable more efficient bytecode injection.

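In the premain phase, one-java-agent loads each agent in its own thread, in parallel, so total startup time is bounded by the slowest agent rather than by the sum of all agents (roughly O(n) down to O(1) in the number of agents).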

Problem

During validation of a new agent version, the premain phase of one-java-agent started to throw errors such as:

2022-06-15 06:22:47 [oneagent plugin arms-agent start] ERROR c.a.o.plugin.PluginManagerImpl -start plugin error, name: arms-agent
com.alibaba.oneagent.plugin.PluginException: start error, agent jar::/home/admin/.opt/ArmsAgent/plugins/ArmsAgent/arms-bootstrap-1.7.0-SNAPSHOT.jar
Caused by: java.lang.InternalError: null
    at sun.instrument.InstrumentationImpl.appendToClassLoaderSearch0(Native Method)
    at sun.instrument.InstrumentationImpl.appendToSystemClassLoaderSearch(InstrumentationImpl.java:200)

The same pattern appeared for other plugins (e.g., ahas-java-agent), indicating a failure in Instrumentation.appendToSystemClassLoaderSearch.

Initial Investigation

Adding logs at the JNI entry point of appendToClassLoaderSearch showed that the log entry was missing when the error occurred, suggesting the native code never reached the logging point.

Further debugging on a Dragonwell‑8 container revealed two key observations:

When printf was used, the output appeared only after adding fflush(stdout), confirming that buffered output needed explicit flushing.

The create_class_path_zip_entry function returned NULL because the stat call on the JAR path failed.
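The first observation can be turned into a small habit for native debugging. The sketch below is a hypothetical helper (debug_log is not a name from the JDK sources) that prints one line and flushes immediately. It assumes stdout is a pipe or file, as is typical for container logs, where stdio buffers output instead of writing it line by line.

```c
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical helper (not from the JDK sources): print one debug line and
 * flush stdout right away, so the line is not held in the stdio buffer
 * when stdout is a pipe or file rather than a terminal. */
static int debug_log(const char *fmt, ...) {
    va_list ap;
    va_start(ap, fmt);
    int written = vfprintf(stdout, fmt, ap);
    va_end(ap);
    fflush(stdout);
    return written;  /* number of characters printed, or negative on error */
}
```

A call such as debug_log("path=%s\n", path) at the JNI entry point then shows up in the container log even if the process fails shortly afterwards.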

Root Cause Analysis

The stat call reported “No such file or directory” even though the JAR path existed. Investigation showed that the path string sometimes became empty or corrupted (e.g., ending with .jarSHOT.jar) because the native function convertUft8ToPlatformString wrote overlapping memory.
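A diagnostic along the following lines makes this kind of failure easier to spot. It is a minimal sketch, not the JVM's own code: check_path is a hypothetical name, and the brackets around the path are only there so that an empty or truncated string is visible in the log.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Hypothetical diagnostic: returns 0 if `path` can be stat'ed, otherwise the
 * errno value. The path is printed in brackets so that an empty or
 * corrupted string stands out in the log. */
static int check_path(const char *path) {
    struct stat st;
    if (stat(path, &st) == 0) {
        return 0;
    }
    int err = errno;  /* save errno before calling anything else */
    fprintf(stderr, "stat([%s]) failed: %s\n", path, strerror(err));
    return err;
}
```

Both an empty string and a path with a corrupted tail make stat fail with ENOENT, which matches the "No such file or directory" message even though the intended JAR exists on disk.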

Further digging into convertUft8ToPlatformString revealed that it ultimately uses iconv to convert UTF‑8 to the platform charset. In the container, the environment variable LANG was unset, causing the JVM to fall back to ANSI_X3.4-1968 and invoke iconv.
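The effect of an unset LANG can be checked with a few lines of C. This is a sketch under the assumption that the platform charset is derived from the process locale through nl_langinfo(CODESET); platform_codeset is a hypothetical name used only for illustration.

```c
#include <langinfo.h>
#include <locale.h>

/* Sketch: report the codeset the C library derives from the environment
 * (LC_ALL, LC_CTYPE, LANG). With none of them set, the "C" locale applies. */
static const char *platform_codeset(void) {
    setlocale(LC_CTYPE, "");      /* adopt the locale named by the environment */
    return nl_langinfo(CODESET);  /* glibc reports ANSI_X3.4-1968 for the "C" locale */
}
```

Running this inside the container with and without LANG set (for example to a UTF-8 locale that is installed in the image) shows whether a charset conversion is needed at all.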

Documentation confirms that an iconv_t descriptor opened by iconv_open is not thread‑safe: the same descriptor must not be used by several threads at once. The JVM kept a global iconv_t, so concurrent calls from multiple agent‑loading threads caused race conditions, corrupting the path string and ultimately triggering the InternalError.
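To make the shared state concrete, the sketch below converts a string with a descriptor that is private to each call. This is not the fix adopted in this article; it is the simplest way to avoid sharing a descriptor, shown only to illustrate which objects iconv mutates during a conversion. The name utf8_to_charset and its parameters are hypothetical.

```c
#include <iconv.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: convert a NUL-terminated UTF-8 string to `tocode`
 * using a descriptor private to this call. Returns 0 on success, -1 on error. */
static int utf8_to_charset(const char *tocode, const char *in,
                           char *out, size_t outsize) {
    if (outsize == 0) return -1;
    iconv_t cd = iconv_open(tocode, "UTF-8");  /* iconv_open(tocode, fromcode) */
    if (cd == (iconv_t)-1) return -1;

    char *inbuf = (char *)in;       /* iconv advances this pointer */
    size_t inleft = strlen(in);
    char *outbuf = out;             /* and this one, as output is written */
    size_t outleft = outsize - 1;   /* keep one byte for the terminator */

    size_t rc = iconv(cd, &inbuf, &inleft, &outbuf, &outleft);
    iconv_close(cd);
    if (rc == (size_t)-1) return -1;  /* invalid input or output buffer too small */
    *outbuf = '\0';
    return 0;
}
```

Opening a descriptor per call costs more than reusing one, which is one reason to prefer a per-thread descriptor when conversions are frequent.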

Fixes Implemented

Java‑side Fix

Wrapping every instrumentation call in a synchronized block would eliminate the race, but references to the Instrumentation object are scattered throughout the codebase, so this would require extensive code changes.

Instead, a proxy wrapper (InstrumentationWrapper) was introduced to centralise the lock, reducing the modification surface.

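// Simplified wrapper: every caller goes through one shared lock
import java.lang.instrument.Instrumentation;
import java.util.jar.JarFile;

public class InstrumentationWrapper {
    // One lock for all instances, because the native iconv_t is global
    private static final Object lock = new Object();
    private final Instrumentation delegate;

    public InstrumentationWrapper(Instrumentation delegate) {
        this.delegate = delegate;
    }

    public void appendToSystemClassLoaderSearch(JarFile jar) {
        synchronized (lock) {
            delegate.appendToSystemClassLoaderSearch(jar);
        }
    }
}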

JVM‑side Fix

Because Java‑side locking is costly and the root problem lies in the native implementation, the JVM code was patched to use pthread thread‑local storage for the iconv_t descriptor.

The solution uses the following pthread APIs:

pthread_key_create – creates a thread‑local key.

pthread_setspecific – stores the iconv_t for the current thread.

pthread_getspecific – retrieves it when needed.

pthread_once – ensures the key is created only once.

The destructor passed to pthread_key_create closes the iconv_t when the thread exits, preventing leaks.

The final native code (simplified) looks like:

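#include <iconv.h>
#include <pthread.h>
#include <stdlib.h>

static pthread_key_t iconv_key;
static pthread_once_t once = PTHREAD_ONCE_INIT;

/* Destructor: close the descriptor, then free the per-thread slot holding it. */
static void free_iconv(void *arg) {
    iconv_t *p = arg;
    if (*p != (iconv_t)-1) iconv_close(*p);
    free(p);
}
static void make_key(void) { pthread_key_create(&iconv_key, free_iconv); }

static iconv_t get_iconv(void) {
    pthread_once(&once, make_key);
    iconv_t *p = pthread_getspecific(iconv_key);
    if (!p) {
        p = malloc(sizeof(iconv_t));
        if (!p) return (iconv_t)-1;
        /* iconv_open(tocode, fromcode): convert UTF-8 to the platform charset */
        *p = iconv_open("ANSI_X3.4-1968", "UTF-8");
        pthread_setspecific(iconv_key, p);
    }
    return *p;
}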
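
The per-thread lifetime that the pthread calls listed above provide can be observed in isolation. The following is a self-contained sketch, not part of the JDK patch: it stores a plain int per thread instead of an iconv_t, and all names (thread_slot, slot_destroy, run_workers) are hypothetical. Each worker checks that its slot is never overwritten by another thread, and a counter records how many destructors ran.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

static pthread_key_t slot_key;
static pthread_once_t slot_once = PTHREAD_ONCE_INIT;
static atomic_int destroyed;  /* how many destructors have run */

/* Called at thread exit for every thread that stored a non-NULL value. */
static void slot_destroy(void *p) {
    free(p);
    atomic_fetch_add(&destroyed, 1);
}

static void slot_make_key(void) {
    pthread_key_create(&slot_key, slot_destroy);
}

/* Return the calling thread's private slot, allocating it on first use. */
static int *thread_slot(void) {
    pthread_once(&slot_once, slot_make_key);
    int *p = pthread_getspecific(slot_key);
    if (p == NULL) {
        p = calloc(1, sizeof *p);
        if (p != NULL && pthread_setspecific(slot_key, p) != 0) {
            free(p);
            p = NULL;
        }
    }
    return p;
}

/* Each worker writes its id into its slot and checks nobody overwrote it. */
static void *worker(void *arg) {
    int id = (int)(intptr_t)arg;
    int *slot = thread_slot();
    if (slot == NULL) return NULL;
    *slot = id;
    for (int i = 0; i < 100000; i++) {
        if (*thread_slot() != id) return NULL;  /* would mean shared state */
    }
    return arg;
}

/* Start up to 16 workers; return how many kept a consistent private slot. */
static int run_workers(int n) {
    pthread_t tids[16];
    int started = 0, ok = 0;
    if (n > 16) n = 16;
    for (; started < n; started++) {
        if (pthread_create(&tids[started], NULL, worker,
                           (void *)(intptr_t)(started + 1)) != 0) break;
    }
    for (int i = 0; i < started; i++) {
        void *res = NULL;
        pthread_join(tids[i], &res);
        if (res != NULL) ok++;
    }
    return ok;
}
```

After run_workers(4) returns, the destroyed counter should equal the number of threads that were started, since each thread's destructor runs before pthread_join reports its exit. Note that the main thread's destructor only runs if it exits through pthread_exit.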

After rebuilding the JDK with this patch, rebuilding the Docker image, and redeploying the pods, the original crash disappeared.

Conclusion

The investigation progressed from Java‑level logs to JNI, glibc, and pthread internals, uncovering three main pitfalls:

printf output is buffered, so debug logs in a container may not appear until fflush is called.

A missing or incorrect LANG environment variable changes character‑set handling.

An iconv_t descriptor is not thread‑safe, and a global iconv_t leads to race conditions.

By applying a Java‑side proxy wrapper and a native pthread‑TLS fix, the one‑java‑agent project now loads agents reliably in cloud‑native containers.

