Production Hit by Silent Data Corruption: JDK 25 G1GC Bug Explained

A rare silent data‑corruption bug in JDK 25’s G1GC caused Parquet and ORC files written by Spark and Flink to become unreadable. A multi‑stage investigation traced the issue to a flaw in G1’s Optional Evacuation that moves JNI‑pinned objects. The flaw was later fixed in the OpenJDK community, and the fix was back‑ported to JDK 25.

dbaplus Community

During a gradual (gray‑scale) rollout of JDK 25 LTS on Ctrip’s large‑scale Spark and Flink clusters, engineers observed that Parquet and ORC files occasionally became unreadable even though the writes succeeded and CRC checks passed. Downstream readers failed with Zstd decompression errors such as "Src size is incorrect", "Destination buffer is too small", and "Corrupted block detected".

Impact

The bug affects all released JDK 25 versions (25.0.0, 25.0.1, 25.0.2) when G1GC (the default GC) is enabled. Switching to -XX:+UseParallelGC or -XX:+UseZGC avoids the issue. A fix is expected in the 25.0.3 release (≈2026‑04‑21).
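As a guard for the impact described above, a service could check at startup whether it is running on an affected combination. The following is a minimal sketch, not an official detection method: the class name `G1PinningBugGuard` is hypothetical, the version range (25.0.0 to 25.0.2) is taken from this article, and it assumes the runtime reports its version in the standard feature/update form and that G1's collector MXBeans carry names starting with "G1 " (for example "G1 Young Generation").

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.List;

/** Startup guard sketch: flags JDK 25.0.0 to 25.0.2 running with G1 (range taken from the article). */
final class G1PinningBugGuard {

    /** Pure decision logic, kept separate so it can be tested without launching each JVM version. */
    static boolean isAffected(int feature, int update, List<String> gcBeanNames) {
        // Assumption: 25.0.3 and later carry the fix, as the article expects.
        boolean affectedVersion = feature == 25 && update < 3;
        // Assumption: G1 collector beans are named "G1 Young Generation", "G1 Old Generation", ...
        boolean usesG1 = gcBeanNames.stream().anyMatch(name -> name.startsWith("G1 "));
        return affectedVersion && usesG1;
    }

    /** Reads the version and the active collectors from the running JVM. */
    static boolean currentJvmIsAffected() {
        Runtime.Version version = Runtime.version();
        List<String> names = ManagementFactory.getGarbageCollectorMXBeans().stream()
                .map(GarbageCollectorMXBean::getName)
                .toList();
        return isAffected(version.feature(), version.update(), names);
    }

    public static void main(String[] args) {
        if (currentJvmIsAffected()) {
            System.err.println("WARNING: JDK 25 (< 25.0.3) with G1 detected; "
                    + "consider -XX:+UseParallelGC or -XX:+UseZGC until the fix is installed.");
        } else {
            System.out.println("JVM version/GC combination is not in the affected set.");
        }
    }
}
```

Keeping the decision in a pure function makes the rule easy to adjust if a vendor build ships the fix under a different update number.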

Root Cause

G1GC’s Optional Evacuation phase incorrectly moves objects that are pinned by JNI (e.g., via GetPrimitiveArrayCritical / ReleasePrimitiveArrayCritical). Native compression libraries such as zstd‑jni write directly to a Java array that has been pinned; G1GC may relocate the array, causing the native code to write to a stale address. This produces silent data corruption that only surfaces during later decompression.
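The stale-address effect can be illustrated with a toy model in plain Java. This is only an analogy under stated assumptions, not HotSpot or JNI code: the `Handle` class stands in for the Java-side reference, the local `rawAddress` stands in for the pointer a critical-section call hands to native code, and `relocate` stands in for a collector moving the object. All of these names are hypothetical.

```java
import java.util.Arrays;

/** Toy model of a write through a stale address; it does not touch real GC or JNI machinery. */
final class StaleWriteToyModel {

    /** Java-side handle; 'storage' stands in for the object's current heap location. */
    static final class Handle {
        byte[] storage;

        Handle(int size) {
            storage = new byte[size];
        }
    }

    /** Simulates a moving collector: copy the object and repoint the Java-side handle. */
    static void relocate(Handle handle) {
        handle.storage = Arrays.copyOf(handle.storage, handle.storage.length);
    }

    /** Returns what the Java side sees after a native-style write through the raw address. */
    static byte[] writeThroughRawAddress(boolean relocatedWhilePinned) {
        Handle output = new Handle(4);
        byte[] rawAddress = output.storage;     // what native code received when it pinned the array
        if (relocatedWhilePinned) {
            relocate(output);                   // the move a collector must never make on a pinned object
        }
        Arrays.fill(rawAddress, (byte) 0x7F);   // "native" code writes its compressed bytes
        return output.storage;                  // Java later reads from the current location
    }

    public static void main(String[] args) {
        System.out.println("no relocation:   " + Arrays.toString(writeThroughRawAddress(false)));
        System.out.println("with relocation: " + Arrays.toString(writeThroughRawAddress(true)));
    }
}
```

In the relocated case the write succeeds without any error, yet the Java side never sees the bytes, which mirrors why the corruption stays silent until a reader tries to decompress.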

Evidence

Affected components include zstd‑jni, the JDK’s built‑in java.util.zip library, and any native compression/encryption code that uses JNI critical sections.

Corrupted columns were identified with SELECT SUM(HASH(STRUCT(colX))). CRC checks passed, and only the corrupted columns triggered Zstd errors.

Reproduction Steps

Identify corrupted columns using SELECT SUM(HASH(STRUCT(colX))) on the problematic files.

Enable parquet.page.verify-checksum.enabled=true and run the Parquet CLI checksum verification to rule out storage‑medium issues.

Observe the Zstd decompression errors in the logs, then reproduce them outside the JVM with zstd -d and crc32 checks.
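The steps above show checksums passing while decompression fails. One plausible reading, which the source does not confirm, is that the checksum is computed over bytes that are already corrupted, so it only proves the file matches what was written. A round-trip check at write time catches what a checksum cannot. The sketch below uses the JDK's `java.util.zip` classes as a stand-in for zstd‑jni. The class name `WriteTimeVerifier` and the simulated single-byte corruption are assumptions for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.CRC32;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Sketch: a checksum over compressed bytes can pass while a decompress-and-compare check fails. */
final class WriteTimeVerifier {

    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!deflater.finished()) {
                out.write(buffer, 0, deflater.deflate(buffer));
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    static byte[] decompress(byte[] compressed) throws DataFormatException {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && !inflater.finished()) {
                    throw new DataFormatException("stream ended before completion");
                }
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } finally {
            inflater.end();
        }
    }

    /** True only if the compressed bytes decode back to exactly the original input. */
    static boolean roundTripMatches(byte[] original, byte[] compressed) {
        try {
            return Arrays.equals(original, decompress(compressed));
        } catch (DataFormatException e) {
            return false;
        }
    }

    static long crc32(byte[] bytes) {
        CRC32 crc = new CRC32();
        crc.update(bytes);
        return crc.getValue();
    }

    public static void main(String[] args) {
        byte[] page = new byte[512];
        new Random(42).nextBytes(page);              // deterministic stand-in for a column page
        byte[] corrupted = compress(page);
        corrupted[corrupted.length / 2] ^= 0x5A;     // simulated in-memory corruption before the write
        long crcAtWrite = crc32(corrupted);          // writer checksums the bytes it actually holds
        long crcAtRead = crc32(corrupted);           // reader recomputes over the same stored bytes
        System.out.println("CRC matches:        " + (crcAtWrite == crcAtRead));
        System.out.println("round trip matches: " + roundTripMatches(page, corrupted));
    }
}
```

The round-trip check costs an extra decompression per buffer, so it fits better as a temporary diagnostic or a sampled check than as a permanent part of the write path.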

Investigation Path

Suspected factors: JDK 25 compact object headers, compression library compatibility, OS/kernel versions.

Version bisecting showed the bug appears in 25.0.0‑25.0.2 but not in 21, 23, 24, or early 25 builds.

Testing different GC algorithms revealed the issue only with G1GC; ParallelGC and ZGC behaved correctly.
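The version bisecting step in this investigation amounts to a binary search over an ordered list of builds. A minimal sketch follows, assuming the builds are ordered oldest to newest and that every build after the first bad one is also bad. The class name `BuildBisect` and the build labels are hypothetical. Because the bug is intermittent, a real `reproduces` check would need many write/read trials per build before declaring it good.

```java
import java.util.List;
import java.util.function.Predicate;

/** Sketch of bisecting an ordered build list to find the first build that reproduces a bug. */
final class BuildBisect {

    /**
     * Returns the index of the first build for which 'reproduces' is true, or -1 if none does.
     * Assumes builds are ordered so that all good builds come before all bad builds.
     */
    static <T> int firstBad(List<T> builds, Predicate<T> reproduces) {
        int lo = 0;
        int hi = builds.size();   // builds before lo are known good; builds from hi on are known bad
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (reproduces.test(builds.get(mid))) {
                hi = mid;         // mid is bad: the first bad build is mid or earlier
            } else {
                lo = mid + 1;     // mid is good: the first bad build is later
            }
        }
        return lo < builds.size() ? lo : -1;
    }

    public static void main(String[] args) {
        // Hypothetical labels; a real run would map each label to a JDK build to install and test.
        List<String> builds = List.of("build-A", "build-B", "build-C", "build-D", "build-E");
        // Stand-in predicate: pretend the bug first appears in build-D.
        int index = firstBad(builds, label -> label.compareTo("build-D") >= 0);
        System.out.println("first bad build: " + (index >= 0 ? builds.get(index) : "none"));
    }
}
```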

Commit‑Level Analysis

Two OpenJDK commits were identified:

JDK‑8343782 ("G1: Use one G1CardSet instance for multiple old gen regions") introduced the bug; the offending commit ID is 86cec4ea. The change entered the JDK 25 line during development (fix version 25, resolved in build b10). URL: https://bugs.openjdk.org/browse/JDK-8343782

JDK‑8370807 ("Improve region attribute table method naming") fixed the bug by correcting the handling of has_pinned_objects() during Optional Evacuations. The fix is present in JDK 26 (fix version 26, resolved in build b22); its backport to JDK 25 is tracked separately as JDK‑8377811. URL: https://bugs.openjdk.org/browse/JDK-8370807

G1GC Mechanics

G1GC divides the heap into equal‑sized regions and performs Young GC and Mixed GC. Within a collection it can also run Optional Evacuations (introduced in JEP 344), which keep pauses within the target by evacuating additional old‑gen regions only while pause‑time budget remains. The buggy commit caused G1GC to ignore the has_pinned_objects flag during this phase, so regions holding JNI‑pinned objects could be evacuated and the pinned objects moved.
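The selection rule described above can be sketched as a toy model. This is not HotSpot code: the `Region` record, the cost estimate, and the `honorPinning` switch are hypothetical, and only the idea of a per-region pinned flag comes from the article. With `honorPinning` set to false the model reproduces the faulty behavior, where a pinned region ends up among the regions to evacuate.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of choosing optional regions under a pause budget; not the real G1 algorithm. */
final class OptionalEvacuationToyModel {

    /** Hypothetical region summary: an id, a pinned flag, and an estimated evacuation cost. */
    record Region(int id, boolean hasPinnedObjects, long estimatedCopyMillis) {}

    /** Picks optional regions in order while budget remains, skipping pinned ones when asked to. */
    static List<Region> selectOptional(List<Region> candidates, long remainingPauseMillis,
                                       boolean honorPinning) {
        List<Region> chosen = new ArrayList<>();
        long budget = remainingPauseMillis;
        for (Region region : candidates) {
            if (honorPinning && region.hasPinnedObjects()) {
                continue;   // a pinned region must stay in place, whatever the budget allows
            }
            if (region.estimatedCopyMillis() > budget) {
                break;      // out of pause budget: stop taking optional regions
            }
            chosen.add(region);
            budget -= region.estimatedCopyMillis();
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Region> candidates = List.of(
                new Region(1, false, 3),
                new Region(2, true, 2),    // holds an array pinned by a JNI critical section
                new Region(3, false, 4),
                new Region(4, false, 10));
        System.out.println("correct: " + selectOptional(candidates, 10, true));
        System.out.println("buggy:   " + selectOptional(candidates, 10, false));
    }
}
```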

Resolution and Backport

The OpenJDK community accepted a backport of JDK‑8377811 ("G1: Optional Evacuations may evacuate pinned objects") to JDK 25. The fix will be included in the 25.0.3 release, eliminating the silent corruption when G1GC is used. URL: https://bugs.openjdk.org/browse/JDK-8377811

Post‑mortem

In large YARN clusters, low executor reuse and ample memory hide the bug; in smaller clusters with high executor reuse, the bug surfaces more readily.

The bug’s impact extends beyond zstd‑jni to any JNI‑based compression (e.g., the JDK’s built‑in Zip/Deflate library).

Configuration Matrix (selected)

JDK 25 + -XX:+UseG1GC → Data corruption (✗)

JDK 25 + -XX:+UseParallelGC → No corruption (✓)

JDK 25 + -XX:+UseZGC → No corruption (✓)

JDK 21 + -XX:+UseG1GC → No corruption (✓)

JDK 25 (commit 86cec4ea) + G1GC → Corruption confirmed

JDK 25 (commit 006ed5c0) + G1GC → No corruption

Additional References

zstd‑jni issue report: https://github.com/luben/zstd-jni/issues/377

OpenJDK backport pull request: https://github.com/openjdk/jdk25u-dev/pull/272

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
