
Why Upgrading to JDK 25 Broke Spark & Flink Data – Inside the G1GC Bug and Its Fix

During a gray release of JDK 25 on Ctrip's massive Spark and Flink clusters, silent data corruption appeared in Parquet and ORC files. It was traced to a G1GC Optional Evacuation bug that moved JNI‑pinned objects; the fix has been back‑ported and is slated for JDK 25.0.3.

Ctrip Technology

Background

Ctrip’s production environment runs large‑scale Spark and Flink clusters on JDK 21. To leverage JDK 25 LTS features such as Compact Object Headers (JEP 519) for lower memory usage and better GC performance, a migration plan was launched: multiple engines were adapted to JDK 25 and a gray rollout began.

Impact and Why the Bug Is Dangerous

The bug affects all released JDK 25 versions (25.0.0, 25.0.1, 25.0.2) when G1GC, the default collector, is enabled. It silently corrupts data at write time, raising no exceptions; the corruption only surfaces when downstream jobs attempt to read the files. Switching to ParallelGC or ZGC avoids the issue, and a fix is expected in the upcoming 25.0.3 release.
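Until 25.0.3 is available, the GC can be switched per engine. A minimal sketch of the relevant configuration keys (standard Spark and Flink options; merge these with any JVM options your deployment already sets):

```
# Spark: run driver and executors with ParallelGC instead of the G1 default
--conf spark.driver.extraJavaOptions=-XX:+UseParallelGC
--conf spark.executor.extraJavaOptions=-XX:+UseParallelGC

# Flink (flink-conf.yaml): same idea for the JobManager and TaskManager JVMs
env.java.opts.jobmanager: -XX:+UseParallelGC
env.java.opts.taskmanager: -XX:+UseParallelGC
```

ZGC (-XX:+UseZGC) works the same way if its latency profile suits the workload better.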

The root cause is that G1GC’s Optional Evacuation phase incorrectly moves objects that are pinned by JNI (via GetPrimitiveArrayCritical / ReleasePrimitiveArrayCritical), leading to memory‑address mismatches for native compression libraries such as zstd‑jni and the built‑in java.util.zip implementation.

Symptom: Decompression Errors

Jobs complete without error, but downstream reads of specific columns trigger Zstd‑related exceptions, for example:

Caused by: com.github.luben.zstd.ZstdException: Src size is incorrect
    at com.github.luben.zstd.ZstdDecompressCtx.decompressByteArray(ZstdDecompressCtx.java:205)
    ...
Caused by: java.io.IOException: Decompression error: Destination buffer is too small
    at com.github.luben.zstd.ZstdInputStreamNoFinalizer.readInternal(ZstdInputStreamNoFinalizer.java:171)
    ...

File‑Level Analysis

Using Parquet’s page checksum (parquet.page.write-checksum.enabled=true) and verification (parquet.page.verify-checksum.enabled=true) showed no storage‑level corruption, confirming the issue originates inside the JVM rather than in the storage layer.
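These two properties can be enabled from a Spark job without code changes; a sketch using the spark.hadoop.* prefix, which forwards settings into the Hadoop configuration that parquet-mr reads (an assumed deployment style; adapt to however your jobs set Hadoop properties):

```shell
spark-submit \
  --conf spark.hadoop.parquet.page.write-checksum.enabled=true \
  --conf spark.hadoop.parquet.page.verify-checksum.enabled=true \
  ...
```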

Local inspection of the corrupted Zstd streams with the zstd -d tool reported “Data corruption detected”, while the CRC32 recorded in the page header matched the corrupted bytes. This indicates the corruption occurred after compression but before write‑out: the checksum was computed over data that was already corrupt.
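The same check can be run locally against an extracted page; a sketch using the standard zstd CLI (the file name here is hypothetical):

```shell
# attempt a full decode; an affected page fails with "Data corruption detected"
zstd -d -c corrupted_page.zst > /dev/null

# or run zstd's built-in integrity test
zstd -t corrupted_page.zst
```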

Suspected Causes

JDK 25 Compact Object Header changes affecting memory layout.

Incompatible native compression libraries (zstd‑jni, ORC, Parquet) with JDK 25.

Operating‑system or kernel incompatibilities.

Other unknown interactions.

Reproduction Attempts

Selected Spark jobs that had exhibited intermittent failures were re‑run on both physical clusters and Docker‑based clusters. The bug reproduced consistently on Docker clusters across differing OS/kernel versions, but not on larger YARN clusters, where executors were reused less heavily.

Code‑Level Experiments

Various attempts were made, such as enabling the Zstd checksum in ORC (zstdCompressCtx.setCheckSum(true)) and implementing a compress‑then‑decompress validation loop, but none isolated the root cause.
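A compress‑then‑decompress validation guard of this kind can be sketched as follows. Production code used zstd‑jni; this sketch substitutes the JDK's built‑in java.util.zip (which, per the root cause below, pins arrays through the same JNI critical‑section mechanism). The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class CompressVerify {

    // Compress `input`, immediately decompress the result, and compare it with
    // the original. Returns the compressed bytes only if the round trip matches.
    static byte[] compressVerified(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[input.length * 2 + 64]; // head-room for incompressible input
        int clen = 0;
        while (!deflater.finished()) {
            clen += deflater.deflate(buf, clen, buf.length - clen);
        }
        deflater.end();

        Inflater inflater = new Inflater();
        inflater.setInput(buf, 0, clen);
        byte[] out = new byte[input.length];
        int dlen = 0;
        try {
            while (!inflater.finished()) {
                dlen += inflater.inflate(out, dlen, out.length - dlen);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("decompression failed: stream is corrupt", e);
        }
        inflater.end();

        if (dlen != input.length || !Arrays.equals(input, out)) {
            throw new IllegalStateException("round-trip mismatch: compressed data is corrupt");
        }
        return Arrays.copyOf(buf, clen);
    }

    public static void main(String[] args) {
        byte[] data = "repeated payload ".repeat(200).getBytes();
        byte[] compressed = compressVerified(data);
        System.out.println("verified: " + data.length + " -> " + compressed.length + " bytes");
    }
}
```

Note the limitation the investigation ran into: because the corruption happens while the compressor's JNI critical section is live, a guard like this may itself read the already‑corrupted buffer and pass.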

JDK‑Level Experiments

Disabling Compact Object Headers did not resolve the issue.

All JDK 25 releases (25.0.0‑25.0.2) reproduced the bug; JDK 21‑24 did not.

Switching GC to ParallelGC or ZGC avoided corruption, pointing to G1GC as the culprit.

A custom GitHub Actions workflow was built to compile JDK 25 from specific commits, enabling a binary‑level bisect.
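The article does not reproduce the workflow itself; the following is a hypothetical minimal sketch of such a workflow, assuming a Linux runner, a Temurin boot JDK, and the standard OpenJDK build steps (bash configure followed by make images). All names and versions here are illustrative.

```yaml
name: build-jdk-at-commit
on:
  workflow_dispatch:
    inputs:
      commit: { description: "openjdk/jdk commit to build", required: true }
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: git clone https://github.com/openjdk/jdk.git && cd jdk && git checkout ${{ github.event.inputs.commit }}
      - uses: actions/setup-java@v4
        with: { distribution: temurin, java-version: 24 }   # boot JDK for the build
      - run: sudo apt-get update && sudo apt-get install -y autoconf build-essential libx11-dev libxext-dev libxrender-dev libxrandr-dev libxtst-dev libxt-dev libcups2-dev libfontconfig1-dev libasound2-dev
      - run: cd jdk && bash configure --with-boot-jdk="$JAVA_HOME" && make images
      - uses: actions/upload-artifact@v4
        with: { name: jdk-image, path: jdk/build/*/images/jdk }
```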

Root‑Cause Identification

Bisecting identified commit 86cec4ea (JDK‑8343782) as introducing the defect: “G1: Use one G1CardSet instance for multiple old gen regions”. This change caused G1GC to move regions containing JNI‑pinned objects during Optional Evacuation.
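The bisect itself follows the standard git workflow; a sketch over the openjdk/jdk repository, assuming the jdk-24-ga and jdk-25-ga tags as the good and bad endpoints, with a custom-built JDK and the reproducer job run at each step:

```shell
git clone https://github.com/openjdk/jdk.git && cd jdk
git bisect start
git bisect bad  jdk-25-ga    # a build that corrupts data
git bisect good jdk-24-ga    # a build that does not
# at each step: build the candidate JDK, run the reproducer, then mark it
git bisect good              # or: git bisect bad
```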

The chain of events:

Native code (zstd‑jni) obtains a direct pointer to a Java byte array via GetPrimitiveArrayCritical, marking the region as has_pinned_objects.

During G1GC Optional Evacuation, the region is mistakenly moved because the pinned flag is ignored.

The native compressor continues writing to the original address, corrupting the byte array silently.
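The failure mode above can be illustrated with a minimal stress sketch (not Ctrip's actual reproducer): repeatedly deflate and inflate byte arrays while churning the heap, so that G1 evacuation runs while java.util.zip holds JNI critical sections on the arrays. On an affected JDK 25 build a loop of this shape can surface round‑trip mismatches; on a fixed JDK it should never fail.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class G1PinStress {

    // Deflate `input`, inflate the result, and check the bytes survived intact.
    static boolean roundTripOk(byte[] input) {
        Deflater d = new Deflater();
        d.setInput(input);
        d.finish();
        byte[] c = new byte[input.length * 2 + 64];
        int clen = 0;
        while (!d.finished()) clen += d.deflate(c, clen, c.length - clen);
        d.end();

        Inflater inf = new Inflater();
        inf.setInput(c, 0, clen);
        byte[] out = new byte[input.length];
        int dlen = 0;
        try {
            while (!inf.finished()) dlen += inf.inflate(out, dlen, out.length - dlen);
        } catch (DataFormatException e) {
            return false; // stream itself is corrupt
        }
        inf.end();
        return dlen == input.length && Arrays.equals(input, out);
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        byte[][] garbage = new byte[64][]; // retained briefly to encourage promotion
        for (int i = 0; i < 500; i++) {
            byte[] payload = new byte[32 * 1024];
            rnd.nextBytes(payload);
            garbage[i % garbage.length] = new byte[128 * 1024]; // heap churn
            if (!roundTripOk(payload)) {
                throw new IllegalStateException("corruption detected at iteration " + i);
            }
        }
        System.out.println("no corruption observed in 500 iterations");
    }
}
```

As the reproduction attempts section noted, a trigger also depends on GC timing and region reuse, so a short loop like this passing does not prove a build is unaffected.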

Why the Fix Works

Commit JDK‑8370807 (“Improve region attribute table method naming”) corrected the logic that checks has_pinned_objects() during Optional Evacuation, ensuring pinned regions are respected and not moved. This fix was back‑ported to JDK 25 (JDK‑8377811) and slated for 25.0.3.

Post‑mortem

The bug’s impact extends beyond zstd‑jni to any JNI code that pins arrays, including the built‑in Zip/Deflate library. It manifests as silent data loss, making it especially hazardous for large‑scale data pipelines.

AI‑Assisted Debugging

Multiple AI tools were employed throughout the investigation:

Zstd log analysis: AI extracted key patterns from thousands of log lines, narrowing the focus to compression failures.

Binary data analysis: AI parsed Zstd frames and generated a Python recovery script.

JDK build automation: GitHub Copilot and GitHub Agents generated a workflow to compile JDK variants with specific commit IDs.

Commit search: AI‑powered code search helped locate the offending G1GC changes quickly.

The combined AI assistance accelerated the root‑cause identification and verification process.

Conclusion

The investigation revealed a subtle G1GC optimization bug that broke JNI‑pinned memory handling, leading to silent corruption of compressed data in Spark and Flink workloads. The issue was isolated to JDK 25’s default G1GC, fixed in back‑ported patches, and will be resolved in the upcoming 25.0.3 release. Users can mitigate the problem today by switching to ParallelGC or ZGC, or by upgrading to a JDK version where the fix is present.
