Advanced Paimon Production Issues: 10 Rare Compaction‑Related Problems and Fixes

This article enumerates ten uncommon, compaction‑related problems encountered in large‑scale Paimon deployments, explains their root causes—such as RPC timeouts, snapshot expiration, file corruption, and write conflicts—and provides concrete configuration tweaks and operational steps to resolve each issue.

Big Data Technology & Architecture

May 26, 2026

Paimon has become an indispensable component of lake‑warehouse architectures and serves as the vector‑storage foundation in the AI era. The list below documents about ten rarely seen problems from deep production use, most of them tied to compaction, and gives a root‑cause analysis and remediation steps for each.

1. Ask timed out Exception

This occurs in dedicated compaction jobs when the OperatorCoordinator of the Paimon compact source fails to deliver split-assignment events to a subtask within 180 seconds (the Akka ask timeout), which causes event loss and a job failover. It surfaces under massive real‑time writes that generate many small files, which overload TaskManager nodes or cause network jitter.

Diagnosis: look for frequent GC on the TaskManagers and for overly high parallelism on the compact source. Mitigation: increase the Akka ask timeout (akka.ask.timeout in Flink, renamed to pekko.ask.timeout from Flink 1.18 onward).

2. Expired Snapshot

java.lang.RuntimeException: Cannot find snapshot for scan.snapshot-id = 12345, it may have been expired by another compaction job

Root cause: a read job is still consuming an old snapshot that a compaction or snapshot-expiration job has already deleted.

Solution: enlarge snapshot.time-retained (e.g., to 24h) so that snapshots needed by readers are not removed prematurely.
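As a minimal sketch, the retention window can be widened per table from Flink SQL. The table name my_db.orders is hypothetical, and 24 h is only the example value mentioned above; the right window is whatever covers your slowest reader.

```sql
-- Hypothetical table name; adjust to your own catalog and database.
-- Keep snapshots for 24 hours so that slow readers can still find
-- the snapshot they started from.
ALTER TABLE my_db.orders SET (
  'snapshot.time-retained' = '24 h'
);
```

A longer retention window keeps more snapshots and their data files on storage, so it trades disk usage for reader safety.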

3. ORC/Parquet File Corruption

java.io.IOException: Error reading ORC file: hdfs://xxx/bucket-0/data-xxx.orc
Caused by: java.lang.ArrayIndexOutOfBoundsException

Root cause: a TaskManager is killed or runs out of memory in the middle of a write, leaving an incomplete file; alternatively, HDFS has lost block replicas.

Solution: the paimon-compact job skips corrupted files; manually run ALTER TABLE ... RESET PARTITION and reprocess historical data.

4. Concurrent Write Conflict on Bucket

java.lang.RuntimeException: Multiple concurrent write to the same bucket detected!
bucket = 5, partition = [2026-05-20]

Root cause: multiple Flink jobs (e.g., main job + back‑fill job) write to the same bucket of a Paimon table.

Solution: ensure a single writer per partition + bucket; run back‑fill in an isolated partition or use overwrite mode.
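One way to keep a back‑fill away from the live writer is to overwrite a closed partition in a separate batch job. The sketch below assumes hypothetical tables my_db.orders (target, partitioned by dt) and my_db.orders_backfill_src (recomputed data), hypothetical columns order_id and amount, and that the streaming job no longer writes to the partition being replaced.

```sql
-- Assumption: INSERT OVERWRITE is run as a batch job.
SET 'execution.runtime-mode' = 'batch';

-- Replace only the 2026-05-19 partition. The partition column dt is
-- fixed in the PARTITION clause, so it is left out of the SELECT list.
INSERT OVERWRITE my_db.orders PARTITION (dt = '2026-05-19')
SELECT order_id, amount
FROM my_db.orders_backfill_src
WHERE dt = '2026-05-19';
```

If the streaming job can still receive late data for that partition, the two writers will collide again, so the back‑fill window should only cover partitions that are truly closed.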

5. Manifest Explosion (OOM)

java.lang.OutOfMemoryError: Java heap space
 at org.apache.paimon.manifest.ManifestFile.read(...)

Root cause: excessive small files and partition explosion cause manifest metadata to grow to gigabyte scale.

Solution: lower manifest.merge-min-count so that small manifest files are merged sooner, increase compaction frequency to reduce the number of small files, and clean up old partitions.
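A rough sketch of the two table-level steps follows. The table name my_db.events, the partition column dt, and the value 10 are all illustrative assumptions, not recommended settings.

```sql
-- Hypothetical table; the threshold value is illustrative only.
-- A lower threshold lets Paimon merge small manifest files earlier.
ALTER TABLE my_db.events SET (
  'manifest.merge-min-count' = '10'
);

-- Drop a partition that is no longer needed, which also shrinks
-- the metadata that has to be read when planning a scan.
ALTER TABLE my_db.events DROP PARTITION (dt = '2025-01-01');
```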

6. Compaction Never Finishes

WARN  CompactManager - Compaction task for bucket 3 has been running for 1800s, exceeding compaction.max-timeout

Root cause: a single bucket holds tens of gigabytes, making one compaction round read/write massive data; or a sort‑compaction triggers full re‑ordering.

Solution: increase the number of buckets to disperse data, reduce the compaction range via full-compaction.delta-commits, and consider the lookup engine instead of deduplicate.
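For a fixed-bucket table, raising the bucket count is a two-step operation: change the option, then rewrite existing partitions so their files follow the new layout. The sketch below assumes a hypothetical table my_db.orders partitioned by dt with columns order_id and amount, an illustrative bucket count of 32, and that the streaming writer has been stopped before the change.

```sql
-- Step 1 (assumption: the streaming writer is stopped first).
ALTER TABLE my_db.orders SET ('bucket' = '32');

-- Step 2: rewrite an existing partition in batch mode so that its
-- data is redistributed across the new number of buckets.
SET 'execution.runtime-mode' = 'batch';

INSERT OVERWRITE my_db.orders PARTITION (dt = '2026-05-20')
SELECT order_id, amount
FROM my_db.orders
WHERE dt = '2026-05-20';
```

Step 2 is repeated for each partition that still needs to be written or compacted; partitions that are only read can usually be left in the old layout.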

7. Missing Changelog Files

java.io.FileNotFoundException: changelog-bucket-0-xxx.orc (No such file or directory)
Caused by: org.apache.paimon.fs.FileNotFoundException

Root cause: downstream CDC consumes changelog files that have been removed by snapshot expiration.

Solution: increase changelog.time-retained; if downstream latency is high, set changelog-producer = full-compaction to reduce file count.
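Both settings can be applied as table options. This is a sketch with a hypothetical table name and illustrative values (48 h, 5 commits); changing the changelog producer on a live table should be tested on a copy first.

```sql
-- Hypothetical table; keep changelog files longer than the worst
-- expected downstream lag.
ALTER TABLE my_db.orders SET (
  'changelog.time-retained' = '48 h'
);

-- Optional: produce changelog only on full compaction, triggered
-- every 5 commits here (illustrative value), to cut the file count.
ALTER TABLE my_db.orders SET (
  'changelog-producer' = 'full-compaction',
  'full-compaction.delta-commits' = '5'
);
```

With the full-compaction producer, changelog becomes visible only after each full compaction, so downstream freshness depends on the delta-commits interval.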

8. Schema Evolution Mismatch

java.lang.IllegalStateException: Column 'new_col' does not exist in table schema.
Current schema id = 5, required schema id = 8

Root cause: after a Flink job restart, the reader obtains a newer schema version while older data files lack the new column.

Solution: ensure schema.evolution.enabled=true and avoid pinning an old schema‑id on reader restart.

9. Orphan Files Accumulation

WARN StorageCleanup - Found 128456 orphan files under /table/bucket-0/ that are not referenced by any snapshot

Root cause: writer failures leave uncommitted temporary files (e.g., .staging-xxx) that are not cleaned.

Solution: periodically run

CALL sys.remove_orphan_files(`table` => 'db.table', older_than => '2026-01-01 00:00:00')

10. Commit Retry Exhausted

org.apache.paimon.operation.FileStoreCommit$CommitConflictException: Commit conflict detected after 10 retries, aborting

Root cause: high‑concurrency writes combined with frequent compactions cause optimistic‑lock conflicts beyond the retry limit.

Solution: raise the commit retry limit (the option is named commit.max-retries in current Paimon releases; default 10, for example to 50), lower the compaction submission frequency, and run writing and compaction as separate jobs.
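Separating writing from compaction can be sketched as follows: the ingestion job is switched to write-only, and a dedicated job performs compaction. The table name is hypothetical, and the CALL syntax with named arguments assumes a Flink version that supports it (1.18 or later).

```sql
-- Hypothetical table. With write-only enabled, the ingestion job
-- skips compaction and snapshot expiration, so it commits less often
-- against the same snapshot chain as the compactor.
ALTER TABLE my_db.orders SET ('write-only' = 'true');

-- Run compaction from a separate, dedicated job.
CALL sys.compact(`table` => 'my_db.orders');
```

The ingestion job has to be restarted to pick up the changed table option.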

11. Lookup Join Timeout / State Backend Pressure

java.util.concurrent.TimeoutException: Lookup join on paimon table timed out after 60s.
RocksDB compaction backlog: 128 pending files

Root cause: Paimon lookup join uses RocksDB local cache; when data volume grows, RocksDB compaction cannot keep up.

Solution: increase lookup.cache-max-memory-size, enable lookup.cache-bloom-filter.enabled = true, and increase the number of buckets of the lookup table so that each shard stays smaller.
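As a small sketch, the lookup cache size can be raised as a table option on the dimension table. The table name my_db.dim_customers and the value 1 gb are assumptions for illustration; the value has to fit within the memory actually available on each TaskManager.

```sql
-- Hypothetical dimension table used on the lookup side of the join.
-- The cache size is illustrative; size it against TaskManager memory.
ALTER TABLE my_db.dim_customers SET (
  'lookup.cache-max-memory-size' = '1 gb'
);
```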

All listed issues, aside from obvious configuration mistakes, stem from large‑scale, frequent writes. Proper tuning of time‑retention settings, compaction parameters, and resource allocation can mitigate them.

Written by

Big Data Technology & Architecture

Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
