Advanced Paimon Production Issues: 10 Rare Compaction‑Related Problems and Fixes
This article lists uncommon, compaction‑related problems encountered in large‑scale Paimon deployments, explains their root causes (RPC timeouts, snapshot expiration, file corruption, write conflicts, and others), and gives concrete configuration changes and operational steps to resolve each one.
Paimon has become an indispensable component of lake‑warehouse architectures and serves as a vector‑storage foundation for AI workloads. The list below documents eleven rarely seen problems from deep production use, most of them tied to compaction, with root‑cause analysis and remediation steps for each.
1. Ask timed out Exception
Occurs in Dedicated Compaction tasks when the OperatorCoordinator of the Paimon compact source fails to deliver split allocation events to a subtask within 180 seconds (Akka Ask Timeout), causing event loss and failover. The issue surfaces under massive real‑time writes that generate many small files, overloading TaskManager nodes or causing network jitter.
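Diagnosis: check for frequent GC pauses on the TaskManagers and for overly high parallelism on the compact source.
Mitigation: raise the RPC ask timeout (akka.ask.timeout in Flink, or pekko.ask.timeout in releases that have moved to Pekko), and lower the compact source parallelism if it is oversized.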
2. Expired Snapshot
java.lang.RuntimeException: Cannot find snapshot for scan.snapshot-id = 12345, it may have been expired by another compaction job
Root cause: a read task is still consuming an old snapshot that a compact/expire job has already cleaned up.
Solution: enlarge snapshot.time-retained (e.g., to 24h) so that snapshots needed by readers are not removed prematurely.
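As a minimal sketch, the retention change can be applied with Flink SQL. The table name my_db.my_table is hypothetical, and 24 h is simply the example value from above, not a recommendation for every workload.

```sql
-- Hypothetical table name.
-- Keep snapshots for 24 hours before they become eligible for expiration.
ALTER TABLE my_db.my_table SET ('snapshot.time-retained' = '24 h');
```

Longer retention keeps more data and metadata files on storage, so the value is best sized to the slowest reader rather than set arbitrarily high.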
3. ORC/Parquet File Corruption
java.io.IOException: Error reading ORC file: hdfs://xxx/bucket-0/data-xxx.orc
Caused by: java.lang.ArrayIndexOutOfBoundsException
Root cause: a TaskManager was killed or hit OOM during a write, leaving an incomplete file; or HDFS block replicas were lost.
Solution: the paimon-compact job skips corrupted files; manually run ALTER TABLE ... RESET PARTITION and reprocess historical data.
4. Concurrent Write Conflict on Bucket
java.lang.RuntimeException: Multiple concurrent write to the same bucket detected!
bucket = 5, partition = [2026-05-20]
Root cause: multiple Flink jobs (e.g., main job + back‑fill job) write to the same bucket of a Paimon table.
Solution: ensure a single writer per partition + bucket; run back‑fill in an isolated partition or use overwrite mode.
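One way to sketch the isolated back‑fill is a batch INSERT OVERWRITE into a partition the streaming job is no longer writing. All table and column names below are hypothetical, and the example assumes the table is partitioned by a dt string column.

```sql
-- Run the back-fill as a bounded batch job.
SET 'execution.runtime-mode' = 'batch';

-- Hypothetical tables and columns: overwrite one historical partition
-- that the main streaming job does not write to any more.
INSERT OVERWRITE my_db.orders PARTITION (dt = '2026-05-19')
SELECT order_id, amount
FROM my_db.orders_backfill_source
WHERE dt = '2026-05-19';
```

The active partition (2026-05-20 in the error above) is left to the main job, so each partition + bucket still has a single writer.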
5. Manifest Explosion (OOM)
java.lang.OutOfMemoryError: Java heap space
at org.apache.paimon.manifest.ManifestFile.read(...)
Root cause: excessive small files and partition explosion cause manifest metadata to grow to gigabyte scale.
Solution: tune manifest.merge-min-count so that small manifest files are merged sooner, increase compaction frequency to reduce small files, and clean up old partitions.
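A hedged sketch of these settings in Flink SQL follows. The table name and every value are illustrative assumptions, and the partition expiration options assume the partition value is a date string that Paimon can parse.

```sql
-- Hypothetical table name; all values are illustrative only.
ALTER TABLE my_db.my_table SET (
  -- Merge manifest files once this many small ones have accumulated.
  'manifest.merge-min-count' = '10',
  -- Drop partitions older than 30 days, checking once a day.
  'partition.expiration-time' = '30 d',
  'partition.expiration-check-interval' = '1 d'
);
```

Partition expiration deletes data, so it is worth validating on a test table before applying it in production.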
6. Compaction Never Finishes
WARN CompactManager - Compaction task for bucket 3 has been running for 1800s, exceeding compaction.max-timeout
Root cause: a single bucket holds tens of gigabytes, so one compaction round reads and writes a massive amount of data; or a sort‑compaction triggers a full re‑ordering.
Solution: increase the number of buckets to disperse data, reduce the compaction range via full-compaction.delta-commits, and consider the lookup engine instead of deduplicate.
7. Missing Changelog Files
java.io.FileNotFoundException: changelog-bucket-0-xxx.orc (No such file or directory)
Caused by: org.apache.paimon.fs.FileNotFoundException
Root cause: downstream CDC consumes changelog files that have been removed by snapshot expiration.
Solution: increase changelog.time-retained; if downstream latency is high, set changelog-producer = full-compaction to reduce file count.
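The two options can be sketched as table property changes. The table name and values are hypothetical, and the first statement assumes a Paimon version in which changelog retention is configured separately from snapshot retention.

```sql
-- Hypothetical table name; values are illustrative.
-- Keep changelog files long enough for the slowest downstream consumer.
ALTER TABLE my_db.my_table SET ('changelog.time-retained' = '24 h');

-- Optionally produce changelog only on full compaction,
-- here triggered every 10 commits.
ALTER TABLE my_db.my_table SET (
  'changelog-producer' = 'full-compaction',
  'full-compaction.delta-commits' = '10'
);
```

Switching the changelog producer changes when downstream consumers see updates, so it is safer to try it on a non‑critical table first.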
8. Schema Evolution Mismatch
java.lang.IllegalStateException: Column 'new_col' does not exist in table schema.
Current schema id = 5, required schema id = 8
Root cause: after a Flink job restart, the reader obtains a newer schema version while older data files lack the new column.
Solution: ensure schema.evolution.enabled=true and avoid pinning an old schema‑id on reader restart.
9. Orphan Files Accumulation
WARN StorageCleanup - Found 128456 orphan files under /table/bucket-0/ that are not referenced by any snapshot
Root cause: writer failures leave uncommitted temporary files (e.g., .staging-xxx) that are not cleaned.
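Solution: periodically run the orphan file cleanup procedure (table is a reserved word in Flink SQL, so the named argument is quoted with backticks):
CALL sys.remove_orphan_files(`table` => 'db.table', older_than => '2026-01-01')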
10. Commit Retry Exhausted
org.apache.paimon.operation.FileStoreCommit$CommitConflictException: Commit conflict detected after 10 retries, aborting
Root cause: high‑concurrency writes combined with frequent compactions cause optimistic‑lock conflicts beyond the retry limit.
Solution: raise the commit retry limit (commit.max-retries, for example from the default of 10 to 50), lower the compaction submission frequency, and separate write and compaction tasks.
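A minimal sketch of these changes, assuming the retry limit in question is Paimon's commit.max-retries table option and that compaction is moved to a dedicated job. The table name is hypothetical.

```sql
-- Hypothetical table name.
-- Allow more optimistic-commit retries before the job fails.
ALTER TABLE my_db.my_table SET ('commit.max-retries' = '50');

-- Writers skip compaction and snapshot expiration;
-- a separate dedicated compaction job must then be running for this table.
ALTER TABLE my_db.my_table SET ('write-only' = 'true');
```

With write-only enabled and no compaction job running, small files accumulate quickly, so the two changes should be rolled out together.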
11. Lookup Join Timeout / State Backend Pressure
java.util.concurrent.TimeoutException: Lookup join on paimon table timed out after 60s.
RocksDB compaction backlog: 128 pending files
Root cause: Paimon lookup join uses a local RocksDB cache; when data volume grows, RocksDB compaction cannot keep up.
Solution: increase lookup.cache-max-memory-size, enable lookup.cache-bloom-filter.enabled = true, and reduce the number of buckets per lookup table to keep each shard smaller.
All listed issues, aside from obvious configuration mistakes, stem from large‑scale, frequent writes. Proper tuning of time‑retention settings, compaction parameters, and resource allocation can mitigate them.
Big Data Technology & Architecture
Wang Zhiwu, a big data expert, dedicated to sharing big data technology.
