How a Hidden Compression Bomb Triggered OOM Crashes in an Nginx Data Gateway
A specially crafted request caused memory usage to spike until the OOM killer terminated worker processes of an Nginx‑based data‑collection gateway. The investigation traced the root cause to a compression‑bomb style payload combined with coarse‑grained memory‑pool allocation.
Problem Background
The data‑collection gateway, built on Nginx, started experiencing occasional worker process crashes. The master process restarted the workers each time, so service continued, and the crashes were eventually traced to out‑of‑memory (OOM) kills.
Initial Analysis
Memory usage on the host appeared stable (around 40% of RAM), but the metrics were collected at minute granularity and missed short spikes. No core dumps were produced either, because the OOM killer terminates processes with SIGKILL, which cannot be caught and does not generate a core dump.
When you hit a dead end in a maze, you must reconsider the previous steps.
Discovering the OOM Trigger
Second‑level monitoring revealed that, at the crash moment, a worker’s memory jumped from a few hundred MB to over 10 GB within seconds, causing the kernel to kill the process.
To obtain a core dump, a user‑space helper was added that monitors worker memory and, when a threshold is exceeded, sends SIGABRT to force a dump before the kernel steps in.
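The helper described above can be sketched roughly as follows. This is a minimal illustration, not the team's actual tool: it assumes Linux (it reads VmRSS from /proc), and the names rss_bytes and watch, the polling interval, and the way worker PIDs are supplied are all hypothetical.

```python
import os
import signal
import time


def rss_bytes(pid: int) -> int:
    """Return the resident set size of `pid` in bytes, read from /proc (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) * 1024  # the kernel reports this value in kB
    return 0  # kernel threads and zombies have no VmRSS line


def watch(pids, limit_bytes, interval=1.0, rounds=None,
          read_rss=rss_bytes, send=os.kill):
    """Poll each PID every `interval` seconds and abort any that exceed the limit.

    `rounds=None` polls until no watched process is left. `read_rss` and `send`
    are injectable so the logic can be exercised without real processes.
    Returns the list of PIDs that were sent SIGABRT.
    """
    aborted = []
    live = set(pids)
    done = 0
    while live and (rounds is None or done < rounds):
        for pid in sorted(live):
            try:
                if read_rss(pid) > limit_bytes:
                    # Default action of SIGABRT: terminate and dump core.
                    send(pid, signal.SIGABRT)
                    aborted.append(pid)
                    live.discard(pid)
            except (FileNotFoundError, ProcessLookupError):
                live.discard(pid)  # the process has already exited
        done += 1
        if live and (rounds is None or done < rounds):
            time.sleep(interval)
    return aborted
```

The threshold has to sit well below the point at which the kernel would intervene, and the process's core size limit must allow dumps; otherwise SIGABRT terminates the worker without leaving a file behind.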
Memory‑Pool Investigation
The gateway processes data in stages: request reception, processing, batching, and sending. Each batch creates a memory pool (≈3 MB) that is released only after the HTTP request is fully sent.
Under normal load, the pool is quickly freed, keeping memory usage low. However, if many batch‑write requests are created faster than they can be sent, memory pools accumulate.
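The accumulation effect can be illustrated with a toy simulation. It is only a sketch under simplifying assumptions: time advances in discrete ticks, creation and send rates are constant, and every pool is exactly 3 MB. The function name and parameters are made up for illustration.

```python
POOL_MB = 3  # approximate pool size per batch, as stated in the text


def peak_pool_memory_mb(created_per_tick, sent_per_tick, ticks, pool_mb=POOL_MB):
    """Simulate batches whose pools are freed only after their request is sent.

    Returns the peak amount of memory (in MB) held by live pools.
    """
    pending = 0
    peak = 0
    for _ in range(ticks):
        pending += created_per_tick             # each new batch allocates a pool
        peak = max(peak, pending)               # all of these pools are live now
        pending -= min(pending, sent_per_tick)  # completed sends release pools
    return peak * pool_mb


# Sends keep up with creation: memory stays flat.
balanced = peak_pool_memory_mb(created_per_tick=10, sent_per_tick=10, ticks=100)

# Creation outpaces sending: the backlog, and the memory, grows every tick.
backlogged = peak_pool_memory_mb(created_per_tick=50, sent_per_tick=10, ticks=30)
```

With these illustrative numbers the balanced case peaks at 30 MB, while the backlogged case reaches 3630 MB after only 30 ticks.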
Signal      Standard   Action   Comment
───────────────────────────────────────────────────────────────
SIGIOT      -          Core     IOT trap (synonym for SIGABRT)
SIGKILL     P1990      Term     Kill signal
SIGLOST     -          Term     File lock lost (unused)
...
Root Cause: Compression Bomb
Testing with mock data showed that a payload of 10 000 identical events (≈34.7 MB uncompressed) compressed to only 1.2 MB, about 3.5% of its original size (a compression ratio of roughly 28.6:1). The gateway’s 4 MB body limit was applied to the compressed payload, so a request well under the limit could still carry a massive number of events.
origin data bytes: 34697723
compressed data bytes: 1214252
Each batch of ~25.6 KB triggered one write request, so a 32 MB payload generated about 1 250 write requests, each holding its own ≈3 MB memory pool. With three parallel output channels, memory demand exceeded 10 GB.
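The figures above can be checked with a few lines of arithmetic. This is a back-of-the-envelope sketch: it assumes decimal units (1 MB = 1 000 KB), one ≈3 MB pool per write request, and that all pools are live at the same time across the three channels, which is the worst case rather than a measured value.

```python
ORIGIN_BYTES = 34_697_723     # "origin data bytes" from the test output
COMPRESSED_BYTES = 1_214_252  # "compressed data bytes" from the test output

ratio = ORIGIN_BYTES / COMPRESSED_BYTES     # uncompressed size relative to compressed
fraction = COMPRESSED_BYTES / ORIGIN_BYTES  # compressed size as a share of the original

PAYLOAD_KB = 32_000  # 32 MB of decompressed events
BATCH_KB = 25.6      # size of one batch, i.e. one write request
POOL_MB = 3          # approximate pool size per write request
CHANNELS = 3         # parallel output channels

write_requests = round(PAYLOAD_KB / BATCH_KB)
worst_case_mb = write_requests * POOL_MB * CHANNELS

print(f"compression ratio  : {ratio:.1f}:1 ({fraction:.1%} of original)")
print(f"write requests     : {write_requests}")
print(f"worst-case pool use: {worst_case_mb} MB")
```

The byte counts give a ratio of about 28.6 to 1, and 1 250 requests × 3 MB × 3 channels comes to 11 250 MB, consistent with the observed jump to more than 10 GB.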
Solution
Added limits on the number of schema events per request.
Released raw data memory immediately after compression, keeping only the compressed payload.
Made memory‑pool allocation dynamic instead of a fixed 3 MB per request.
Considered rewriting critical components in a memory‑safe language such as Rust.
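A related guard, shown here only as an illustration and not necessarily what the team implemented, is to cap the decompressed size rather than relying on a limit over the compressed body. The sketch assumes the payload uses zlib or gzip framing (the article does not name the format); bounded_decompress and PayloadTooLarge are hypothetical names.

```python
import zlib


class PayloadTooLarge(Exception):
    """Raised when a body expands beyond the allowed decompressed size."""


def bounded_decompress(body: bytes, max_out: int, chunk: int = 64 * 1024) -> bytes:
    """Decompress `body`, refusing to produce more than `max_out` bytes."""
    d = zlib.decompressobj(zlib.MAX_WBITS | 32)  # auto-detect zlib or gzip header
    out = bytearray()
    data = body
    while data:
        # Request at most one byte beyond the remaining budget, so an overflow
        # is detected without inflating the whole payload first.
        budget = max_out - len(out) + 1
        out += d.decompress(data, min(chunk, budget))
        if len(out) > max_out:
            raise PayloadTooLarge(f"decompressed size exceeds {max_out} bytes")
        data = d.unconsumed_tail  # input not yet processed because of the cap
    out += d.flush()
    if len(out) > max_out:
        raise PayloadTooLarge(f"decompressed size exceeds {max_out} bytes")
    return bytes(out)


# Ten million identical bytes shrink to a body of only a few kilobytes.
bomb = zlib.compress(b"x" * 10_000_000)

try:
    bounded_decompress(bomb, max_out=1_000_000)
    rejected = False
except PayloadTooLarge:
    rejected = True
```

Because decompression stops as soon as the budget is exceeded, the cost of rejecting an oversized request is bounded by max_out instead of by whatever the payload expands to.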
Conclusion
After months of hypothesis, testing, and verification, the “time‑bomb” was eliminated. The case highlights the importance of fine‑grained memory management, proper payload size checks, and thorough monitoring in high‑throughput backend services.
Alibaba Cloud Developer
