Why Our Nginx Data Gateway OOM’d: Tracing Memory Spikes & Core Dumps
An Nginx‑based data collection gateway began crashing with OOM kills. The investigation traced the crashes to memory spikes caused by aggressive batch processing, highly compressible protobuf payloads, and fixed‑size memory pools that were allocated faster than they were released. It produced a custom core‑dump mechanism and several mitigations.
Background
The data collection gateway, built on Nginx, handles mobile data ingestion for Ant Group. Since late last year its worker processes had been crashing occasionally, although the master process automatically restarted them each time.
Initial Analysis
A memory leak was suspected at first, but overall memory usage appeared stable. No core‑dump files were generated either: the kernel OOM‑killer terminates its victim with SIGKILL, which cannot be caught and whose default action is to terminate the process without dumping core.
Signal      Standard   Action   Comment
──────────────────────────────────────────────────────
...
SIGKILL      P1990      Term    Kill signal
...
Discovery of Memory Spike
Per‑second monitoring revealed that a single worker’s memory jumped from a few hundred MB to over 10 GB within seconds, triggering the OOM‑killer.
Root Cause Investigation
External attacks and network spikes were ruled out. Attention turned to the data batch‑processing stage. The gateway creates a memory pool for each batch write; the pool size is at least the batch threshold plus the maximum per‑event size plus protocol metadata, roughly 3 MB per request.
Under normal traffic the memory usage stays around 30 % of the host’s RAM, but when many batch write requests are created faster than the pools are released, memory spikes dramatically.
Core‑Dump Workaround
To obtain diagnostic core files, a user‑space OOM‑killer was added. It monitors each worker’s memory and, once a configurable threshold is exceeded, sends SIGABRT to force a core dump. The auxiliary process limits itself to one trigger per restart to avoid accidental mass kills.
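The decision logic of such a watchdog can be sketched as follows. This is a minimal illustration, not the gateway's actual implementation: the function names, the "largest worker first" policy, and the use of the VmRSS field from /proc/<pid>/status are all assumptions made for the example.

```python
import os
import signal
from typing import Dict, Optional


def parse_vm_rss_kb(status_text: str) -> int:
    """Extract VmRSS (in kB) from the text of /proc/<pid>/status; 0 if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # The line looks like "VmRSS:    204800 kB"
            return int(line.split()[1])
    return 0


def read_rss_kb(pid: int) -> int:
    """Read the resident set size of a live process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vm_rss_kb(f.read())


def pick_victim(rss_by_pid: Dict[int, int], limit_kb: int,
                already_fired: bool) -> Optional[int]:
    """Return the pid of the largest worker above the limit, or None.

    `already_fired` models the "one trigger per restart" safeguard:
    once the watchdog has aborted a worker, it stays quiet.
    """
    if already_fired:
        return None
    over = {pid: rss for pid, rss in rss_by_pid.items() if rss > limit_kb}
    if not over:
        return None
    return max(over, key=over.get)


def abort_worker(pid: int) -> None:
    """Send SIGABRT, whose default action is to terminate and dump core."""
    os.kill(pid, signal.SIGABRT)
```

A polling loop would call read_rss_kb for each worker pid, pass the results to pick_victim, and call abort_worker on the returned pid. A core file is only written if the worker's core size limit allows it, for example through Nginx's worker_rlimit_core and working_directory directives.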
Reproducing the Issue
The core dump showed the stack in ngx_pcalloc during batch write, pointing to schema data handling. The gateway enforces three size limits:
Body size before decompression: 4 MB
Decompressed data limit: 32 MB
Per‑event size (dynamic): up to 2 MB
Compression Bomb Hypothesis
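Analysis of the core dump revealed a user ID repeated ~60 k times, suggesting a high compression ratio. A mock test generated 10 k protobuf events (≈34.7 MB raw), which compressed to 1.2 MB: about 3.5 % of the original size, or a roughly 28.6× reduction. A payload like this passes the 4 MB body limit while still carrying tens of MB of raw data.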
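The effect is easy to reproduce with a stand‑in payload. The sketch below is not the article's protobuf test: it uses zlib and a made‑up JSON‑like record with a hypothetical user ID, purely to show how repetition drives the compression ratio. It also recomputes the ratio from the byte counts the article reports.

```python
import zlib


def compression_ratio(raw_len: int, compressed_len: int) -> float:
    """Raw size divided by compressed size."""
    return raw_len / compressed_len


# Ratio implied by the byte counts reported in the article's mock test.
reported = compression_ratio(34_697_723, 1_214_252)

# Hypothetical stand-in record: the same user ID repeated in every event.
record = b'{"user_id":"2088000000000001","event":"page_view","ts":1700000000}\n'
raw = record * 10_000
compressed = zlib.compress(raw)
demo = compression_ratio(len(raw), len(compressed))

print(f"reported ratio: {reported:.1f}x")
print(f"stand-in: {len(raw)} -> {len(compressed)} bytes ({demo:.0f}x)")
```

Identical records compress far better than real events would, so the stand‑in ratio is an upper‑end illustration rather than a prediction. The byte counts printed by the article's own test were: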
origin data bytes: 34697723
compressed data bytes: 1214252
Root Cause Confirmation
Sending a request carrying 32 MB of raw data (within the limits) reproduced the memory explosion. With a batch threshold of 25.6 KB, the request is split into about 1,250 write requests. Each one allocates a ~3 MB pool, for ~3.75 GB in total. Because the gateway writes to three parallel channels (SLS and two downstream systems), that figure triples to roughly 11 GB, which matches the observed spike of more than 10 GB.
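The arithmetic can be checked in a few lines. The sketch assumes decimal units (1 MB = 1,000,000 bytes), which is what the article's figure of about 1,250 requests implies, and treats the pool size as exactly 3 MB; the function name is illustrative.

```python
def estimate_peak_pool_bytes(raw_bytes: int, batch_threshold_bytes: int,
                             pool_bytes: int, channels: int) -> dict:
    """Estimate pool memory if every batch pool is alive at the same time."""
    write_requests = raw_bytes // batch_threshold_bytes
    per_channel = write_requests * pool_bytes
    return {
        "write_requests": write_requests,
        "per_channel_bytes": per_channel,
        "total_bytes": per_channel * channels,
    }


estimate = estimate_peak_pool_bytes(
    raw_bytes=32_000_000,          # 32 MB decompressed request
    batch_threshold_bytes=25_600,  # 25.6 KB batch threshold
    pool_bytes=3_000_000,          # ~3 MB pool per write request
    channels=3,                    # SLS plus two downstream systems
)
print(estimate)
```

This is a worst‑case figure: it assumes no pool is released before the last one is created, which is the situation the article describes when requests are created faster than pools are freed.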
Mitigation
Several actions were taken:
Added per‑schema event count limits to prevent excessive batch sizes.
Adjusted the memory‑pool allocation to be dynamic rather than a fixed 3 MB.
Released raw protobuf data immediately after compression, keeping only the compressed payload in memory.
Explored implementing critical components in Rust for safer memory handling.
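The first two mitigations can be illustrated with a small sketch. The article does not give the actual sizing rule or the configured limits, so the metadata overhead, the cap, the limit value, and both function names below are assumptions chosen only to show the idea.

```python
from typing import Dict, List


def dynamic_pool_bytes(payload_bytes: int, metadata_bytes: int = 4_096,
                       cap_bytes: int = 3_000_000) -> int:
    """Size a pool from the batch it will hold, never above the old fixed size."""
    return min(payload_bytes + metadata_bytes, cap_bytes)


def schemas_over_limit(counts_by_schema: Dict[str, int],
                       max_events_per_schema: int) -> List[str]:
    """Return the schemas whose event count exceeds the per-schema limit."""
    return sorted(s for s, n in counts_by_schema.items()
                  if n > max_events_per_schema)


# Same scenario as the reproduction: 1,250 batches of 25.6 KB on 3 channels.
dynamic_total = 1_250 * dynamic_pool_bytes(25_600) * 3
fixed_total = 1_250 * 3_000_000 * 3

print(f"fixed pools:   {fixed_total / 1e9:.2f} GB")
print(f"dynamic pools: {dynamic_total / 1e6:.1f} MB")
print(schemas_over_limit({"click": 120, "page_view": 60_000}, 5_000))
```

Under these assumed numbers, sizing each pool to its batch reduces the worst case from about 11 GB to about 111 MB, and the per‑schema limit rejects the kind of 60 k‑event burst seen in the core dump before any pools are created.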
Conclusion
After half a year of hypothesis, testing, and code changes, the “time‑bomb” in the gateway was eliminated. The case highlights the importance of systematic debugging, collaborative discussion, and cautious assumptions when dealing with high‑throughput backend services.
ITPUB
Official ITPUB account sharing technical insights, community news, and exciting events.
