Why Our Nginx Data Gateway OOM’d: Tracing Memory Spikes & Core Dumps

An Nginx‑based data collection gateway began crashing with OOM kills, prompting a detailed investigation that uncovered memory spikes caused by aggressive batch processing, oversized protobuf payloads, and insufficient memory‑pool management, leading to a custom core‑dump solution and several mitigation strategies.

ITPUB

Nov 1, 2024

Background

The data collection gateway, built on Nginx, handles mobile data ingestion for Ant Group. Since late last year its worker processes have occasionally crashed; the master process restarts them automatically, so the service kept running while the cause stayed hidden.

Initial Analysis

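At first a memory leak was suspected, but overall memory usage appeared stable. No core-dump files were being generated either: the kernel OOM killer terminates its victim with SIGKILL, which cannot be caught and whose default action is to terminate the process without writing a core dump.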

Signal      Standard   Action   Comment
──────────────────────────────────────────────────────
... 
SIGKILL     P1990       Term      Kill signal
...
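
The practical consequence of that table row can be checked from user space. The sketch below is only an illustration (the helper name can_install_handler is ours, not part of the gateway): it tries to install a handler for a signal and reports whether the operating system allows it. It assumes a POSIX system and that it runs in the main thread, which Python requires for signal.signal.

```python
import signal

def can_install_handler(signum):
    """Return True if the OS lets this process install a handler for signum."""
    def _noop(received_signum, frame):
        pass

    try:
        previous = signal.signal(signum, _noop)
    except OSError:
        # The kernel rejects any attempt to change the action for SIGKILL.
        return False
    # Put the earlier handler back so the probe leaves no side effects.
    if previous is not None:
        signal.signal(signum, previous)
    return True

if __name__ == "__main__":
    print("SIGKILL catchable:", can_install_handler(signal.SIGKILL))
    print("SIGABRT catchable:", can_install_handler(signal.SIGABRT))
```

Because SIGKILL can be neither handled nor turned into a core-generating action, a process killed by the OOM killer leaves nothing behind to inspect.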

Discovery of Memory Spike

Per-second monitoring revealed that a worker's memory jumped from a few hundred MB to over 10 GB within seconds, triggering the OOM killer. Coarser sampling had averaged these short spikes away, which is why overall usage looked stable.
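
A minimal sketch of this kind of per-second sampling on Linux follows. It is not the team's monitoring stack; it assumes the worker's resident set size is read from the VmRSS line of /proc/<pid>/status, and the function names and the growth threshold are hypothetical.

```python
import time

def parse_vmrss_kb(status_text):
    """Extract VmRSS in kB from the text of /proc/<pid>/status, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None

def read_rss_kb(pid):
    """Read the current resident set size of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_kb(f.read())

def find_spikes(samples_kb, max_growth_kb):
    """Return indices of samples that grew by more than max_growth_kb since the previous one."""
    return [
        i for i in range(1, len(samples_kb))
        if samples_kb[i] - samples_kb[i - 1] > max_growth_kb
    ]

def sample_rss(pid, count, interval_s=1.0):
    """Collect `count` RSS readings, one every `interval_s` seconds."""
    samples = []
    for _ in range(count):
        samples.append(read_rss_kb(pid))
        time.sleep(interval_s)
    return samples
```

With one reading per second, a jump from hundreds of MB to several GB shows up as one or two flagged indices, whereas a per-minute average could hide it entirely.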

Root Cause Investigation

External attacks and network spikes were ruled out. Attention turned to the data batch‑processing stage. The gateway creates a memory pool for each batch write; the pool size is at least the batch threshold plus the maximum per‑event size plus protocol metadata, roughly 3 MB per request.

Under normal traffic the memory usage stays around 30 % of the host’s RAM, but when many batch write requests are created faster than the pools are released, memory spikes dramatically.
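
The effect of creation outpacing release can be illustrated with a toy model. The per-tick rates below are invented for illustration; only the roughly 3 MB pool size comes from the description above.

```python
POOL_MB = 3  # approximate per-request pool size from the text above

def peak_outstanding_pools(created_per_tick, released_per_tick):
    """Track live pools tick by tick and return the highest count seen.

    In each tick the new pools are created first, the peak is recorded,
    and then up to `released` pools are freed.
    """
    outstanding = 0
    peak = 0
    for created, released in zip(created_per_tick, released_per_tick):
        outstanding += created
        peak = max(peak, outstanding)
        outstanding -= min(released, outstanding)
    return peak

# Steady state: every pool created in a tick is released in the same tick.
steady = peak_outstanding_pools([10] * 5, [10] * 5)

# Burst: one tick creates far more pools than the release rate can absorb.
burst = peak_outstanding_pools([10, 1250, 0, 0], [10, 10, 10, 10])

print(steady * POOL_MB, "MB at steady state")
print(burst * POOL_MB, "MB at the burst peak")
```

The release rate is the same in both runs; only the burst changes, and the peak rises in direct proportion to the number of pools alive at once.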

Core‑Dump Workaround

To obtain diagnostic core files, a user‑space OOM‑killer was added. It monitors each worker’s memory and, once a configurable threshold is exceeded, sends SIGABRT to force a core dump. The auxiliary process limits itself to one trigger per restart to avoid accidental mass kills.
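
The idea can be sketched as follows. This is an assumption-laden illustration rather than the gateway's actual auxiliary process: the class name, the kB threshold, and the injectable send_signal parameter are all hypothetical. Note also that SIGABRT only yields a core file if the core size limit (ulimit -c) and the system's core pattern permit it.

```python
import os
import signal

class UserSpaceOomKiller:
    """Abort the first worker whose RSS exceeds a threshold, at most once."""

    def __init__(self, threshold_kb, send_signal=os.kill):
        self.threshold_kb = threshold_kb
        self.send_signal = send_signal  # injectable so the logic can be tested
        self.triggered = False          # one trigger per lifetime of this watcher

    def check(self, rss_by_pid):
        """rss_by_pid maps pid -> RSS in kB. Returns the aborted pid, or None."""
        if self.triggered:
            return None
        for pid, rss_kb in rss_by_pid.items():
            if rss_kb > self.threshold_kb:
                # SIGABRT's default action is to terminate with a core dump.
                self.send_signal(pid, signal.SIGABRT)
                self.triggered = True
                return pid
        return None
```

Setting the threshold below the level at which the kernel OOM killer would act means the worker dies by SIGABRT first, while its memory is still intact enough to be dumped.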

Reproducing the Issue

The core dump showed the stack in ngx_pcalloc during batch write, pointing to schema data handling. The gateway enforces three size limits:

Body size before decompression: 4 MB

Decompressed data limit: 32 MB

Per-event size: dynamic, up to 2 MB
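
A sketch of how three such limits might be enforced is shown below. The article does not name the compression format, so zlib is an assumption here, as are the binary interpretation of "MB", the constant names, and the LimitExceeded exception. The key point is that decompression is bounded, so an oversized payload is rejected without ever being fully inflated.

```python
import zlib

MAX_BODY_BYTES = 4 * 1024 * 1024           # body size before decompression
MAX_DECOMPRESSED_BYTES = 32 * 1024 * 1024  # decompressed data limit
MAX_EVENT_BYTES = 2 * 1024 * 1024          # upper bound of the dynamic per-event limit

class LimitExceeded(Exception):
    """Raised when a request violates one of the size limits."""

def bounded_decompress(body, max_body=MAX_BODY_BYTES, max_out=MAX_DECOMPRESSED_BYTES):
    """Decompress `body`, refusing to produce more than `max_out` bytes."""
    if len(body) > max_body:
        raise LimitExceeded("compressed body too large")
    decompressor = zlib.decompressobj()
    # Request one byte more than allowed: enough to detect overflow cheaply.
    out = decompressor.decompress(body, max_out + 1)
    if len(out) > max_out:
        raise LimitExceeded("decompressed data too large")
    return out

def check_event_sizes(events, max_event=MAX_EVENT_BYTES):
    """Verify that every serialized event fits within the per-event limit."""
    for index, event in enumerate(events):
        if len(event) > max_event:
            raise LimitExceeded(f"event {index} too large")
```

All three checks bound the size of the data itself. None of them bounds how many events, and therefore how many downstream write requests, a single body can produce.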

Compression Bomb Hypothesis

Analysis of the core dump revealed a user ID repeated ~60 k times, suggesting highly compressible data. A mock test generated 10 k protobuf events (≈34.7 MB raw), which compressed to 1.2 MB, about 3.5 % of the original size (a roughly 28.6× reduction).

origin data bytes: 34697723
compressed data bytes: 1214252
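
The ratio follows directly from the two byte counts above. The snippet below recomputes it and then, as a separate illustration, compresses a made-up repeated user-ID string with zlib to show how well repetitive data compresses. The ID value and the choice of zlib are assumptions, not details from the incident.

```python
import zlib

# Byte counts reported by the mock test.
origin_bytes = 34_697_723
compressed_bytes = 1_214_252

ratio = origin_bytes / compressed_bytes
fraction = compressed_bytes / origin_bytes
print(f"{ratio:.1f}x smaller, {fraction:.1%} of the original size")

# Hypothetical payload: one user ID field repeated 60,000 times.
sample = b"user_id=2088000000000000;" * 60_000
packed = zlib.compress(sample)
sample_ratio = len(sample) / len(packed)
print(f"repeated-ID sample: {len(sample)} -> {len(packed)} bytes")
```

A payload like this stays comfortably under a 4 MB body limit while expanding to tens of megabytes once decompressed.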

Root Cause Confirmation

Sending a 32 MB raw request (within the decompressed-size limit) reproduced the memory explosion. With a batch threshold of 25.6 KB, the request is split into about 1,250 write requests. Each one allocates a ~3 MB pool, for roughly 3.75 GB in total. Because the gateway writes to three parallel channels (SLS and two downstream systems), that figure triples to about 11 GB, consistent with the observed spike above 10 GB.
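
The arithmetic can be written out as a short calculation. Decimal units (1 MB = 1,000,000 bytes) are assumed here because they reproduce the article's figure of about 1,250 write requests; with binary units the count would be 1,280.

```python
raw_request_bytes = 32_000_000   # 32 MB decompressed request
batch_threshold_bytes = 25_600   # 25.6 KB batch threshold
pool_mb = 3                      # approximate pool size per write request
channels = 3                     # SLS plus two downstream systems

write_requests = raw_request_bytes // batch_threshold_bytes
per_channel_mb = write_requests * pool_mb
total_mb = per_channel_mb * channels

print(write_requests, "write requests")
print(per_channel_mb / 1000, "GB per channel")
print(total_mb / 1000, "GB across all channels")
```

The pool size dominates the result: each batch carries only about 25.6 KB of payload but reserves around 3 MB, an overhead of more than a hundredfold.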

Mitigation

Several actions were taken:

Added per‑schema event count limits to prevent excessive batch sizes.

Adjusted the memory‑pool allocation to be dynamic rather than a fixed 3 MB.

Released raw protobuf data immediately after compression, keeping only the compressed payload in memory.

Explored implementing critical components in Rust for safer memory handling.
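
The first two mitigations can be sketched roughly as follows. The function names, the cap value, and the minimum and maximum pool sizes are hypothetical; the article does not give the real limits or the sizing rule that was adopted.

```python
from collections import Counter

def within_schema_limits(event_schemas, max_events_per_schema):
    """Return True if no schema appears more often than the allowed count.

    `event_schemas` holds one schema name per event in the request.
    """
    counts = Counter(event_schemas)
    return all(n <= max_events_per_schema for n in counts.values())

def initial_pool_bytes(batch_bytes, metadata_bytes,
                       min_pool=4096, max_pool=3 * 1024 * 1024):
    """Size a pool from the actual batch instead of always reserving the maximum.

    The result is clamped to [min_pool, max_pool]; the old fixed size acts
    only as the upper bound.
    """
    needed = batch_bytes + metadata_bytes
    return max(min_pool, min(needed, max_pool))

# A 25.6 KB batch with a small, made-up metadata allowance.
print(initial_pool_bytes(25_600, 512), "bytes instead of", 3 * 1024 * 1024)
```

Under these assumptions a typical batch reserves tens of kilobytes rather than 3 MB, and the event cap limits how many such pools one request can create in the first place.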

Conclusion

After half a year of hypotheses, testing, and code changes, the "time bomb" in the gateway was eliminated. The case highlights the importance of systematic debugging, collaborative discussion, and cautious assumptions when dealing with high-throughput backend services.

