Why Did Our Nginx‑Based Data Gateway Suddenly Consume 10 GB RAM and Crash?

A detailed post‑mortem explains how a special request caused an Nginx‑based data‑collection gateway to spike memory usage, trigger OOM‑killer termination, and crash, and walks through the debugging steps, core‑dump analysis, root‑cause discovery, and the eventual fix.

dbaplus Community

Nov 20, 2024

Problem Background

The event-tracking gateway (埋点网关) is an Nginx‑based gateway that collects data from mobile clients. Since late last year it had crashed occasionally, but the master process restarted the workers and clients retried their uploads, so the issue seemed harmless at first.

Initial Analysis

Because the service is written in C, occasional crashes were not surprising in themselves. Monitoring showed memory usage holding at around 40%, so a memory leak was ruled out. However, no core‑dump files were being generated. That pointed to the kernel OOM‑killer: it terminates processes with SIGKILL, which cannot be caught and whose default action is to terminate the process without dumping core.

Signal      Standard   Action   Comment
───────────────────────────────────────
SIGIOT         -        Core    IOT trap. A synonym for SIGABRT
SIGKILL      P1990      Term    Kill signal
SIGLOST        -        Term    File lock lost (unused)
...

Core‑Dump Investigation

Since SIGKILL prevented core dumps, the team wrote a lightweight user‑space OOM‑killer: it polls each worker's memory usage once per second and sends SIGABRT when usage exceeds a threshold, forcing a core dump before the kernel steps in. After deployment, core files were obtained. The first ones were truncated; lowering the threshold to 4 GB fixed that.
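The watchdog idea can be sketched in a few lines. This is a minimal illustration, not the team's actual tool: it assumes a Linux host where `/proc/<pid>/status` exposes a `VmRSS` line, and the function names, the fixed PID list, and the threshold constant are all hypothetical.

```python
import os
import signal
import time

# Assumed threshold, mirroring the 4 GB figure mentioned in the text.
THRESHOLD_BYTES = 4 * 1024**3


def parse_rss_bytes(status_text):
    """Extract VmRSS (resident set size) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            # Line format: "VmRSS:     123456 kB"
            return int(line.split()[1]) * 1024
    return None  # some processes (e.g. kernel threads) have no VmRSS line


def should_abort(rss_bytes, threshold=THRESHOLD_BYTES):
    """True when a worker's resident memory exceeds the threshold."""
    return rss_bytes is not None and rss_bytes > threshold


def watch(pids, interval=1.0):
    """Poll each worker once per interval and SIGABRT any worker over the threshold."""
    while True:
        for pid in pids:
            try:
                with open(f"/proc/{pid}/status") as f:
                    rss = parse_rss_bytes(f.read())
                if should_abort(rss):
                    # SIGABRT's default action terminates the process and dumps core.
                    os.kill(pid, signal.SIGABRT)
            except (FileNotFoundError, ProcessLookupError):
                continue  # the worker exited between checks
        time.sleep(interval)
```

A real deployment would also need to rediscover worker PIDs after the master respawns them, and core dumps must be enabled for the workers (core size limit and core pattern) for the SIGABRT to produce a file.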

Hypothesis and Monitoring

Further monitoring with per‑second metrics revealed a sharp spike in memory usage on the crashing machine: a worker's memory jumped from a few hundred MB to over 10 GB within seconds, at which point the kernel killed it.
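One plausible reason the earlier dashboards looked flat is sampling granularity: a spike that lasts a few seconds can fall entirely between two coarse samples. The sketch below illustrates that effect with a made-up trace; the numbers are invented for the example and are not measurements from the incident.

```python
def peak_seen(samples, step):
    """Highest value visible when only every `step`-th sample is recorded."""
    return max(samples[::step])


# Made-up per-second RSS trace in MB: steady usage, a five-second spike,
# then the worker is killed and its replacement starts small again.
trace = [300] * 70 + [900, 2900, 5200, 7400, 10600] + [50] * 45

print(peak_seen(trace, 60))  # one sample per minute misses the spike
print(peak_seen(trace, 1))   # one sample per second catches it
```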

The suspicion shifted to the data‑batching stage, where memory pools are allocated per write request. Each pool is at least 3 MB to accommodate the largest possible event, and many write requests can be created quickly.

Root Cause

Two factors combined:

A single request could carry a very large amount of data. Because many fields were repeated, the data compressed by more than 3.5×, which allowed compressed bodies larger than 4 MB to bypass the size limits.

During batching, the gateway creates a write request for every ~25.6 KB of data. A 32 MB payload therefore spawns about 1,250 write requests, each allocating ~3 MB, totaling ~3.75 GB. The gateway also forwards the same data to three downstream channels, multiplying the memory to >11 GB.

This explains the sudden >10 GB memory consumption and OOM‑killer termination.
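The arithmetic behind these figures can be checked directly. The sketch assumes decimal units (1 MB = 1,000 KB), which is what reproduces the article's count of 1,250; with binary units the count would be 1,280 and the totals slightly higher.

```python
import math

KB = 1000
MB = 1000 * KB
GB = 1000 * MB

payload = 32 * MB          # payload size from the text
bytes_per_write = 25_600   # ~25.6 KB of data per write request
pool_size = 3 * MB         # memory pool allocated per write request
channels = 3               # downstream channels receiving the same data

write_requests = math.ceil(payload / bytes_per_write)
per_channel = write_requests * pool_size
total = per_channel * channels

print(write_requests)      # 1250
print(per_channel / GB)    # 3.75
print(total / GB)          # 11.25
```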

Solution

The fix limits the number of schema events per request, adds alerts for excessive schema uploads, and refines memory‑pool handling:

Release raw (pre‑compression) data immediately after the HTTP request is sent, keeping only the compressed payload.

Allocate memory pools dynamically based on actual batch size instead of a fixed 3 MB.

Consider rewriting critical components in a memory‑safe language such as Rust.
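The per-request limit and the dynamic pool sizing can be sketched as follows. This is an illustration under stated assumptions, not the gateway's implementation: the article does not give the real event cap, the floor, or any headroom factor, so `MAX_EVENTS_PER_REQUEST`, `MIN_POOL`, and `HEADROOM` are hypothetical values, as are the function names.

```python
MIN_POOL = 4 * 1024            # assumed floor so tiny batches still get a usable pool
MAX_POOL = 3 * 1024 * 1024     # the former fixed 3 MB size, kept as an upper bound
HEADROOM = 1.25                # assumed slack for headers and bookkeeping
MAX_EVENTS_PER_REQUEST = 500   # hypothetical cap; the real value is not stated


def pool_size_for(batch_bytes):
    """Size a pool from the batch it will hold, clamped to [MIN_POOL, MAX_POOL]."""
    wanted = int(batch_bytes * HEADROOM)
    return max(MIN_POOL, min(wanted, MAX_POOL))


def accept_request(event_count):
    """Reject requests that carry more events than the configured cap."""
    return event_count <= MAX_EVENTS_PER_REQUEST


# With pools sized to the batch, 1,250 write requests of 25,600 bytes each
# need about 40 MB per channel under these assumptions.
print(1250 * pool_size_for(25_600))
```

Under these assumed constants, the same burst of write requests needs tens of megabytes rather than several gigabytes, and the event cap bounds how many write requests one upload can create in the first place.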

Takeaways

Even well‑tested services can hide “time‑bomb” bugs that surface only under specific data patterns. Systematic hypothesis testing, core‑dump analysis, and fine‑grained monitoring are essential for diagnosing such issues. Limiting request sizes, improving memory‑pool strategies, and early data release can prevent similar OOM incidents.
