Debugging Rare Core Dumps in High‑Concurrency Nginx: From GDB to ASan

This article details a real-world investigation of extremely low-probability core dumps and memory leaks in a heavily modified Nginx/OpenSSL stack. It covers debugging strategies, custom traffic-control testing, distributed load generation, valgrind and AddressSanitizer, performance profiling with perf, and the mindset needed to solve such high-concurrency bugs.

dbaplus Community

Project Background

We deeply modified the Nginx event framework and the OpenSSL protocol stack to boost HTTPS full-handshake performance. Native Nginx computes RSA on the CPU and reaches only about 400 QPS per core, so even with 24 cores the throughput stays below 10k QPS (24 × 400 ≈ 9,600).

Problems Encountered

Extremely low‑probability core dumps (≈1 in 10⁸) under high load.

Memory leak that appears only when QPS exceeds 10k.

Need to locate performance hotspots after the code changes.

Core Dump Debugging Approach

Initial attempts with gdb and debug logs proved ineffective because the crashes were caused by NULL‑pointer dereferences that are hard to trace in an asynchronous, multi‑process event model.

Adding defensive NULL checks prevented the crash at the original location but merely moved it elsewhere, confirming that the root cause lay deeper in the code.

Improved Logging Strategies

Enable debug logs only for specific client IPs.

Add custom logs at the EMERG level on critical code paths, so they are written regardless of the configured log level.

Run Nginx with a single worker and limited connections, sampling logs by connection ID.
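
Sampling by connection ID can also be done after the fact on the error log. The sketch below assumes the stock Nginx error-log layout (`date time [level] pid#tid: *connection message`); the helper name `filter_conn` is hypothetical, and the source does not show how the authors did their sampling. For the first strategy, stock Nginx offers the `debug_connection` directive, though the source does not say whether it was used.

```shell
#!/usr/bin/env bash
# filter_conn LOGFILE CONN_ID
# Print only the error-log lines that belong to one connection.
# Assumes the default Nginx error-log layout, where field 5 is "*<id>":
#   2024/01/01 12:00:00 [debug] 4242#0: *17 SSL handshake start
filter_conn() {
    local logfile=$1 conn_id=$2
    # Exact string match, so "*17" does not also select "*171".
    awk -v want="*${conn_id}" '$5 == want' "$logfile"
}
```

Calling `filter_conn logs/error.log 17` would then print the full history of connection 17, which is usually easier to read than an interleaved log from thousands of connections.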

Reproducing the Bug

Construct a stable environment that reliably triggers core dumps by injecting network instability and abnormal requests.

Traffic Control

Use Linux tc to emulate weak network conditions, or Facebook's open-source ATC (Augmented Traffic Control) tool for more complex scenarios.
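
As a rough illustration of the tc approach, the sketch below builds a netem command that adds delay, jitter, and packet loss to one interface. The interface name and the numbers are placeholders, not values from the original investigation, and the helper names are hypothetical. The functions only print the commands, so nothing changes until the printed line is run as root.

```shell
#!/usr/bin/env bash
# netem_cmd DEV DELAY JITTER LOSS
# Print the tc command that attaches a netem qdisc to DEV.
# Nothing is executed here; run the printed line as root to apply it.
netem_cmd() {
    local dev=$1 delay=$2 jitter=$3 loss=$4
    printf 'tc qdisc add dev %s root netem delay %s %s loss %s\n' \
        "$dev" "$delay" "$jitter" "$loss"
}

# netem_clear_cmd DEV
# Print the command that removes the root qdisc again.
netem_clear_cmd() {
    printf 'tc qdisc del dev %s root\n' "$1"
}

# Example with illustrative values: 100ms delay, 20ms jitter, 5% loss.
netem_cmd eth0 100ms 20ms 5%
```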

High‑Concurrency Load Generator

Deploy wrk, a multithreaded, event-driven HTTP benchmark capable of generating significant load from a single multi-core machine.

Example command:

wrk -t500 -c2000 -d30s https://127.0.0.1:8443/index.html

Distributed Test System

Control multiple client machines from a central node to achieve >30k QPS, varying protocols, ports, and cipher suites.

Concurrent start/stop of clients.

Support HTTP/HTTPS and reverse‑proxy testing.

Configurable test duration, URL, and SSL parameters.
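
The concurrent start/stop behaviour can be sketched with plain ssh and background jobs. This is only an assumption about how such a controller might look; the source does not describe the authors' implementation. The host names, thread count, and connection count are placeholders. By default the function runs in dry-run mode and only prints what it would execute.

```shell
#!/usr/bin/env bash
# start_clients URL DURATION HOST...
# Start wrk on every client host at once and wait until all have finished.
# With DRY_RUN=1 (the default) the ssh commands are printed, not executed.
start_clients() {
    local url=$1 duration=$2 host
    shift 2
    for host in "$@"; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "ssh -n $host wrk -t8 -c500 -d$duration $url"
        else
            # -n keeps ssh from reading stdin; & starts the clients together.
            ssh -n "$host" wrk -t8 -c500 "-d$duration" "$url" &
        fi
    done
    wait    # returns once every background client has exited
}
```

Because every client receives the same `-d` duration, the load also stops at roughly the same moment on all machines.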

Abnormal Request Scenarios

Randomly close TCP sockets during the connect syscall (10% probability).

Abort SSL handshake at client‑hello or client‑key‑exchange stages (10% probability each).

Send HTTPS requests encrypted with an incorrect public key (10% probability).
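
The selection logic behind these scenarios can be sketched as a weighted draw per connection. The sketch covers only the decision of which fault to inject; the faults themselves need a modified TLS client, which is not shown. It assumes the four faults are mutually exclusive, which the source does not state, and the fault names are hypothetical labels.

```shell
#!/usr/bin/env bash
# fault_for_roll ROLL
# Map a roll in 0..99 to a fault name. Each fault owns a 10-point band,
# so each fires on about 10% of connections; the rest proceed normally.
fault_for_roll() {
    local roll=$1
    if   [ "$roll" -lt 10 ]; then echo close_during_connect
    elif [ "$roll" -lt 20 ]; then echo abort_at_client_hello
    elif [ "$roll" -lt 30 ]; then echo abort_at_client_key_exchange
    elif [ "$roll" -lt 40 ]; then echo wrong_public_key
    else                          echo normal
    fi
}

# pick_fault
# Draw a roll with bash's $RANDOM and print the fault chosen for it.
pick_fault() {
    fault_for_roll $(( RANDOM % 100 ))
}
```

Keeping the mapping separate from the random draw makes each scenario easy to force on demand when a specific crash path needs to be replayed.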

Core Bug Fix Summary

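With the reproducible environment in place, repeated code changes, GDB sessions, and enriched logs finally revealed the root cause: a connection was missing its non-reusable flag, so its structure was recycled and its pointers set to NULL while the connection was still in use. Because the recycled structure could be touched from many code paths, the crash surfaced in different locations.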

Memory‑Leak Investigation

The usual first choice is valgrind, but its 10-50× slowdown makes it unsuitable for high-load tests.

Switch to AddressSanitizer (ASan), which typically costs only about a 2× slowdown. Compile Nginx with:

./configure \
    --with-cc="clang" \
    --with-cc-opt="-g -fPIC -fsanitize=address -fno-omit-frame-pointer" \
    --with-ld-opt="-fsanitize=address"

ASan quickly identified the leak related to OpenSSL error handling.
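
Under sustained load an instrumented binary can produce long leak reports, so a small summary helper is handy. The sketch below assumes reports in the usual LeakSanitizer text format ("Direct leak of N byte(s) in M object(s) allocated from:"). The log path is a placeholder and `leak_totals` is a hypothetical helper, not part of ASan.

```shell
#!/usr/bin/env bash
# One way to capture reports in files (paths are placeholders):
#   ASAN_OPTIONS=detect_leaks=1:log_path=/tmp/nginx-asan ./objs/nginx
# ASan appends ".<pid>" to log_path, giving one file per process.

# leak_totals FILE...
# Sum the "Direct leak of N byte(s) in M object(s)" lines of one or more
# LeakSanitizer reports and print "<bytes> <objects>".
leak_totals() {
    awk '/^Direct leak of/ { bytes += $4; objs += $7 }
         END { printf "%d %d\n", bytes, objs }' "$@"
}
```

Running `leak_totals /tmp/nginx-asan.*` after each test round gives a quick number to compare before and after a candidate fix.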

Performance Hotspot Analysis

Use Linux profiling tools (perf, oprofile, gprof, systemtap) to locate bottlenecks. Example perf workflow:

perf record -F 99 -p PID -g -- sleep 10
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > out.svg

Flame graphs revealed that rsaz_1024_mul_avx2 and rsaz_1024_sqr_avx2 consumed ~75% of samples, guiding further optimization.
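
A share like this can also be read directly from the folded stacks, without opening the SVG. The sketch below assumes the `out.perf-folded` format produced by stackcollapse-perf.pl, where each line is a semicolon-separated stack followed by a sample count; `sample_share` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# sample_share PATTERN FOLDED_FILE
# Print the percentage of samples whose stack contains a frame matching
# PATTERN. Each folded line is "frame;frame;...;frame <count>".
sample_share() {
    local pattern=$1 folded=$2
    awk -v pat="$pattern" '
        { total += $NF }              # last field is the sample count
        $0 ~ pat { hit += $NF }       # stack mentions the pattern
        END { if (total > 0) printf "%.1f\n", 100 * hit / total }
    ' "$folded"
}
```

For example, `sample_share 'rsaz_1024_(mul|sqr)_avx2' out.perf-folded` prints the combined share of the two RSA routines, which makes it easy to compare runs before and after an optimization.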

After RSA computation was offloaded through an asynchronous proxy-computation scheme, the RSA-related hotspots disappeared.

Mindset

Debugging rare, high‑concurrency bugs is a valuable learning opportunity; treat each crash as a chance to deepen tool knowledge and improve code quality.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
