Debugging Rare Core Dumps in High‑Concurrency Nginx: From GDB to ASan
This article details a real‑world investigation of extremely low‑probability core dumps and memory leaks in a heavily modified Nginx/OpenSSL stack, covering debugging strategies, custom traffic‑control testing, distributed load generation, use of valgrind and AddressSanitizer, performance profiling with perf, and the mindset needed to solve such high‑concurrency bugs.
Project Background
We performed deep modifications to the Nginx event framework and the OpenSSL protocol stack to boost HTTPS full‑handshake performance. Native Nginx computes RSA on the CPU and achieves only about 400 QPS per core, so even with 24 cores the throughput stays just under 10k QPS (24 × 400 = 9,600).
Problems Encountered
Extremely low‑probability core dumps (≈1 in 10⁸) under high load.
Memory leak that appears only when QPS exceeds 10k.
Need to locate performance hotspots after the code changes.
Core Dump Debugging Approach
Initial attempts with gdb and debug logs proved ineffective because the crashes were caused by NULL‑pointer dereferences that are hard to trace in an asynchronous, multi‑process event model.
Adding defensive NULL checks prevented the crash at the original location but merely moved it elsewhere, confirming that the root cause lay deeper in the code.
Improved Logging Strategies
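Enable debug logs only for specific client IPs (Nginx's debug_connection directive does this when the binary is built with --with-debug).
Add custom logs at a high severity level (EMERG) on critical code paths so they are written regardless of the configured log level.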
Run Nginx with a single worker and limited connections, sampling logs by connection ID.
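Sampling by connection ID can be done offline on the error log. The sketch below assumes the standard Nginx error‑log layout, in which the connection number appears as "*<n>" right after "pid#tid:". The function names and the sampling rule (keep every Nth connection) are illustrative choices, not taken from the original investigation.

```python
import re

# Assumed layout: "2024/01/01 10:00:00 [debug] 4242#0: *100 message"
CONN_RE = re.compile(r"\[\w+\] \d+#\d+: \*(\d+) ")


def connection_id(line):
    """Return the connection number found in an error-log line, or None."""
    match = CONN_RE.search(line)
    return int(match.group(1)) if match else None


def sample_by_connection(lines, every=100):
    """Keep only lines whose connection number is a multiple of `every`."""
    kept = []
    for line in lines:
        cid = connection_id(line)
        if cid is not None and cid % every == 0:
            kept.append(line)
    return kept


def trace_connection(lines, cid):
    """Return every line that belongs to one connection, in log order."""
    return [line for line in lines if connection_id(line) == cid]
```

Once a suspicious connection number shows up near a crash, trace_connection pulls its full history out of an otherwise interleaved log.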
Reproducing the Bug
Construct a stable environment that reliably triggers core dumps by injecting network instability and abnormal requests.
Traffic Control
Use Linux tc to emulate weak network conditions, or Facebook's open‑source ATC (Augmented Traffic Control) tool for more complex scenarios.
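As a rough illustration of the tc approach, the helper below only builds the argument list for a netem qdisc and does not execute anything. The interface name and the delay, jitter, and loss values are placeholders, not the settings used in the original tests.

```python
import shlex


def netem_command(dev, delay_ms=None, jitter_ms=None, loss_pct=None, action="add"):
    """Build a `tc qdisc <action> dev <dev> root netem ...` argv list."""
    argv = ["tc", "qdisc", action, "dev", dev, "root", "netem"]
    if delay_ms is not None:
        argv += ["delay", f"{delay_ms}ms"]
        if jitter_ms is not None:
            # netem takes the jitter as a second value after the delay
            argv.append(f"{jitter_ms}ms")
    if loss_pct is not None:
        argv += ["loss", f"{loss_pct}%"]
    return argv


def netem_clear(dev):
    """Build the command that removes the root qdisc again."""
    return ["tc", "qdisc", "del", "dev", dev, "root"]


if __name__ == "__main__":
    # Hypothetical values: 100 ms delay, 20 ms jitter, 5% packet loss on eth0.
    print(shlex.join(netem_command("eth0", delay_ms=100, jitter_ms=20, loss_pct=5)))
    print(shlex.join(netem_clear("eth0")))
```

Running the resulting commands requires root privileges, and the qdisc should be removed after each test run so later measurements are not skewed.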
High‑Concurrency Load Generator
Deploy wrk, a multithreaded, event‑driven HTTP benchmark capable of generating millions of QPS.
Example command:
wrk -t500 -c2000 -d30s https://127.0.0.1:8443/index.html
Distributed Test System
Control multiple client machines from a central node to achieve >30k QPS, varying protocols, ports, and cipher suites.
Concurrent start/stop of clients.
Support HTTP/HTTPS and reverse‑proxy testing.
Configurable test duration, URL, and SSL parameters.
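A central controller of this kind can be sketched in a few lines. The version below is an assumption‑laden outline, not the system described in the article: it presumes key‑based ssh access to each client, wrk installed on every client, and hypothetical host names. The runner is injectable so the fan‑out logic can be exercised without any real machines.

```python
import shlex
import subprocess
from concurrent.futures import ThreadPoolExecutor


def wrk_command(url, threads, connections, duration_s):
    """Build a wrk invocation from the configurable test parameters."""
    return ["wrk", f"-t{threads}", f"-c{connections}", f"-d{duration_s}s", url]


def ssh_runner(host, argv):
    """Run argv on a remote host over ssh and return its exit code."""
    remote = shlex.join(argv)  # quote once for the remote shell
    result = subprocess.run(["ssh", host, remote], capture_output=True, text=True)
    return result.returncode


def start_all(hosts, argv, runner=ssh_runner):
    """Start the same command on every host at roughly the same time."""
    with ThreadPoolExecutor(max_workers=max(1, len(hosts))) as pool:
        futures = {host: pool.submit(runner, host, argv) for host in hosts}
        return {host: future.result() for host, future in futures.items()}
```

For example, start_all(["client-1", "client-2"], wrk_command("https://10.0.0.1:8443/index.html", 8, 200, 30)) would launch the same 30‑second run on two hypothetical clients and return their exit codes.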
Abnormal Request Scenarios
Randomly close TCP sockets during the connect syscall (10% probability).
Abort SSL handshake at client‑hello or client‑key‑exchange stages (10% probability each).
Send HTTPS requests encrypted with an incorrect public key (10% probability).
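The scenarios above can be approximated with a small fault‑injecting client. The sketch below covers only the fault selection and the two simplest faults (closing right after connect, and aborting after the ClientHello). Treating the four faults as mutually exclusive, 10% each, is an assumption, and the later‑stage faults (client‑key‑exchange abort, wrong public key) would need a custom TLS client that is not shown here.

```python
import random
import socket
import ssl

# Assumed to be mutually exclusive: 10% each, the remaining 60% are normal requests.
FAULTS = [
    ("close_after_connect", 0.10),
    ("abort_at_client_hello", 0.10),
    ("abort_at_client_key_exchange", 0.10),
    ("wrong_public_key", 0.10),
]


def pick_fault(rng=random):
    """Choose which fault (if any) to inject for the next request."""
    r = rng.random()
    threshold = 0.0
    for name, probability in FAULTS:
        threshold += probability
        if r < threshold:
            return name
    return "normal"


def client_hello_bytes(server_hostname):
    """Produce a ClientHello record in memory, without any network I/O."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing, server_hostname=server_hostname)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the client now waits for the ServerHello
    return outgoing.read()


def close_after_connect(host, port):
    """Open a TCP connection and drop it immediately."""
    sock = socket.create_connection((host, port), timeout=5)
    sock.close()


def abort_at_client_hello(host, port, server_hostname):
    """Send only the ClientHello, then close the socket mid-handshake."""
    sock = socket.create_connection((host, port), timeout=5)
    try:
        sock.sendall(client_hello_bytes(server_hostname))
    finally:
        sock.close()
```

Mixing these faulty clients into normal load is what makes the server see half‑finished handshakes while other connections are still in flight.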
Core Bug Fix Summary
With the reproducible environment, repeated code changes, GDB sessions, and enriched logs finally revealed that a missing non‑reusable flag caused connection structures to be recycled and set to NULL, leading to crashes in different locations.
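The failure mode can be illustrated with a toy model. This is not Nginx code: the Conn and Pool classes are hypothetical and only mimic the idea that, under connection pressure, a pool may reclaim any connection still marked reusable and clear its state, even if asynchronous work on it is pending.

```python
from collections import deque


class Conn:
    def __init__(self, cid):
        self.cid = cid
        self.data = None       # per-connection state; None models a NULL pointer
        self.reusable = False


class Pool:
    def __init__(self, size):
        self.free = [Conn(i) for i in range(size)]
        self.reusable_queue = deque()

    def set_reusable(self, conn, flag):
        """Mark a connection as reclaimable (True) or in use (False)."""
        if conn.reusable:
            self.reusable_queue.remove(conn)
        conn.reusable = flag
        if flag:
            self.reusable_queue.append(conn)

    def get(self):
        """Hand out a connection, reclaiming a reusable one if none is free."""
        if not self.free and self.reusable_queue:
            victim = self.reusable_queue.popleft()
            victim.reusable = False
            victim.data = None          # state is torn down on reclaim
            self.free.append(victim)
        if not self.free:
            raise RuntimeError("no free connections")
        return self.free.pop()


def pending_state_survives(mark_non_reusable):
    """Return True if pending state is intact after the pool comes under pressure."""
    pool = Pool(size=1)
    conn = pool.get()
    conn.data = "handshake-in-progress"
    pool.set_reusable(conn, True)       # connection was idle before work began
    if mark_non_reusable:
        pool.set_reusable(conn, False)  # the step that was missing in the bug
    try:
        pool.get()                      # a new client arrives while work is pending
    except RuntimeError:
        pass                            # refusing the new client is the safe outcome
    return conn.data == "handshake-in-progress"
```

In this model, pending_state_survives(False) loses the state, so whichever code path touches conn.data next is where the crash surfaces. That matches the observation that NULL checks only moved the crash elsewhere.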
Memory‑Leak Investigation
The usual first choice is valgrind, but its 10‑50× slowdown makes it unsuitable for high‑load tests.
Switch to AddressSanitizer (ASan), which only roughly halves performance. Compile Nginx with:
--with-cc="clang" \
--with-cc-opt="-g -fPIC -fsanitize=address -fno-omit-frame-pointer"
Depending on the toolchain, the link step may also need -fsanitize=address, passed via --with-ld-opt.
ASan quickly identified the leak, which was related to OpenSSL error handling.
Performance Hotspot Analysis
Use Linux profiling tools (perf, oprofile, gprof, systemtap) to locate bottlenecks. Example perf workflow:
perf record -F 99 -p PID -g -- sleep 10
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > out.svg
Flame graphs revealed that rsaz_1024_mul_avx2 and rsaz_1024_sqr_avx2 consumed ~75% of samples, guiding further optimization.
After applying an asynchronous proxy computation solution, RSA‑related hotspots disappeared.
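The share of samples attributed to a function can also be computed directly from the folded stacks, which is handy for comparing runs before and after a change. The sketch below assumes the folded format written by stackcollapse-perf.pl (semicolon‑separated frames followed by a sample count). The sample stacks in the usage note are made up for illustration and are not measured data.

```python
def share_of_samples(folded_lines, needles):
    """Fraction of samples whose stack contains any of the given function names."""
    total = 0
    matched = 0
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated field; frames may contain spaces.
        stack, _, count = line.rpartition(" ")
        samples = int(count)
        total += samples
        frames = stack.split(";")
        if any(needle in frame for frame in frames for needle in needles):
            matched += samples
    return matched / total if total else 0.0


def load_folded(path):
    """Read a folded-stack file such as out.perf-folded."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return handle.readlines()
```

For instance, with hypothetical stacks in which 50 samples end in rsaz_1024_mul_avx2, 25 end in rsaz_1024_sqr_avx2, and 25 lie elsewhere, share_of_samples(lines, ["rsaz_1024_mul_avx2", "rsaz_1024_sqr_avx2"]) evaluates to 0.75.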
Mindset
Debugging rare, high‑concurrency bugs is a valuable learning opportunity; treat each crash as a chance to deepen tool knowledge and improve code quality.
