How to Debug Rare Core Dumps and Memory Leaks in High‑Concurrency Nginx/OpenSSL Deployments
This article walks through a real‑world investigation of intermittent core dumps and memory leaks in a heavily modified Nginx + OpenSSL stack under extreme QPS, detailing debugging strategies, custom load‑testing tools, and performance‑profiling techniques that helped pinpoint and fix the root causes.
Project Background
We deeply refactored Nginx's event framework and the OpenSSL protocol stack to boost HTTPS full‑handshake performance. The original Nginx used the CPU for RSA calculations, limiting ECDHE_RSA throughput to about 400 QPS per core; even with 24 cores the total stayed under 10 k QPS.
After the refactor, performance rose severalfold, but stress testing at >10 k QPS exposed three critical issues:
Extremely rare (≈1 in 10⁸) core dumps occurring at different code locations.
Memory leaks that appear only under high concurrency.
Difficulty locating performance hotspots for further optimization.
All problems manifested only when the load exceeded tens of thousands of QPS.
Core Dump Debugging Approach
Initial attempts with gdb and debug logs proved ineffective because the crashes were sporadic and the stack traces were incomplete in an asynchronous, multi‑process architecture.
Key observations:
All core dumps were caused by NULL pointer dereferences, but the pointers should never be NULL in the original Nginx code.
Repeatedly adding NULL checks prevented crashes at a specific site but caused new crashes elsewhere, indicating a deeper lifecycle issue.
The asynchronous event model splits a logical request into many independent callbacks, making it hard to trace which event set a pointer to NULL.
Example: a client GET request may trigger read events A→B, then later A→C; if the crash occurs in C, the earlier B call is lost from the stack.
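One way to recover the lost history is to keep a small per-connection trail of the callbacks that have run, so that a crash in C can still report that B ran earlier. The following is a minimal sketch of that idea in Python rather than in Nginx's C code; the Connection class, the ssl field, and the handler names are hypothetical stand-ins, not Nginx structures.

```python
from collections import deque


class Connection:
    """Hypothetical stand-in for a per-connection object (not ngx_connection_t)."""

    def __init__(self, conn_id, trail_size=8):
        self.conn_id = conn_id
        self.ssl = object()  # state that later callbacks expect to be present
        self.trail = deque(maxlen=trail_size)  # keeps only the newest entries

    def note(self, handler_name):
        # Record which callback touched this connection; old entries fall off.
        self.trail.append(handler_name)


def handler_a(conn):
    conn.note("A")


def handler_b(conn):
    conn.note("B")
    conn.ssl = None  # simulated lifecycle bug: B releases state C still needs


def handler_c(conn):
    conn.note("C")
    if conn.ssl is None:
        # The call stack here shows only C; the trail shows how we got here.
        raise RuntimeError(
            f"conn {conn.conn_id}: ssl is None, trail={list(conn.trail)}"
        )


def replay():
    # Events arrive as A, B, then later A, C, as in the example above.
    conn = Connection(conn_id=42)
    for handler in (handler_a, handler_b, handler_a, handler_c):
        handler(conn)
```

Calling replay() raises an error whose message includes the trail, which names B even though B is no longer on the stack. A bounded deque keeps the per-connection cost fixed, which matters when tens of thousands of connections are live.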
Because enabling DEBUG logging flooded the disk and killed performance, we experimented with selective logging:
Enable DEBUG only for specific client IPs.
Use high‑severity (EMERG) logs on critical paths.
Run a single Nginx worker and sample connections by ID.
These methods improved visibility but still did not isolate the root cause.
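The selection logic behind the first and third methods can be sketched as a single predicate. This is an illustrative Python sketch, not the code we ran; the network list, the sampling interval, and the function name are assumptions. In stock Nginx built with --with-debug, the debug_connection directive covers the per-IP case.

```python
import ipaddress

# Hypothetical values chosen for illustration only.
DEBUG_CLIENTS = [ipaddress.ip_network("10.0.0.0/30")]  # test client addresses
SAMPLE_EVERY = 1000  # log one connection in every 1000


def should_debug(conn_id, client_ip):
    """Return True if this connection should emit verbose logs."""
    addr = ipaddress.ip_address(client_ip)
    if any(addr in net for net in DEBUG_CLIENTS):
        return True  # always log traffic from the designated test clients
    return conn_id % SAMPLE_EVERY == 0  # otherwise sample by connection ID
```

Sampling by connection ID rather than per log line keeps every message for a sampled connection together, so a single request can still be followed from start to finish.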
Reproducing the Bug Offline
To accelerate debugging, we built a stable environment that reliably triggers core dumps:
Constructed a high‑concurrency pressure‑test system capable of generating tens of thousands of QPS.
Designed abnormal request patterns, especially during the TLS handshake, such as randomly closing sockets or sending malformed client‑hello messages.
We also considered using tc (traffic control) to simulate flaky networks, but ultimately focused on crafting malformed TLS traffic directly.
Tools for High‑Concurrency Testing
wrk – an open‑source HTTP benchmarking tool that combines multiple threads with asynchronous event loops (it uses Redis' ae event loop and an HTTP parser derived from Nginx's parsing code). It can generate very heavy load from a single multi‑core machine.
Typical command used:
wrk -t500 -c2000 -d30s https://127.0.0.1:8443/index.html
Because a single client cannot generate the required load, we created a distributed test harness that coordinates many machines to drive HTTPS traffic simultaneously, supporting configurable ports, URLs, cipher suites, and protocol versions.
Constructing Abnormal TLS Requests
Three fault injection scenarios were implemented:
Randomly close the TCP socket during the initial connect() (10% probability).
During the TLS handshake, close the connection either after sending ClientHello or after sending ClientKeyExchange (each with 10% probability).
Encrypt a request with an incorrect public key, forcing Nginx decryption to fail.
These injections reliably produced core dumps within seconds.
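The first two scenarios can be approximated with Python's standard ssl module, as in the rough sketch below. It is not our actual harness: the function names are hypothetical, certificate verification is disabled on the assumption that the target is a test server, and only the close-after-ClientHello variant is covered. Closing after ClientKeyExchange or encrypting with a wrong key needs lower-level control over the handshake than the ssl module exposes.

```python
import random
import socket
import ssl


def client_context():
    """TLS client context for a test server with a self-signed certificate."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    ctx.check_hostname = False  # assumption: test target, no real hostname
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def pick_fault(rng, p_connect=0.10, p_hello=0.10):
    """Choose a fault for one connection using the stated probabilities."""
    roll = rng.random()
    if roll < p_connect:
        return "close_after_connect"
    if roll < p_connect + p_hello:
        return "close_after_client_hello"
    return "none"


def build_client_hello():
    """Let the ssl module produce ClientHello bytes without using a socket."""
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = client_context().wrap_bio(incoming, outgoing)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # expected: the handshake is now waiting for the server's reply
    return outgoing.read()


def inject(host, port, rng):
    """Open one connection, apply the chosen fault, and report which one ran."""
    fault = pick_fault(rng)
    sock = socket.create_connection((host, port), timeout=5)
    try:
        if fault == "close_after_client_hello":
            sock.sendall(build_client_hello())
        elif fault == "none":
            with client_context().wrap_socket(sock):
                pass  # the full handshake completes inside wrap_socket
        # "close_after_connect" sends nothing before closing
    finally:
        sock.close()
    return fault
```

A driver would call inject("127.0.0.1", 8443, random.Random()) in a loop from many threads or processes. Passing the random generator in makes a failing sequence repeatable by reusing the same seed.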
Memory‑Leak Diagnosis
After fixing the core‑dump issue, a severe memory leak surfaced (≈500 MiB per hour under load). We evaluated two major analysis tools:
Valgrind
Pros: No recompilation needed, detects uninitialized memory, out‑of‑bounds accesses, etc.
Cons: Drastically slows execution (10‑50×). In our case, a 20 k QPS service dropped to ~400 QPS, making the leak invisible during realistic stress.
AddressSanitizer (ASan)
Pros: Integrated into Clang/GCC, incurs only ~2× slowdown, works well with high‑load tests.
We rebuilt Nginx with:
--with-cc="clang" \
--with-cc-opt="-g -fPIC -fsanitize=address -fno-omit-frame-pointer"
ASan quickly exposed the leak, which stemmed from improper handling of non‑reusable connections during massive asynchronous proxy calculations.
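To check whether a leak of this kind is present, or gone after a fix, the growth of a worker's resident memory can be measured during a load test. The sketch below is an assumption about how one might do this on Linux by reading VmRSS from /proc/<pid>/status; the function names and sampling parameters are illustrative and not part of our tooling.

```python
import re
import time


def parse_vm_rss_kib(status_text):
    """Extract VmRSS (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS not found")
    return int(match.group(1))


def growth_mib_per_hour(samples):
    """samples: list of (timestamp_seconds, rss_kib) pairs, oldest first."""
    (t0, r0), (t1, r1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need samples spanning a positive interval")
    return (r1 - r0) / 1024 / ((t1 - t0) / 3600)


def watch(pid, interval_s=60, count=10):
    """Sample one process's RSS `count` times and return its growth rate."""
    samples = []
    for i in range(count):
        with open(f"/proc/{pid}/status") as f:
            samples.append((time.time(), parse_vm_rss_kib(f.read())))
        if i < count - 1:
            time.sleep(interval_s)  # wait only between samples
    return growth_mib_per_hour(samples)
```

For example, growth_mib_per_hour([(0, 102400), (1800, 358400)]) returns 500.0, which corresponds to the leak rate we observed. A rate that stays near zero under sustained load is a reasonable sign that the fix holds.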
Performance Hotspot Analysis
With the core dumps and the memory leak resolved, we turned to profiling the optimized Nginx, with two goals:
Identify pre‑optimization bottlenecks.
Verify that no new hotspots were introduced.
We selected several Linux profiling tools:
perf – comprehensive kernel‑level profiler (recommended).
oprofile – older, less convenient.
gprof – requires recompilation, limited to user‑space.
systemtap – powerful dynamic tracing, steeper learning curve.
We sampled a running worker with perf and rendered the result with the FlameGraph scripts:
perf record -F 99 -p $PID -g -- sleep 10
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > out.svg
The resulting flame graphs highlighted functions such as rsaz_1024_mul_avx2 and rsaz_1024_sqr_avx2 consuming ~75% of samples.
These insights guided us to offload RSA calculations to an asynchronous proxy, eliminating the hotspot.
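The share of samples per function can also be read directly from the folded file, without opening the SVG. The sketch below assumes the usual stackcollapse output, one line per stack with frames separated by semicolons and the sample count after the last space; the function name is hypothetical.

```python
from collections import Counter


def leaf_shares(folded_lines):
    """Return {function: fraction of samples} for the leaf frame of each stack."""
    counts = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        stack, _, n = line.rpartition(" ")  # count follows the last space
        leaf = stack.split(";")[-1]  # last frame is where the CPU was sampled
        counts[leaf] += int(n)
    total = sum(counts.values())
    return {fn: c / total for fn, c in counts.items()} if total else {}
```

Running it over open("out.perf-folded") and sorting the result by value lists the hottest leaf functions first. It counts only leaf frames, so time spent in callees is not attributed to their callers, unlike the widths in the flame graph.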
Mindset and Takeaways
The three‑week debugging marathon taught several lessons:
Treat hard‑to‑reproduce bugs as valuable learning opportunities.
Persistently instrument code and logs, even if it temporarily hurts performance.
Collaborate openly; many breakthroughs came from team discussions.
Understanding tools like tc, wrk, perf, Valgrind, and ASan equips engineers to tackle rare, high‑impact failures in production systems.
ITPUB