
Debugging Rare Core Dumps and Memory Leaks in High‑Concurrency Nginx with OpenSSL

This article describes a real-world investigation of extremely rare core dumps and a memory leak in a heavily modified Nginx + OpenSSL stack under high concurrency. It covers the debugging workflow, custom stress-test tools, the use of gdb, valgrind, AddressSanitizer, perf, and flame graphs, and the performance-tuning lessons learned along the way.

Qunar Tech Salon

Project Background

We heavily modified the Nginx event framework and the OpenSSL stack to improve HTTPS full-handshake performance, which originally reached only ~400 qps per core with ECDHE_RSA.

Core Dump Debugging

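Core dumps occurred with a probability of about one in a hundred million at loads above 10k qps, often clustering at specific times. Conventional gdb sessions and debug logs were of little help because the asynchronous event model splits one logical request flow across multiple callbacks, so the backtrace at the crash site shows only the last callback and not the sequence of events that led to it.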

Adding defensive NULL-pointer checks prevented the crash at each patched site but masked the underlying issue, so core dumps kept reappearing in different locations.

Reproducing the Bug

To accelerate debugging, a stable environment that could reliably trigger core dumps was needed. Observations suggested a correlation with weak network conditions during night‑time maintenance.

Constructing Weak Network Conditions

Instead of using tc directly, we decided to generate abnormal requests that simulate network instability, focusing on the TCP and SSL handshake phases.

WRK Stress‑Test Tool

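We selected wrk for its multi-threaded, event-driven architecture, which is capable of generating millions of QPS. It was invoked as:

wrk -t500 -c2000 -d30s https://127.0.0.1:8443/index.html

Here -t sets the number of threads, -c the number of open connections, and -d the test duration.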

Distributed Automated Test System

A controller machine orchestrates multiple client machines to achieve the required aggregate QPS, supporting configurable protocols, ports, URLs, SSL versions, and cipher suites.
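As a rough illustration of how such a controller might divide the work, the Python sketch below splits a target aggregate QPS across client machines and attaches the same test parameters to each. Every name in it (`ClientPlan`, `split_qps`, `build_plans`, the field names) is hypothetical and not taken from the original system; it shows the idea only, under the assumption that load is shared evenly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientPlan:
    """Parameters handed to one client machine (field names are hypothetical)."""
    client: str
    qps: int
    url: str
    ssl_version: str
    cipher: str


def split_qps(total_qps, clients):
    """Divide total_qps evenly; the first few clients absorb the remainder."""
    if not clients:
        raise ValueError("at least one client machine is required")
    base, extra = divmod(total_qps, len(clients))
    return {c: base + (1 if i < extra else 0) for i, c in enumerate(clients)}


def build_plans(total_qps, clients, url, ssl_version, cipher):
    """Build one plan per client so the shares add up to total_qps."""
    shares = split_qps(total_qps, clients)
    return [ClientPlan(c, shares[c], url, ssl_version, cipher) for c in clients]
```

Giving the remainder to the first clients keeps the sum of the per-client shares exactly equal to the requested aggregate, which matters when the target rate is what triggers the bug.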

Abnormal Request Construction

Randomly close TCP sockets with a 10% probability.

Randomly abort SSL handshakes at the ClientHello or ClientKeyExchange stages with a 10% probability.

Send HTTPS requests encrypted with an incorrect public key (10% probability) to force decryption failures.
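The fault mix above can be sketched in Python using only the standard library. This is a minimal sketch, not the authors' tool: the function names are hypothetical, and it assumes the three faults are mutually exclusive per connection, which the text does not state. Only the TCP close and the abort after ClientHello are implemented; aborting at ClientKeyExchange or encrypting with a wrong public key needs a custom TLS stack and is left as a label.

```python
import random
import socket
import ssl
import struct

FAULT_PROBABILITY = 0.10  # per-fault rate quoted in the text


def choose_fault(rng, p=FAULT_PROBABILITY):
    """Pick at most one fault per connection (exclusivity is an assumption)."""
    u = rng.random()
    if u < p:
        return "tcp_close"
    if u < 2 * p:
        return "handshake_abort"
    if u < 3 * p:
        return "bad_key"  # not implemented in this sketch
    return "none"


def build_client_hello():
    """Return the raw bytes of a TLS ClientHello without touching the network."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE  # stress-test traffic only
    incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
    tls = ctx.wrap_bio(incoming, outgoing)
    try:
        tls.do_handshake()
    except ssl.SSLWantReadError:
        pass  # the ClientHello is now queued in `outgoing`
    return outgoing.read()


def tcp_close(host, port, timeout=2.0):
    """Connect, then reset the connection at once (SO_LINGER, zero timeout)."""
    sock = socket.create_connection((host, port), timeout=timeout)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
    sock.close()


def handshake_abort(host, port, timeout=2.0):
    """Send only the ClientHello, then drop the connection mid-handshake."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(build_client_hello())
```

A driver loop would call `choose_fault(random.Random())` once per connection and dispatch to the matching function, falling through to a normal request for `"none"`.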

Core Bug Fix Summary

With the reproducible test harness, core dumps could be triggered within seconds. This allowed rapid iteration of code changes, additional logging, and gdb analysis until the root cause was identified and fixed: a non-reusable connection structure was being misused under extreme concurrency.

Memory Leak

High‑concurrency tests also revealed a memory leak of ~500 MiB per hour.
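A growth figure like this can be estimated by sampling the worker's resident set size at intervals and fitting a line through the samples. The sketch below is one way to do that, not the method the authors used: `read_rss_kib` assumes a Linux `/proc` filesystem, and both function names are hypothetical.

```python
def read_rss_kib(pid):
    """Read the resident set size in KiB from /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    raise RuntimeError("VmRSS not found for pid %d" % pid)


def growth_mib_per_hour(samples):
    """Least-squares slope of (seconds, rss_kib) samples, in MiB per hour."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_r = sum(r for _, r in samples) / n
    num = sum((t - mean_t) * (r - mean_r) for t, r in samples)
    den = sum((t - mean_t) ** 2 for t, _ in samples)
    if den == 0:
        raise ValueError("samples must span more than one point in time")
    kib_per_second = num / den
    return kib_per_second * 3600 / 1024
```

Fitting a slope rather than comparing the first and last sample makes the estimate less sensitive to short-lived allocation spikes during the test run.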

Valgrind Limitations

Valgrind provides comprehensive memory error detection but reduces performance by 10‑50×, making it unsuitable for reproducing leaks that only appear under heavy load.

AddressSanitizer Advantages

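ASan offers fast detection with only ~2× slowdown, so the leak can still be reproduced under load. By recompiling Nginx with clang and -fsanitize=address (on Linux, ASan's integrated LeakSanitizer reports unfreed allocations when the process exits), we traced the leak to OpenSSL error-handling logic.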

Performance Hotspot Analysis

After fixing crashes and leaks, we focused on profiling to locate remaining bottlenecks using tools such as perf, oprofile, gprof, and systemtap.

Flame Graph

Generating a flame graph with:

perf record -F 99 -p PID -g -- sleep 10
perf script | ./stackcollapse-perf.pl > out.perf-folded
./flamegraph.pl out.perf-folded > out.svg

revealed that rsaz_1024_mul_avx2 and rsaz_1024_sqr_avx2 consumed ~75% of samples, guiding further optimization.
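The same folded-stack file can also be summarized numerically, which is handy for comparing runs before and after an optimization. The sketch below assumes the usual stackcollapse output format (semicolon-separated frames, then a space and a sample count); the function name `inclusive_share` is hypothetical.

```python
from collections import Counter


def inclusive_share(folded_lines):
    """Map each frame to the fraction of samples whose stack contains it."""
    total = 0
    hits = Counter()
    for line in folded_lines:
        line = line.strip()
        if not line:
            continue
        # The count is the last space-separated field; frames may contain spaces.
        stack, _, count = line.rpartition(" ")
        samples = int(count)
        total += samples
        # Use a set so recursive frames are counted once per stack.
        for frame in set(stack.split(";")):
            hits[frame] += samples
    if total == 0:
        return {}
    return {frame: n / total for frame, n in hits.items()}
```

Passing an open file works directly, for example `inclusive_share(open("out.perf-folded"))`, and sorting the result by value lists the hottest functions first.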

Mindset

The three‑week debugging effort was stressful but valuable; it reinforced the importance of treating hard bugs as learning opportunities, leveraging off‑hours for fresh thinking, and openly discussing problems with teammates.

Tags: Debugging, Performance, high concurrency, memory-leak, Nginx, Valgrind, Core Dump, asan