How to Generate Lua‑Level Flame Graphs for OpenResty Using SystemTap and eBPF
This article explains how to produce Lua‑level flame graphs for OpenResty with the lj‑lua‑stacks tool from the SystemTap‑based stapxx project. It walks through the LuaJIT data structures used for call‑stack extraction and explores a possible eBPF‑based rewrite that avoids the risks of loading a SystemTap kernel module.
Introduction
This note describes how to generate Lua‑level flame graphs for OpenResty‑based Kubernetes gateways (e.g., ALB). Lua‑level flame graphs resolve Lua function names and source lines, allowing developers to pinpoint performance hot spots inside LuaJIT code.
Why Lua‑Level Flame Graphs
Standard Nginx flame graphs only show C symbols, which hides bottlenecks introduced by Lua code (e.g., slow Prometheus metric handling). By walking the LuaJIT call stack and emitting file:line entries, the flame graph reveals the exact Lua functions that dominate CPU time.
Toolchain
The community tool lj‑lua‑stacks (shipped with stapxx) uses SystemTap to inject probes into a running Nginx worker. It reads LuaJIT's internal structures to extract Lua function names and source locations, and it produces stack samples that are then folded and rendered with flamegraph.pl.
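flamegraph.pl reads "folded" stacks. Each line holds one distinct call stack, with frames joined by semicolons from root to leaf, followed by a space and the number of samples. The sketch below shows that aggregation step for Lua‑level frames.

The table sizes and the function names (fold_frames, folded_add, folded_print) are hypothetical. In practice, a collapse script such as stackcollapse-stap.pl from the FlameGraph repository typically performs this step on SystemTap output.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_STACKS 64   /* hypothetical cap on distinct stacks */
#define KEY_LEN 256     /* hypothetical cap on one folded line */

/* One folded line: "root;...;leaf" plus its sample count. */
struct folded_entry {
    char key[KEY_LEN];
    int count;
};

struct folded_table {
    struct folded_entry entries[MAX_STACKS];
    int n;
};

/* Join frames (root first) with ';' into buf, truncating overly long stacks. */
static void fold_frames(const char *const *frames, int nframes, char *buf, size_t len)
{
    assert(len > 0);
    buf[0] = '\0';
    for (int i = 0; i < nframes; i++) {
        if (i > 0)
            strncat(buf, ";", len - strlen(buf) - 1);
        strncat(buf, frames[i], len - strlen(buf) - 1);
    }
}

/* Record one sampled stack; identical stacks share a single counter. */
static void folded_add(struct folded_table *t, const char *const *frames, int nframes)
{
    char key[KEY_LEN];
    fold_frames(frames, nframes, key, sizeof(key));
    for (int i = 0; i < t->n; i++) {
        if (strcmp(t->entries[i].key, key) == 0) {
            t->entries[i].count++;
            return;
        }
    }
    if (t->n < MAX_STACKS) {  /* new stacks are dropped once the table is full */
        snprintf(t->entries[t->n].key, KEY_LEN, "%s", key);
        t->entries[t->n].count = 1;
        t->n++;
    }
}

/* Emit the table in the folded format consumed by flamegraph.pl. */
static void folded_print(const struct folded_table *t, FILE *out)
{
    for (int i = 0; i < t->n; i++)
        fprintf(out, "%s %d\n", t->entries[i].key, t->entries[i].count);
}
```

Recording the stack init.lua:10, metrics.lua:42, counter.lua:7 twice and the stack init.lua:10, router.lua:88 once yields two folded lines. The first is "init.lua:10;metrics.lua:42;counter.lua:7 2" and the second is "init.lua:10;router.lua:88 1".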
SystemTap Overview
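SystemTap compiles a script into a kernel module that can read the user‑space memory of the traced process. It also provides syntactic sugar such as @var("varname", "module"). This construct uses the binary's debug information to access a global variable of a running binary by name: its value directly, or its address with &@var(...).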
Running lj‑lua‑stacks with stapxx
sudo stap \
-k \
-x $NG_PID \
-d $target/nginx/sbin/nginx \
-d $target/luajit/lib/libluajit-5.1.so.2.1.0 \
-d /usr/lib/ld-linux-x86-64.so.2 \
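./all-in-one.stap
This is the underlying stap invocation. -x attaches to the target Nginx worker process, -d loads symbol information for each listed binary, and -k keeps the temporary build directory. When stap++ is used directly, only the worker PID is required. stap++ resolves the binary and shared‑library paths of the target process and merges the lj‑lua‑stacks sample with the tapset modules it uses into a single script.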
Lua‑Level Data Structures
Process Memory Layout
Each Nginx worker keeps a global ngx_cycle structure, through which the configuration of every module, including ngx_http_lua_module, is reachable. The Lua VM state is split between lua_State (one per coroutine) and global_State (shared by all coroutines of the VM).

Each lua_State owns an array of TValue stack slots. Frames lie between the bottom of that array (L->stack) and L->base, and the stack walk starts at L->base and moves toward the bottom. While JIT‑compiled trace code is running, L->base is not kept up to date, so the walk starts from global_State's jit_base instead.
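The choice of starting point can be shown with a heavily simplified model. In the sketch below, mock_TValue, struct mock_global_State, struct mock_lua_State, walk_start() and walk_span_slots() are hypothetical stand‑ins that keep only the fields discussed above; the real layouts come from LuaJIT's headers. The sketch also assumes that the lua_State being walked is the one currently executing, because jit_base only describes the running coroutine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins: only the fields used here are modeled. */
typedef struct { uint64_t u64; } mock_TValue;

struct mock_global_State {
    mock_TValue *jit_base;  /* base of the running trace, or NULL outside JIT code */
};

struct mock_lua_State {
    struct mock_global_State *glref;  /* VM-wide state shared by all coroutines */
    mock_TValue *stack;               /* bottom of this coroutine's slot array */
    mock_TValue *base;                /* base of the current interpreted frame */
    mock_TValue *top;                 /* first free slot */
};

/* Choose where the frame walk starts: jit_base while a trace runs, else L->base. */
static mock_TValue *walk_start(const struct mock_lua_State *L)
{
    const struct mock_global_State *g = L->glref;
    return g->jit_base != NULL ? g->jit_base : L->base;
}

/* Number of slots the walk may visit between the stack bottom and its start. */
static ptrdiff_t walk_span_slots(const struct mock_lua_State *L)
{
    mock_TValue *start = walk_start(L);
    assert(start >= L->stack);
    return start - L->stack;
}
```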
Call‑Stack Traversal
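Each frame slot references the function being executed as a garbage‑collected object (GCobj). If the object is a Lua function, its prototype (GCproto) holds the source chunk name and a line‑number table, and the current line is computed from the frame's bytecode position. If it is a C function, the object stores the C function pointer, which SystemTap's usymname() resolves to a C symbol name.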
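The line lookup can be pictured with a simplified model of a prototype. In the sketch below, struct mock_proto, struct mock_func, proto_line() and format_frame() are hypothetical.

LuaJIT's real GCproto stores line deltas in 8‑, 16‑ or 32‑bit entries depending on how many lines the function spans. Its own lookup (lj_debug_line()) also handles edge cases around the bytecode position that are skipped here. The "C:symbol" output form is simply the convention chosen for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical simplified prototype: one-byte line deltas per bytecode. */
struct mock_proto {
    const char *chunkname;    /* e.g. "@/path/to/metrics.lua" */
    int32_t firstline;        /* line where the function starts */
    uint32_t sizebc;          /* number of bytecode instructions */
    const uint8_t *lineinfo;  /* lineinfo[pc] = line - firstline */
};

/* Either a Lua function (with a prototype) or a C function. */
struct mock_func {
    int is_lua;
    const struct mock_proto *pt;  /* Lua functions only */
    const char *c_symbol;         /* C functions: a name resolved elsewhere, e.g. via usymname() */
};

/* Map a bytecode index to a source line using the prototype's delta table. */
static int32_t proto_line(const struct mock_proto *pt, uint32_t pc)
{
    assert(pc < pt->sizebc);
    return pt->firstline + (int32_t)pt->lineinfo[pc];
}

/* Format one frame as "file:line" for Lua functions or "C:symbol" for C functions. */
static int format_frame(const struct mock_func *fn, uint32_t pc, char *buf, size_t len)
{
    if (fn->is_lua) {
        const char *name = fn->pt->chunkname;
        /* Lua chunk names start with '@' (file name) or '=' (literal name). */
        if (name[0] != '\0' && strchr("@=", name[0]) != NULL)
            name++;
        return snprintf(buf, len, "%s:%d", name, (int)proto_line(fn->pt, pc));
    }
    return snprintf(buf, len, "C:%s", fn->c_symbol);
}
```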
Example Stack‑Dump Function
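function luajit_debug_dumpstack(L, T, depth, base, simple)
{
    bt = ""
    bot = $*L->stack->ptr64 + @sizeof_TValue  //@LJ_FR2
    level = 0
    dir = 1
    if (depth < 0) {  /* negative depth: dump frames in reverse order */
        level = ~depth
        depth = -1
        dir = -1
    }
    while (level != depth) {
        /* Locate the frame at `level` (mirrors LuaJIT's lj_debug_frame()). */
        tmp_level = level
        found_frame = 0
        for (nextframe = frame = base - @sizeof_TValue; frame > bot; ) {
            if (@frame_gc(frame) == L) { tmp_level++ }  /* skip dummy frames */
            if (tmp_level-- == 0) {
                size = (nextframe - frame) / @sizeof_TValue
                found_frame = 1
                break
            }
            nextframe = frame
            if (@frame_islua(frame)) {
                frame = @frame_prevl(frame)
            } else {
                if (@frame_isvarg(frame)) { tmp_level++ }  /* skip vararg pseudo-frame */
                frame = @frame_prevd(frame)
            }
        }
        if (!found_frame) { frame = 0; size = tmp_level }

        if (frame) {
            nextframe = size ? frame + size * @sizeof_TValue : 0
            fn = luajit_frame_func(frame)
            if (@isluafunc(fn)) {
                pt = @funcproto(fn)
                line = luajit_debug_frameline(L, T, fn, pt, nextframe)
                name = luajit_proto_chunkname(pt)  /* GCstr *name */
                path = luajit_unbox_gcstr(name)
                bt .= sprintf("%s:%d\n", path, line)
            }
            /* C and fast functions are handled by the full tool; omitted here. */
        } else if (dir == 1) {
            break
        } else {
            level -= size  /* reverse order: skip missing levels quickly */
        }
        level += dir
    }
    return bt
}
Problem Diagnosis Example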
Applying the generated flame graph to an ALB instance showed that a large fraction of CPU time was spent inside the Prometheus client library while handling metrics. The library was not optimized for Nginx's multi‑worker model, in which several single‑threaded worker processes update the same metrics. Upgrading to a newer version of the library reduced the metrics‑related overhead to an acceptable level.
Limitations of SystemTap
SystemTap runs as a kernel module; a buggy script can crash the kernel. Although the compiler performs strict checks, the risk of system instability remains.
Proposed eBPF Rewrite
Design Goals
Trace pointer fields in global_State and lua_State to walk the Lua call stack.
Capture the function name (or C symbol) for each frame.
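eBPF programs must pass the in‑kernel verifier, which among other things rejects loops it cannot prove to terminate. An eBPF stack walker therefore typically caps the number of frames it collects per sample.

The plain‑C sketch below shows that loop shape on a mock stack. struct mock_frame, its delta encoding, walk_frames() and MAX_LUA_DEPTH are hypothetical simplifications. A real program would read each slot with bpf_probe_read_user() instead of indexing a local array, and it would decode LuaJIT's frame links (frame_prevl/frame_prevd) rather than a plain delta.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cap on frames per sample; gives the loop a fixed trip count. */
#define MAX_LUA_DEPTH 32

/* Mock frame slot: the function at this frame and the distance back to the caller. */
struct mock_frame {
    uint32_t func_id;
    uint32_t delta;  /* slots to the previous frame; 0 ends the chain */
};

/* Walk from slot `base` toward slot 0, storing up to MAX_LUA_DEPTH function ids
 * (leaf first). Returns how many ids were stored. */
static int walk_frames(const struct mock_frame *stack, uint32_t nslots,
                       uint32_t base, uint32_t out[MAX_LUA_DEPTH])
{
    int depth = 0;
    uint32_t idx = base;
    for (int i = 0; i < MAX_LUA_DEPTH; i++) {  /* bounded, as the verifier requires */
        if (idx >= nslots)
            break;                              /* explicit bounds check on every read */
        out[depth++] = stack[idx].func_id;
        if (stack[idx].delta == 0 || stack[idx].delta > idx)
            break;                              /* reached (or would pass) the stack bottom */
        idx -= stack[idx].delta;
    }
    return depth;
}
```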
Extracting Offsets with pahole
pahole --compile -C GCobj,GG_State,lua_State,global_State /path/to/libluajit-5.1.so.2.1.0 > offsets.h
sed -i '/.*typedef.*__uint64_t.*/d' offsets.h
sed -i '/.*typedef.*__int64_t.*/d' offsets.h
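sed -i 's/Node/LJNode/g' offsets.h
pahole reads the library's DWARF debug information, so libluajit must be built with debug info (-g). The sed commands drop the __uint64_t/__int64_t typedefs and rename LuaJIT's Node type. This is presumably to avoid conflicting definitions once offsets.h is included alongside the kernel type header (vmlinux.h) of an eBPF program.
Sample excerpt for struct global_State (the pahole comments give each field's offset and size in bytes):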
struct global_State {
lua_Alloc allocf; /* 0 8 */
void * allocd; /* 8 8 */
GCState gc; /* 16 104 */
GCstr strempty; /* 120 24 */
uint8_t stremptyz; /* 144 1 */
// ...
};
Demo eBPF Program
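#include "vmlinux.h"           /* kernel types for the BPF program */
#include <bpf/bpf_helpers.h>  /* SEC(), __always_inline, bpf_probe_read_user(), offsetof */
#include "offsets.h"          /* LuaJIT struct layouts generated by pahole above */

/* Copy a whole `type` from user address p, then extract one field from the copy. */
#define READ_STRUCT(ret, ret_t, p, type, access)      \
    do {                                               \
        type val;                                      \
        bpf_probe_read_user(&val, sizeof(type), p);    \
        ret = (ret_t)((val)access);                    \
    } while (0)

/* Placeholder: user-space address of a lua_State in the target worker (e.g. the
 * main Lua VM), to be set by the loader before the program is loaded. */
const volatile __u64 GLP = 0x7cc2e558c380;

/* L->glref: the global_State shared by every coroutine of this VM. */
static __always_inline void *luajit_G(void)
{
    void *ret;
    READ_STRUCT(ret, void *, (void *)GLP, struct lua_State, .glref.ptr64);
    return ret;
}

/* G->cur_L: the lua_State (coroutine) that is currently executing. */
static __always_inline void *luajit_cur_thread(void *g)
{
    void *gco;
    size_t offset = offsetof(struct global_State, cur_L);
    READ_STRUCT(gco, void *, (char *)g + offset, struct GCRef, .gcptr64);
    return gco;
}

char LICENSE[] SEC("license") = "GPL";
References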
ALB open‑source gateway repository: https://github.com/alauda/alb.git
stapxx project (lj‑lua‑stacks): https://github.com/Kong/stapxx
stap++ source code: https://github.com/Kong/stapxx/blob/kong/stap%2B%2B
pahole manual: https://linux.die.net/man/1/pahole
Cloud Native Technology Community
The Cloud Native Technology Community, part of the CNBPA Cloud Native Technology Practice Alliance, focuses on evangelizing cutting‑edge cloud‑native technologies and practical implementations. It shares in‑depth content, case studies, and event/meetup information on containers, Kubernetes, DevOps, Service Mesh, and other cloud‑native tech, along with updates from the CNBPA alliance.
