How to Generate Lua‑Level Flame Graphs for OpenResty Using SystemTap and eBPF

This article explains how to produce Lua‑level flame graphs for OpenResty by leveraging SystemTap’s lj‑lua‑stacks tool, demonstrates the underlying data structures and call‑stack extraction, and explores a possible eBPF‑based rewrite for safer, kernel‑level tracing.

Cloud Native Technology Community

Introduction

This note describes how to generate Lua‑level flame graphs for OpenResty‑based Kubernetes gateways (e.g., ALB). Lua‑level flame graphs resolve Lua function names and source lines, allowing developers to pinpoint performance hot spots inside LuaJIT code.

Why Lua‑Level Flame Graphs

Standard Nginx flame graphs only show C symbols, which hides bottlenecks introduced by Lua code (e.g., slow Prometheus metric handling). By walking the LuaJIT call stack and emitting file:line entries, the flame graph reveals the exact Lua functions that dominate CPU time.
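As a rough illustration of what "emitting file:line entries" leads to, the sketch below joins one sampled backtrace into the folded-stack line format that flamegraph.pl reads: frames separated by semicolons, root first, followed by a space and a sample count. The function name fold_stack and the Lua file names in the usage note are hypothetical; in the real pipeline this folding is done by the existing stack-collapse scripts, not by hand-written C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Join `n` frames into one folded line: "root;...;leaf count".
 * `frames` is ordered leaf first, the way a backtrace is printed.
 * Returns 0 on success, -1 if `out` is too small.
 */
int fold_stack(const char *const *frames, size_t n, unsigned long count,
               char *out, size_t outsz)
{
    size_t used = 0;
    int w;

    assert(out != NULL);
    if (outsz == 0)
        return -1;
    out[0] = '\0';

    /* Emit frames in reverse so the root comes first. */
    for (size_t i = n; i-- > 0; ) {
        w = snprintf(out + used, outsz - used, "%s%s",
                     frames[i], i ? ";" : "");
        if (w < 0 || (size_t)w >= outsz - used)
            return -1;          /* output would be truncated */
        used += (size_t)w;
    }

    /* Append the sample count after a single space. */
    w = snprintf(out + used, outsz - used, " %lu", count);
    if (w < 0 || (size_t)w >= outsz - used)
        return -1;
    return 0;
}
```

For example, a backtrace of prometheus.lua:120, balancer.lua:45, init.lua:10 (leaf first) sampled 7 times would be folded into a single line beginning with init.lua:10 and ending with the count 7.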

Toolchain

The community tool lj‑lua‑stacks (shipped with stapxx) uses SystemTap to inject probes into a running Nginx worker, reads LuaJIT internal structures, extracts function names, and feeds the data to flamegraph.pl.

SystemTap Overview

SystemTap compiles a script into a kernel module that can read user‑space memory. It provides syntactic sugar such as @var("name", "module"), where the first argument is a global variable name and the second is the binary or shared library that defines it, so a script can read that variable directly from a running process.

Running lj‑lua‑stacks with stapxx

sudo stap \
  -k \
  -x "$NG_PID" \
  -d "$target/nginx/sbin/nginx" \
  -d "$target/luajit/lib/libluajit-5.1.so.2.1.0" \
  -d /usr/lib/ld-linux-x86-64.so.2 \
  ./all-in-one.stap

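When the tool is launched through stapxx, only the Nginx PID is required: stapxx resolves the symbol paths (the -d options above) and merges the necessary script files into a single script, shown here as all-in-one.stap.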

Lua‑Level Data Structures

Process Memory Layout

Each Nginx worker is rooted at the ngx_cycle structure, which holds pointers to all modules, including ngx_http_lua_module. The Lua VM state is represented by lua_State (one per coroutine) and global_State (shared across coroutines). For a running Lua function, the slots between base and the current stack top belong to the current frame, and the frames of its callers lie below base. While JIT‑compiled code is running, the current base is held in jit_base (a field of global_State) instead.
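The choice of starting point can be sketched in a few lines of C. The two structs below are mock stand-ins that carry only the fields involved, not LuaJIT's real layouts, and walk_start is a hypothetical helper name; the only assumption taken from LuaJIT is that jit_base is non-zero only while a compiled trace is executing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock stand-in for global_State: only the field used here. */
typedef struct {
    uint64_t jit_base;  /* current base while JIT code runs, else 0 */
} MockGlobalState;

/* Mock stand-in for lua_State: only the fields used here. */
typedef struct {
    uint64_t base;      /* base of the currently executing function */
    uint64_t stack;     /* bottom of this coroutine's stack */
} MockLuaState;

/* Pick the address from which a backward frame walk would start. */
uint64_t walk_start(const MockLuaState *L, const MockGlobalState *g)
{
    assert(L != NULL && g != NULL);
    return g->jit_base ? g->jit_base : L->base;
}
```

In a real tracer both values would be read from the target process's memory rather than from local structs.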

Call‑Stack Traversal

Each frame points to a garbage‑collected object (gcobj). If the object is a Lua function, its source file name and line number are stored in the associated GCproto. If it is a C function, the C symbol is resolved from the function's address with SystemTap's usymname().

Example Stack‑Dump Function

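function luajit_debug_dumpstack(L, T, depth, base, simple)
{
    /* Abridged excerpt: the branch that prints C-function frames is not shown. */
    bt = ""
    level = 0
    dir = 1  /* the full script uses dir = -1 for a negative depth */
    while (level != depth) {
        /* Equivalent of LuaJIT's lj_debug_frame(L, level, &size). */
        bot = $*L->stack->ptr64 + @sizeof_TValue  // @LJ_FR2
        found_frame = 0
        tmp_level = level
        /* Traverse frames backwards, from the newest frame toward the stack bottom. */
        for (nextframe = frame = base - @sizeof_TValue; frame > bot; ) {
            if (@frame_gc(frame) == L) { tmp_level++ }  /* skip dummy frames */
            if (tmp_level-- == 0) {
                size = (nextframe - frame) / @sizeof_TValue
                found_frame = 1
                break
            }
            nextframe = frame
            if (@frame_islua(frame)) {
                frame = @frame_prevl(frame)
            } else {
                if (@frame_isvarg(frame)) { tmp_level++ }  /* skip vararg pseudo-frames */
                frame = @frame_prevd(frame)
            }
        }
        if (!found_frame) { frame = 0; size = tmp_level }
        if (frame) {
            nextframe = size ? frame + size * @sizeof_TValue : 0
            fn = luajit_frame_func(frame)
            if (@isluafunc(fn)) {
                pt = @funcproto(fn)
                line = luajit_debug_frameline(L, T, fn, pt, nextframe)
                name = luajit_proto_chunkname(pt)  /* GCstr *name */
                path = luajit_unbox_gcstr(name)
                bt .= sprintf("%s:%d\n", path, line)  /* one "file:line" entry per Lua frame */
            }
        } else if (dir == 1) {
            break
        } else {
            level -= size
        }
        level += dir
    }
    return bt
}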
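The backward traversal in the function above can be hard to follow through the macros, so here is a deliberately simplified C model of the same idea. It is a sketch under stated assumptions, not LuaJIT's real encoding: each mock slot stores its distance to the previous frame directly in a link word, whereas real LuaJIT derives the distance for Lua frames from the calling bytecode. The frame type tags follow LuaJIT's lj_frame.h; MockSlot, make_link and walk_frames are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frame type tags, stored in the low bits of the link word. */
enum { FRAME_LUA = 0, FRAME_C = 1, FRAME_CONT = 2, FRAME_VARG = 3,
       FRAME_TYPE = 3 };

/* Simplified stack slot (assumption: delta is stored, not derived). */
typedef struct {
    uint64_t link;   /* (delta_in_slots << 3) | frame type */
    int func_id;     /* stand-in for the function object of the frame */
} MockSlot;

uint64_t make_link(size_t delta, unsigned type)
{
    return ((uint64_t)delta << 3) | type;
}

/*
 * Walk from the frame at index `top` toward slot 0 (the stack bottom),
 * recording func_id values newest first. Vararg pseudo-frames are
 * stepped over without being recorded. Returns the number recorded.
 */
size_t walk_frames(const MockSlot *stack, size_t top, int *out, size_t max)
{
    size_t n = 0;
    size_t frame = top;

    assert(stack != NULL && out != NULL);
    while (frame > 0 && n < max) {
        unsigned type = (unsigned)(stack[frame].link & FRAME_TYPE);
        size_t delta = (size_t)(stack[frame].link >> 3);

        if (type != FRAME_VARG)
            out[n++] = stack[frame].func_id;
        if (delta == 0 || delta > frame)
            break;              /* corrupt link: stop instead of looping */
        frame -= delta;
    }
    return n;
}
```

The guard on delta matters in a tracer: the walked memory belongs to another process and may be read mid-update, so every step should be bounded.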

Problem Diagnosis Example

Applying the generated flame graph to an ALB instance revealed that a large fraction of CPU time was spent in the Prometheus client library handling metrics. The library version in use was not optimized for Nginx's multi‑worker (multi‑process) environment. Upgrading to a newer version that accounts for it reduced the metrics‑related latency to an acceptable level.

Limitations of SystemTap

A SystemTap script runs as a kernel module, so a buggy script can crash the kernel. Although the SystemTap compiler performs strict checks, the risk of system instability remains.

Proposed eBPF Rewrite

Design Goals

Trace pointer fields in global_State and lua_State to walk the Lua call stack.

Capture the function name (or C symbol) for each frame.

Extracting Offsets with pahole

pahole --compile -C GCobj,GG_State,lua_State,global_State /path/to/libluajit-5.1.so.2.1.0 > offsets.h
sed -i '/.*typedef.*__uint64_t.*/d' offsets.h
sed -i '/.*typedef.*__int64_t.*/d' offsets.h
sed -i 's/Node/LJNode/g' offsets.h

Sample excerpt for struct global_State:

struct global_State {
    lua_Alloc                  allocf;               /*     0     8 */
    void *                     allocd;               /*     8     8 */
    GCState                    gc;                   /*    16   104 */
    GCstr                      strempty;             /*   120    24 */
    uint8_t                    stremptyz;            /*   144     1 */
    // ...
};
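One way to use such a layout is to check member offsets at compile time before trusting them in a tracer. The sketch below mirrors only the five members in the excerpt above and treats the nested types (GCState, GCstr) as opaque byte arrays sized from the pahole comments. The name mock_global_State and the helper read_u8_at are hypothetical, and the checks assume a 64-bit target with 8-byte pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mock of the first members of global_State, sized from pahole output. */
struct mock_global_State {
    void *(*allocf)(void *ud, void *ptr, size_t osize, size_t nsize);
    void *allocd;
    unsigned char gc[104];        /* opaque stand-in for GCState */
    unsigned char strempty[24];   /* opaque stand-in for GCstr */
    uint8_t stremptyz;
};

/* Fail the build if the layout drifts from the pahole comments. */
_Static_assert(offsetof(struct mock_global_State, allocd) == 8, "allocd");
_Static_assert(offsetof(struct mock_global_State, gc) == 16, "gc");
_Static_assert(offsetof(struct mock_global_State, strempty) == 120, "strempty");
_Static_assert(offsetof(struct mock_global_State, stremptyz) == 144, "stremptyz");

/* Read one byte at a raw offset, the way a tracer addresses fields. */
uint8_t read_u8_at(const void *base, size_t off)
{
    assert(base != NULL);
    return *((const unsigned char *)base + off);
}
```

Because the offsets depend on the exact LuaJIT build, the header should be regenerated whenever the library is rebuilt or upgraded.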

Demo eBPF Program

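#include <nginx.h>

/* Copy a `type` from user address `p` and extract one member; yields 0 if the read fails. */
#define READ_STRUCT(ret, ret_t, p, type, access) \
    do { \
        type val; \
        if (bpf_probe_read_user(&val, sizeof(type), (p)) == 0) \
            ret = (ret_t)((val)access); \
        else \
            ret = (ret_t)0; \
    } while (0)

/* Placeholder address; luajit_G() reads it as a lua_State and follows glref to global_State. */
void *GLP = (void *)0x7cc2e558c380;

/* Return the global_State that the lua_State at GLP refers to. */
void *luajit_G(void) {
    void *ret;
    READ_STRUCT(ret, void *, GLP, struct lua_State, .glref.ptr64);
    return ret;
}

/* Return the lua_State currently running in the VM described by g. */
void *luajit_cur_thread(void *g) {
    void *gco;
    size_t offset = offsetof(struct global_State, cur_L);
    READ_STRUCT(gco, void *, (char *)g + offset, struct GCRef, .gcptr64);
    return gco; // points to the current lua_State
}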
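The pointer chase in the demo can be exercised in user space before loading anything into the kernel. The sketch below is an assumption-heavy mock: mock_probe_read_user is a hypothetical stand-in for bpf_probe_read_user implemented with memcpy, and the mock_ structs contain only the fields the chase touches, with made-up padding rather than the real LuaJIT layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for bpf_probe_read_user: a plain copy. */
static int mock_probe_read_user(void *dst, uint32_t size, const void *src)
{
    memcpy(dst, src, size);
    return 0;
}

#define READ_STRUCT(ret, ret_t, p, type, access)                      \
    do {                                                              \
        type val;                                                     \
        if (mock_probe_read_user(&val, sizeof(type), (p)) == 0)       \
            ret = (ret_t)(uintptr_t)((val)access);                    \
        else                                                          \
            ret = (ret_t)0;                                           \
    } while (0)

/* Mock layouts: only the fields used below, padding is arbitrary. */
struct mock_MRef  { uint64_t ptr64; };
struct mock_GCRef { uint64_t gcptr64; };
struct mock_global_State { uint64_t pad[4]; struct mock_GCRef cur_L; };
struct mock_lua_State    { uint64_t hdr; struct mock_MRef glref; };

/* lua_State -> global_State, via glref. */
void *mock_luajit_G(const void *L)
{
    void *ret;
    READ_STRUCT(ret, void *, L, struct mock_lua_State, .glref.ptr64);
    return ret;
}

/* global_State -> currently running lua_State, via cur_L. */
void *mock_luajit_cur_thread(const void *g)
{
    void *gco;
    size_t offset = offsetof(struct mock_global_State, cur_L);
    READ_STRUCT(gco, void *, (const char *)g + offset,
                struct mock_GCRef, .gcptr64);
    return gco;
}
```

Wiring a mock lua_State and global_State to point at each other and checking that the two functions return the expected addresses gives a quick sanity check of the macro and the offset arithmetic.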

References

ALB open‑source gateway repository: https://github.com/alauda/alb.git

stapxx project (lj‑lua‑stacks): https://github.com/Kong/stapxx

stap++ source code: https://github.com/Kong/stapxx/blob/kong/stap%2B%2B

pahole manual: https://linux.die.net/man/1/pahole
