Understanding the ftrace Architecture: Ring Buffer, Tracers, and Trace Events in the Linux Kernel
This article explains the Linux kernel's ftrace architecture: ring buffer principles and code, tracer implementations (function, function_graph, irqsoff), trace events, static and dynamic instrumentation, and the kprobe mechanism, showing how tracing is integrated, managed, and optimized across execution contexts.
1. Ring Buffer
The ring buffer is the foundation of the ftrace framework; all raw trace data are recorded in it. It resides in memory for speed, minimizing the performance impact of recording, and its circular structure allows safe, continuous writes without wasting space while always holding the latest trace information.
The main challenge is to trace correctly in many contexts (normal, interrupt, NMI, soft‑IRQ, etc.) without affecting system logic or performance.
1.1 Ring Buffer Design
Problems to solve:
Access to the ring buffer can be interrupted in any context, so mutual‑exclusion is required.
Traditional locks would cause heavy blocking across contexts, harming performance.
Solution: treat the buffer as a classic producer/consumer system.
Producer/Consumer mode – writers continuously fill the buffer; if it becomes full and no consumer reads, new data are dropped.
Overwrite mode – when full, the producer overwrites the oldest data.
Write constraints for a per‑CPU ring buffer:
Only one writer may write at a time.
Higher‑priority writers may pre‑empt lower‑priority ones.
Read constraints:
Only one reader works at a time, but reads can occur concurrently with writes.
Read operations never interrupt writes; writes may interrupt reads.
Two read modes are supported: simple iterator reads (which temporarily block writes) and parallel custom reads that use a separate reader page to avoid blocking.
1.2 Code Flow and Framework
Ring-buffer allocation is performed by tracer_alloc_buffers(), which calls ring_buffer_alloc(). The main data structures are shown in the following diagram (image omitted).
struct ring_buffer holds a per-CPU struct ring_buffer_per_cpu.
struct ring_buffer_per_cpu allocates pages according to the configured buffer size and links them into a ring.
struct buffer_page is a control structure; struct buffer_data_page holds the actual data, preceded by timestamp and commit fields.
Three pointers (head_page, commit_page, tail_page) manage the page ring for reading, write confirmation, and writing respectively.
ring_buffer_per_cpu->reader_page provides a dedicated page for the reader mode.
2. ftrace Kernel Registration
The ftrace framework first creates a series of debugfs nodes. After core registration, each trace feature follows a common three‑step flow:
Function instrumentation – insert trace functions at probe points.
Input trace data – when a probe fires, the data (after filters and triggers are applied) are stored in the ring buffer.
Output trace data – userspace or programs read and parse the trace data.
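From userspace, the same three-step flow is visible through the tracefs interface. A minimal sketch, assuming tracefs is mounted at /sys/kernel/tracing (older kernels expose it at /sys/kernel/debug/tracing) and run as root:

```shell
# 1. Function instrumentation: select a tracer (inserts the probes).
echo function > /sys/kernel/tracing/current_tracer

# 2. Input: optionally filter which functions feed the ring buffer.
echo 'schedule*' > /sys/kernel/tracing/set_ftrace_filter

# 3. Output: read and parse the buffered trace data.
head /sys/kernel/tracing/trace
```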
2.1 Function Tracer Implementation
Function tracing relies on the _mcount() call that GCC inserts at every function entry when compiling with the -pg option. The kernel replaces the default _mcount implementation with its own.
2.1.1 Static Instrumentation
On ARM64 the static instrumentation path is arch/arm64/kernel/entry-ftrace.S . When CONFIG_DYNAMIC_FTRACE is disabled, each function call jumps to _mcount , which may be redirected to a specific tracer function if tracing is enabled; otherwise it returns via a stub.
2.1.2 Dynamic Instrumentation
Static ftrace incurs a large overhead because it instruments every function. Dynamic ftrace mitigates this by replacing the bl _mcount instruction with nop for functions that are not being traced, and restoring the call only for selected functions.
At build time, scripts/recordmcount.pl records all _mcount call sites; during kernel initialization they are replaced with nop .
When a tracer is enabled, the relevant nop entries are patched back to bl ftrace_caller .
The addresses of all _mcount call sites are stored in the __mcount_loc section (defined in include/asm-generic/vmlinux.lds.h ) and processed by kernel/trace/ftrace.c during boot.
2.1.3 irqsoff / preemptoff Tracers
The irqsoff tracer records functions that run with interrupts disabled, highlighting the longest such sections, which cause latency.
The preemptoff tracer records functions that run with kernel preemption disabled.
The preemptirqsoff tracer records sections where either preemption or interrupts are disabled.
All these tracers share the same hook function irqsoff_tracer_call() . They differ in where timing starts and stops: local_irq_disable()/local_irq_enable() for irqsoff, and preempt_disable()/preempt_enable() for preemptoff.
2.2 Trace Event
Trace events use the static tracepoint mechanism, which defines a stub function and a list of callbacks. When the stub is hit, each registered callback writes its data to the ring buffer.
struct tracepoint {
    const char *name;               /* Tracepoint name */
    struct static_key key;
    void (*regfunc)(void);
    void (*unregfunc)(void);
    struct tracepoint_func __rcu *funcs;
};

Typical operations on a tracepoint are:
Stub function: trace_##name()
Register callback: register_trace_##name()
Unregister callback: unregister_trace_##name()
Example from kernel/sched/core.c :
static void __sched notrace __schedule(bool preempt)
{
    ...
    trace_sched_switch(preempt, prev, next);
    ...
}

Adding a new trace event is simplified by the TRACE_EVENT() macro, which expands into the necessary stub, registration, and data‑recording code. This reduces boilerplate and keeps the kernel code clean.
3. kprobe Event
kprobe events provide dynamic instrumentation using breakpoint and single‑step exceptions, allowing probes to be placed at any instruction address.
kprobe: runs handlers before and after the probed instruction (kp.pre_handler(), kp.post_handler()).
jprobe: works only on function entry.
kretprobe: replaces the return address so a handler runs after the function returns.
Although kprobes have slightly higher overhead than static tracepoints, they offer great flexibility for tracing arbitrary kernel locations.
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.