
Mastering Linux Kernel Tracepoints: How TRACE_EVENT() Automates Tracing

This article explains the evolution of Linux kernel tracepoints, the design of the TRACE_EVENT() macro, its six parameters, and how to define, compile, and use tracepoints such as sched_switch for efficient, low‑overhead kernel tracing across various tracers.

Big Data Technology Tribe

In the history of Linux development, developers have long wanted to add static tracepoints to the kernel—functions that record data at specific locations for later retrieval. Because of performance concerns, early attempts were not very successful.

Unlike the Ftrace function tracer, a tracepoint can record more than the fact that a function was entered; it can also capture the function's local variables. Several strategies were tried over time, and the TRACE_EVENT() macro is the most recent way to add tracepoints to the kernel.

1. History

Mathieu Desnoyers created a low‑overhead tracer hook called “trace markers”. Although trace markers solved the performance issue with clever macros, they embedded printf‑style strings directly in core kernel code, which many kernel developers disliked because it made the code look like scattered debugging statements.

To appease developers, Mathieu introduced the concept of tracepoints. A tracepoint inserts a function call in the kernel; when enabled it invokes a callback that receives the tracepoint’s parameters, similar to a normal callback. This approach is far better than trace markers because it allows passing typed pointers that the callback can dereference, whereas the marker interface required parsing strings. Using tracepoints, the callback can efficiently retrieve any needed data from structures.
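The idea of a tracepoint as a hook that forwards typed arguments to a callback can be sketched in user space. This is a simplified illustration, not the kernel's implementation, which is more involved. All names here (struct task, switch_probe, register_switch_probe, trace_switch, record_next_pid) are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for the kernel's task_struct. */
struct task {
    int pid;
    int prio;
};

/* Callback type: receives the tracepoint's typed arguments. */
typedef void (*switch_probe_t)(const struct task *prev,
                               const struct task *next);

/* NULL means the tracepoint is disabled. */
static switch_probe_t switch_probe;

/* Attach a callback, or detach it by passing NULL. */
static void register_switch_probe(switch_probe_t probe)
{
    switch_probe = probe;
}

/* The hook placed in the traced code path. */
static inline void trace_switch(const struct task *prev,
                                const struct task *next)
{
    if (switch_probe)               /* disabled: one test, no call */
        switch_probe(prev, next);
}

/* Example callback: dereferences the typed pointer directly,
 * with no string parsing involved. */
static int last_next_pid = -1;

static void record_next_pid(const struct task *prev,
                            const struct task *next)
{
    (void)prev;
    last_next_pid = next->pid;
}
```

Calling trace_switch() before register_switch_probe(record_next_pid) does nothing; after registration, each call stores next->pid in last_next_pid.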

Although an improvement, creating a separate callback for each desired tracepoint was still cumbersome. The kernel needed a more automated way to connect tracers to tracepoints, automating callback creation and data formatting—similar to what trace markers did but performed inside the callback rather than at the tracepoint location.

To address this, the TRACE_EVENT() macro was introduced, inspired by Tom Zanussi’s zedtrace. The macro lets developers add tracepoints to their subsystems without understanding Ftrace internals. It is tracer‑agnostic, working with Ftrace, perf, LTTng, and SystemTap.

2. Introducing the TRACE_EVENT() macro

Automated tracepoints must satisfy several requirements:

It must create a tracepoint that can be placed in kernel code.

It must create a callback that can hook to that tracepoint.

The callback must record the data to the tracer’s ring buffer as efficiently as possible.

It must provide a function that can parse the recorded data and convert it to a human‑readable format.

The TRACE_EVENT() macro takes six parameters:

// name - name of the tracepoint
// prototype - prototype of the tracepoint callback
// arguments - arguments matching the prototype
// structure - optional structure used by the tracer to store data
// assignment - C code that assigns data to the structure
// print - human‑readable ASCII format for printing the structure
TRACE_EVENT(name, proto, args, struct, assign, print)

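A typical example is the sched_switch tracepoint. Every parameter except the name is wrapped in a helper macro (TP_PROTO, TP_ARGS, TP_STRUCT__entry, TP_fast_assign, TP_printk); the wrappers allow commas to appear inside a single TRACE_EVENT() argument. The definition shown in the following sections comes from an older kernel; the prototype and fields of sched_switch have changed in later releases.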
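A small user-space sketch shows why the wrapper macros matter and how a name, a prototype and an argument list are enough to generate a trace_<name>() hook. It is a toy under stated assumptions: MY_PROTO, MY_ARGS, MY_TRACE_EVENT, demo_switch and demo_switch_record are made-up names, and the real TRACE_EVENT() generates considerably more than this.

```c
/* Wrappers that hand their arguments back unchanged. The parentheses of
 * the wrapper call keep the inner commas from splitting the outer
 * macro's argument list. */
#define MY_PROTO(...) __VA_ARGS__
#define MY_ARGS(...)  __VA_ARGS__

/* Toy generator: emits a callback pointer named <name>_probe and a
 * trace_<name>() hook that forwards its arguments to that callback. */
#define MY_TRACE_EVENT(name, proto, args)              \
    static void (*name##_probe)(proto);                \
    static inline void trace_##name(proto)             \
    {                                                  \
        if (name##_probe)                              \
            name##_probe(args);                        \
    }

/* Expands to demo_switch_probe and trace_demo_switch(int, int). */
MY_TRACE_EVENT(demo_switch,
    MY_PROTO(int prev_pid, int next_pid),
    MY_ARGS(prev_pid, next_pid))

/* Example callback that remembers the last values it was given. */
static int seen_prev = -1;
static int seen_next = -1;

static void demo_switch_record(int prev_pid, int next_pid)
{
    seen_prev = prev_pid;
    seen_next = next_pid;
}
```

Without the wrappers, the comma in "int prev_pid, int next_pid" would be read as separating two macro arguments, and MY_TRACE_EVENT would be called with the wrong number of parameters. The prototype is used to declare the hook and the argument list is used to call the callback, which is why both must be supplied.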

Name

The first parameter is the name, e.g. TRACE_EVENT(sched_switch, …). The actual tracepoint is referenced with the prefix trace_, resulting in trace_sched_switch.

Prototype

The second parameter provides the prototype of the tracepoint callback, for example:

TP_PROTO(struct rq *rq, struct task_struct *prev, struct task_struct *next)

The prototype is written as if the tracepoint were declared directly:

void trace_sched_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next);

Arguments

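The third parameter lists the arguments used in the prototype, e.g. TP_ARGS(rq, prev, next). Both are needed because the generated trace_sched_switch() must declare its parameters and then pass them on to the registered callbacks.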

Structure

The fourth parameter describes the layout of data stored in the tracer’s ring buffer. It uses macros like __array and __field to define each element.

TP_STRUCT__entry(
    __array(char, prev_comm, TASK_COMM_LEN)
    __field(pid_t, prev_pid)
    __field(int, prev_prio)
    __field(long, prev_state)
    __array(char, next_comm, TASK_COMM_LEN)
    __field(pid_t, next_pid)
    __field(int, next_prio)
)

This yields a C struct with the listed fields.
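The expansion can be reproduced in user space with toy versions of the two helpers, each turning into one struct member. This is a sketch only: the struct name sched_switch_entry is hypothetical, TASK_COMM_LEN is redefined locally, and the structure the kernel actually generates also carries a common event header in front of these fields.

```c
#include <stddef.h>
#include <sys/types.h>   /* pid_t */

#define TASK_COMM_LEN 16  /* redefined here for the sketch */

/* Toy definitions: each helper expands to one struct member. */
#define __field(type, item)       type item;
#define __array(type, item, len)  type item[len];

struct sched_switch_entry {       /* hypothetical name */
    __array(char, prev_comm, TASK_COMM_LEN)
    __field(pid_t, prev_pid)
    __field(int, prev_prio)
    __field(long, prev_state)
    __array(char, next_comm, TASK_COMM_LEN)
    __field(pid_t, next_pid)
    __field(int, next_prio)
};

/* After preprocessing, the body above reads:
 *     char  prev_comm[TASK_COMM_LEN];
 *     pid_t prev_pid;
 *     int   prev_prio;
 *     long  prev_state;
 *     char  next_comm[TASK_COMM_LEN];
 *     pid_t next_pid;
 *     int   next_prio;
 */
```

Because the field list is just macro calls, the same list can be expanded again with different helper definitions, for example to emit a description of each field instead of a member declaration.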

Assignment

The fifth parameter (TP_fast_assign) contains ordinary C code that copies the arguments into the structure fields. A special variable __entry points to the structure in the ring buffer.

TP_fast_assign(
    memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
    __entry->prev_pid = prev->pid;
    __entry->prev_prio = prev->prio;
    __entry->prev_state = prev->state;
    memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
    __entry->next_pid = next->pid;
    __entry->next_prio = next->prio;
)
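To see where this code ends up, it helps to picture the generated callback as reserving a slot in the buffer, pointing __entry at it, and then running the assignment statements. The following user-space sketch imitates that with a fixed array in place of the tracer's ring buffer. The names struct task, struct switch_entry, ring, ring_reserve and probe_switch are hypothetical, and the real buffer handling (reservation, commit, concurrency) is not modeled.

```c
#include <string.h>

#define TASK_COMM_LEN 16

/* Hypothetical stand-in for the fields of task_struct used here. */
struct task {
    char comm[TASK_COMM_LEN];
    int  pid;
    int  prio;
    long state;
};

/* Record layout, mirroring the fields described in the article. */
struct switch_entry {
    char prev_comm[TASK_COMM_LEN];
    int  prev_pid;
    int  prev_prio;
    long prev_state;
    char next_comm[TASK_COMM_LEN];
    int  next_pid;
    int  next_prio;
};

/* Toy ring buffer: a fixed array that wraps around. */
#define RING_SLOTS 8
static struct switch_entry ring[RING_SLOTS];
static unsigned int ring_head;

static struct switch_entry *ring_reserve(void)
{
    return &ring[ring_head++ % RING_SLOTS];
}

/* What a generated callback roughly does: reserve, then assign. */
static void probe_switch(const struct task *prev, const struct task *next)
{
    struct switch_entry *__entry = ring_reserve();

    memcpy(__entry->prev_comm, prev->comm, TASK_COMM_LEN);
    __entry->prev_pid   = prev->pid;
    __entry->prev_prio  = prev->prio;
    __entry->prev_state = prev->state;
    memcpy(__entry->next_comm, next->comm, TASK_COMM_LEN);
    __entry->next_pid   = next->pid;
    __entry->next_prio  = next->prio;
}
```

The assignment writes straight into the reserved slot, so no intermediate copy of the record is made. Only raw values are stored; turning them into text is left to the print step.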

Print

The final parameter defines how to print the stored data using TP_printk. It formats the fields with printf-style strings and may use helper functions like __print_flags.

TP_printk("prev_comm=%s prev_pid=%d prev_prio=%d prev_state=%s => next_comm=%s next_pid=%d next_prio=%d",
    __entry->prev_comm, __entry->prev_pid, __entry->prev_prio,
    __entry->prev_state ? __print_flags(__entry->prev_state, "|",
        {1, "S"}, {2, "D"}, {4, "T"}, {8, "t"},
        {16, "Z"}, {32, "X"}, {64, "x"}, {128, "W"}) : "R",
    __entry->next_comm, __entry->next_pid, __entry->next_prio)
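The flag decoding in the prev_state argument can be imitated in user space. The sketch below is not the kernel's __print_flags: print_flags, struct flag_name, task_states and state_str are hypothetical names, and bits missing from the table are silently ignored here. It only illustrates the technique of mapping set bits to names joined by a delimiter, with "R" used when no bit is set.

```c
#include <stdio.h>
#include <string.h>

struct flag_name {
    unsigned long mask;
    const char   *name;
};

/* Write the names of all set bits into buf, separated by delim. */
static const char *print_flags(char *buf, size_t len, unsigned long flags,
                               const char *delim,
                               const struct flag_name *tbl, size_t n)
{
    size_t used = 0;

    if (len == 0)
        return buf;
    buf[0] = '\0';

    for (size_t i = 0; i < n && used < len; i++) {
        int w;

        if (!(flags & tbl[i].mask))
            continue;
        /* Prefix the delimiter for every name after the first. */
        w = snprintf(buf + used, len - used, "%s%s",
                     used ? delim : "", tbl[i].name);
        if (w < 0)
            break;
        used += (size_t)w;
    }
    return buf;
}

/* Same value-to-letter table as in the TP_printk example. */
static const struct flag_name task_states[] = {
    { 1, "S" }, { 2, "D" }, { 4, "T" }, { 8, "t" },
    { 16, "Z" }, { 32, "X" }, { 64, "x" }, { 128, "W" },
};

/* Mirrors the conditional in the format arguments: 0 prints as "R". */
static const char *state_str(char *buf, size_t len, long state)
{
    if (!state)
        return "R";
    return print_flags(buf, len, (unsigned long)state, "|", task_states,
                       sizeof(task_states) / sizeof(task_states[0]));
}
```

With this table, a state of 0 yields "R", 1 yields "S", and 3 (bits 1 and 2 set) yields "S|D".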

The generated format file (e.g., /sys/kernel/debug/tracing/events/sched/sched_switch/format) contains the fields and a print format string that user‑space tools use to decode the binary output.

Header files

TRACE_EVENT() macros must be placed in header files under include/trace/events (or another appropriate location) and follow a specific pattern with TRACE_SYSTEM and include guards. The header must also include <linux/tracepoint.h> and, after the macro definitions, include <trace/define_trace.h> outside the protection block.
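The pattern is usually laid out along the following lines. This is a skeleton rather than a complete header: the system name sched and the guard name _TRACE_SCHED_H follow the article's running example, the TRACE_EVENT() definitions themselves are left as a placeholder comment, and the file only builds inside a kernel tree.

```c
#undef TRACE_SYSTEM
#define TRACE_SYSTEM sched

/* The guard lets the header be read more than once when
 * TRACE_HEADER_MULTI_READ is defined. */
#if !defined(_TRACE_SCHED_H) || defined(TRACE_HEADER_MULTI_READ)
#define _TRACE_SCHED_H

#include <linux/tracepoint.h>

/* TRACE_EVENT() definitions go here. */

#endif /* _TRACE_SCHED_H */

/* This part must be outside the protection block. */
#include <trace/define_trace.h>
```

The unusual guard condition exists because the tracing infrastructure reads the same header several times, each time with different definitions of the helper macros, so a plain include guard would block every pass after the first.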

Using tracepoints

To use a tracepoint, exactly one C file must define CREATE_TRACE_POINTS before including the header; this makes the header generate the functions that back the tracepoint. Defining it in more than one file for the same header causes linker errors. Other files simply include the header and call the tracepoint, e.g., trace_sched_switch(rq, prev, next); inside the scheduler's context switch code.
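As a fragment, and assuming the header lives at include/trace/events/sched.h as in the article's example, the two kinds of files look roughly like this. The file roles named in the comments are illustrative, and the snippet only builds inside a kernel tree.

```c
/* File 1: the single C file that instantiates the tracepoints. */
#define CREATE_TRACE_POINTS
#include <trace/events/sched.h>

/* File 2: any other file that only calls the tracepoint. */
#include <trace/events/sched.h>

/* Call site, with rq, prev and next already in scope
 * in the context switch path: */
trace_sched_switch(rq, prev, next);
```

The order matters in the first file: CREATE_TRACE_POINTS has to be defined before the header is included, otherwise the header only declares the tracepoint.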

Reference links: https://github.com/torvalds/linux/blob/master/samples/trace_events/trace-events-sample.h, https://github.com/torvalds/linux/blob/master/samples/trace_events/trace-events-sample.c, https://lwn.net/Articles/379903/