How to Profile WebAssembly Performance with Code Instrumentation and Flame Graphs

This article explains how to profile WebAssembly code by inserting lightweight instrumentation hooks, reducing their overhead with a tree-based call stack, and generating flame-graph data. It then demonstrates the whole workflow with a sample Fibonacci program compiled to WASM.

Alibaba Cloud Developer

WebAssembly and WAVM

WebAssembly (WASM) is a low-level bytecode format that languages such as C/C++ and Rust compile to. It was originally designed for near-native performance in browsers, but because it is portable, many projects also run it outside the browser in a standalone VM or runtime. This article uses WAVM, a JIT-based WASM runtime without built-in profiling support, to demonstrate the profiling techniques.

Performance Analysis – Instrumenting WASM

The basic idea is to record timestamps around each function's execution. The hooks can be placed in two ways: (1) inside the callee, at function entry and exit, or (2) in the caller, before and after each call instruction. The article adopts the first scheme for simplicity.

Instrumentation Function Design

The first implementation pairs a call stack with a hash table. perf_start pushes the function's name and entry time onto the stack; perf_end pops the stack, computes the cost of the function that was on top, and adds it to the hash-table entry for the current call path. This approach incurs noticeable overhead because every perf_end traverses the stack and concatenates strings to build that call path.
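The stack-plus-hash-table scheme described above can be sketched as follows. This is a minimal illustration, not the article's code: the names NaiveProfiler, Frame, PerfStart and PerfEnd are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical names throughout; this only illustrates the naive scheme.
struct Frame {
  std::string name;     // function name pushed at entry
  uint64_t entry_time;  // timestamp taken at entry
};

struct NaiveProfiler {
  std::vector<Frame> stack;                         // live call stack
  std::unordered_map<std::string, uint64_t> costs;  // call path -> total cost

  void PerfStart(const std::string& name, uint64_t now) {
    stack.push_back({name, now});
  }

  void PerfEnd(uint64_t now) {
    assert(!stack.empty());
    // The expensive part: on every function exit, walk the whole stack and
    // concatenate the names into a key such as "main;fib;fib".
    std::string path;
    for (const Frame& f : stack) {
      if (!path.empty()) { path += ';'; }
      path += f.name;
    }
    costs[path] += now - stack.back().entry_time;
    stack.pop_back();
  }
};
```

The cost of PerfEnd grows with call depth, and each call allocates a fresh string, which is the overhead the tree-based version avoids.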

Optimization – Tree‑Based Implementation

To reduce overhead, a tree structure records the call graph. On perf_start the runtime looks for an existing child node or creates a new one, updates the entry time, and moves the global pointer to the child. On perf_end the function’s cost is recorded and the pointer moves back to the parent. After these optimizations the instrumentation overhead drops to about 3%.

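// Called at function entry; func_id identifies the function being entered.
void perf_start(int32_t func_id) {
  PerfNode* cur_node = perf_data->perf_node();
  if (!cur_node) { return; }  // profiling inactive
  // Get or create the child node for this callee
  PerfNode* child_node = cur_node->GetChildNode(perf_data->buffer(), func_id);
  // No node available: clear the current pointer so later hooks return early
  if (!child_node) { perf_data->UpdatePerfNode(nullptr); return; }
  child_node->RecordEntry();
  perf_data->UpdatePerfNode(child_node);  // descend into the callee
}

// Called at function exit.
void perf_end() {
  PerfNode* cur_node = perf_data->perf_node();
  if (!cur_node) { return; }
  cur_node->RecordExit();
  perf_data->UpdatePerfNode(cur_node->parent());  // return to the caller
}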
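The listing above relies on a PerfNode type that the article does not show. The sketch below is one possible shape for it, under stated assumptions: only the method names GetChildNode, RecordEntry, RecordExit and parent come from the listing, while the fields, the std::chrono clock and the heap-allocated children are illustrative choices (the article's version takes a buffer argument and allocates from a pool instead).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical layout: only the method names come from the article.
class PerfNode {
 public:
  PerfNode(int32_t func_id, PerfNode* parent)
      : func_id_(func_id), parent_(parent) {}

  // Returns the child for func_id, creating it on first use. Children are
  // held by unique_ptr so node addresses stay valid when the vector grows.
  PerfNode* GetChildNode(int32_t func_id) {
    for (const auto& child : children_) {
      if (child->func_id_ == func_id) { return child.get(); }
    }
    children_.push_back(std::make_unique<PerfNode>(func_id, this));
    return children_.back().get();
  }

  void RecordEntry() { entry_ns_ = NowNs(); }

  // Adds the time elapsed since RecordEntry to this node's inclusive total.
  void RecordExit() {
    const uint64_t now = NowNs();
    assert(now >= entry_ns_);  // steady_clock never goes backwards
    total_ns_ += now - entry_ns_;
    ++calls_;
  }

  PerfNode* parent() const { return parent_; }
  uint64_t calls() const { return calls_; }
  uint64_t total_ns() const { return total_ns_; }
  std::size_t child_count() const { return children_.size(); }

 private:
  static uint64_t NowNs() {
    using namespace std::chrono;
    return static_cast<uint64_t>(
        duration_cast<nanoseconds>(steady_clock::now().time_since_epoch())
            .count());
  }

  int32_t func_id_;
  PerfNode* parent_;
  uint64_t entry_ns_ = 0;
  uint64_t total_ns_ = 0;
  uint64_t calls_ = 0;
  std::vector<std::unique_ptr<PerfNode>> children_;
};
```

Because each node is identified by its position in the tree, the call path never has to be rebuilt as a string at exit. A recursive call simply creates (or reuses) a child with the same function id one level deeper, so every recursion depth keeps its own entry time.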

Other Optimizations

Pool allocation for newly created child nodes.

Read timestamps with a cheap, high-resolution instruction such as rdtsc.

Limit instrumentation to a whitelist of functions to avoid overhead on trivial calls.
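The three optimizations above might look roughly like this. Everything here is an assumption made for illustration: NodePool, ReadTimestamp and ShouldInstrument are hypothetical names, the pool is a simple fixed-capacity array, and the rdtsc path applies only to x86 builds with GCC or Clang (other targets fall back to std::chrono).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <x86intrin.h>
#define SKETCH_HAS_RDTSC 1
#endif

// Reads a cheap counter. On x86 this is the TSC, measured in cycles rather
// than nanoseconds; elsewhere it falls back to steady_clock ticks.
inline uint64_t ReadTimestamp() {
#ifdef SKETCH_HAS_RDTSC
  return __rdtsc();
#else
  return static_cast<uint64_t>(
      std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Fixed-capacity pool: nodes come out of one preallocated array, so creating
// a child node never calls the general-purpose allocator on the hot path.
template <typename T>
class NodePool {
 public:
  explicit NodePool(std::size_t capacity) : slots_(capacity) {
    assert(capacity > 0);
  }

  // Returns nullptr once the pool is exhausted.
  T* Allocate() {
    if (used_ == slots_.size()) { return nullptr; }
    return &slots_[used_++];
  }

  std::size_t used() const { return used_; }

 private:
  std::vector<T> slots_;  // never resized, so returned pointers stay valid
  std::size_t used_ = 0;
};

// Whitelist check applied while rewriting the module: only functions whose
// index is listed receive the perf_start/perf_end hooks.
inline bool ShouldInstrument(const std::unordered_set<uint32_t>& whitelist,
                             uint32_t func_index) {
  return whitelist.count(func_index) != 0;
}
```

Returning nullptr on exhaustion fits the perf_start listing, which stops descending when no child node can be obtained. Note that TSC values are cycle counts, so they need converting if absolute times matter; for a flame graph only the relative proportions are used.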

Instrumentation Process

Add perf_start and perf_end to the Import Section, creating corresponding function types if needed.

Adjust all function indices because the import adds entries.

Insert a call to perf_start at the beginning of each function (with the function's index as the argument) and a call to perf_end before every return and at the end of the function body.

Update the Name Section with the new function names.
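The index adjustment in the steps above follows from how WASM numbers functions: imported functions come first in the function index space, followed by the functions defined in the module. The sketch below shows the arithmetic under one assumption that the article does not state explicitly, namely that the two hooks are appended after the module's existing function imports. HookLayout and the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: perf_start and perf_end are appended after the existing
// function imports, so they take the two indices right after them.
struct HookLayout {
  uint32_t num_imported_funcs;  // function imports before instrumentation
  uint32_t num_hooks;           // 2: perf_start and perf_end
};

inline uint32_t PerfStartIndex(const HookLayout& l) {
  return l.num_imported_funcs;
}

inline uint32_t PerfEndIndex(const HookLayout& l) {
  assert(l.num_hooks >= 2);
  return l.num_imported_funcs + 1;
}

// Maps a function index in the original module to the instrumented module.
inline uint32_t RemapFuncIndex(const HookLayout& l, uint32_t old_index) {
  // Existing imports keep their indices; defined functions shift up.
  return old_index < l.num_imported_funcs ? old_index
                                          : old_index + l.num_hooks;
}
```

The same remapping has to be applied everywhere a function index is stored, including call operands, exports, element segments, the start function and the Name Section. For example, with three existing imports, the first defined function moves from index 3 to index 5.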

Instrumentation Tools

WASM binaries can be parsed and rewritten with Rust crates such as wasmparser, wasm-encoder and wasmprinter, or with C++ tools such as wabt. Existing projects such as wasm-gas and paritytech/wasm-instrument provide a starting point; the article modifies wasm-instrument to add the profiling hooks.

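// Insert perf_start at function entry. The argument is the function's index
// in the instrumented module: the two added imports shift it by 2.
func_builder.instruction(&Instruction::I32Const((current_func_index + 2) as i32));
func_builder.instruction(&Instruction::Call(entry_func_index));

// Track block nesting so the function's final `end` can be told apart
// from the `end` of an inner block
let mut block_depth = 0;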
for op in operator_reader {
    let op = op?;
    match op {
        Operator::Call { function_index } => {
            handle_in_function_call(&mut func_builder, entry_func_index, exit_func_index, function_index)?;
        }
        Operator::Return => {
            func_builder.instruction(&Instruction::Call(exit_func_index));
            func_builder.instruction(&Instruction::Return);
        }
        Operator::Block { .. } | Operator::Loop { .. } | Operator::If { .. } | Operator::Try { .. } => {
            block_depth += 1;
            func_builder.instruction(&DefaultTranslator.translate_op(&op)?);
        }
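        Operator::End => {
            // Depth 0 means this `end` closes the function body itself,
            // so emit perf_end for the fall-through exit
            if block_depth == 0 {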
                func_builder.instruction(&Instruction::Call(exit_func_index));
            }
            func_builder.instruction(&DefaultTranslator.translate_op(&op)?);
            block_depth -= 1;
        }
        _ => {
            func_builder.instruction(&DefaultTranslator.translate_op(&op)?);
        }
    }
}
code_section_builder.function(&func_builder);

Generating Flame Graphs

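After the instrumented module runs, the runtime writes the recorded call costs to a file (perf.unfold in the commands below). The stackcollapse-perf.pl script collapses that file into folded stacks (perf.folded), and flamegraph.pl renders the folded stacks as an SVG flame graph.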

$FG_DIR/stackcollapse-perf.pl perf.unfold > perf.folded
$FG_DIR/flamegraph.pl perf.folded > perf.svg
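The folded format that flamegraph.pl reads is one line per stack: frame names joined by semicolons, then a space and a number. As an alternative to the collapse step shown above, a runtime could emit that format directly from its call tree. The sketch below assumes a simplified, hypothetical CallNode with an inclusive cost per node; it is not the article's output code.

```cpp
#include <cassert>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified call-tree node.
struct CallNode {
  std::string name;
  uint64_t total_cost = 0;         // inclusive: contains all children's cost
  std::vector<CallNode> children;
};

// Writes one "frame;frame;frame cost" line per node with non-zero self cost.
void WriteFolded(const CallNode& node, const std::string& prefix,
                 std::ostream& out) {
  assert(!node.name.empty());
  const std::string path =
      prefix.empty() ? node.name : prefix + ";" + node.name;

  uint64_t children_cost = 0;
  for (const CallNode& child : node.children) {
    children_cost += child.total_cost;
  }
  // flamegraph.pl adds up every line that passes through a frame, so each
  // line must carry the node's self cost, not its inclusive cost.
  const uint64_t self_cost =
      node.total_cost > children_cost ? node.total_cost - children_cost : 0;
  if (self_cost > 0) { out << path << ' ' << self_cost << '\n'; }

  for (const CallNode& child : node.children) {
    WriteFolded(child, path, out);
  }
}
```

Writing inclusive costs instead would count each child's time again in every ancestor and inflate the widths of the upper frames.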

Sample: Fibonacci

A simple Fibonacci program is compiled to WASM, instrumented, executed, and visualized. The resulting flame graph clearly shows the call hierarchy and relative costs, confirming that the instrumentation works.
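The article does not reproduce the sample's source. The sketch below is a stand-in that shows, at source level, what the instrumented function effectively does: the hooks here are hypothetical stubs that only count calls, whereas the real ones are imported from the runtime and inserted at the bytecode level.

```cpp
#include <cassert>
#include <cstdint>

// Stub hooks for illustration only; they count calls instead of timing them.
static uint64_t g_entries = 0;
static uint64_t g_exits = 0;
void perf_start(int32_t /*func_id*/) { ++g_entries; }
void perf_end() { ++g_exits; }

constexpr int32_t kFibFuncId = 0;  // hypothetical function index

// Naive recursion on purpose: it produces a deep, repetitive call tree.
uint64_t fib_instrumented(uint32_t n) {
  perf_start(kFibFuncId);
  const uint64_t result =
      n < 2 ? n : fib_instrumented(n - 1) + fib_instrumented(n - 2);
  perf_end();  // runs on every exit path, matching the entry hook
  return result;
}
```

Every entry is matched by exactly one exit, which is the invariant the rewriting step has to preserve when it inserts perf_end before each return and at the end of the function body.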

Conclusion

Instrumentation-based profiling of WASM is feasible but introduces overhead; careful design (a tree-based call graph and a whitelist of instrumented functions) can keep it at around 3%. The process also adds complexity, because the WASM binary must be rewritten with extra imports, shifted function indices, and an updated Name Section.

References

https://webassembly.org/

https://madewithwebassembly.com/

https://zsummer.github.io/2021/02/19/2021-04-02-perf-clock/

https://github.com/WebAssembly/wabt

https://github.com/ewasm/sentinel-rs

https://github.com/paritytech/wasm-instrument
