Building a C Static Analyzer with MoonBit: From Lexer to Data‑Flow Analysis

This article walks through the design and implementation of a static analysis framework for C programs using MoonBit, covering lexical and syntactic parsing, AST generation, conversion to a control‑flow‑graph IR, and data‑flow analyses such as uninitialized‑variable detection and liveness.

21CTO
21CTO
21CTO
Building a C Static Analyzer with MoonBit: From Lexer to Data‑Flow Analysis

Introduction

Static analysis detects bugs such as use‑of‑uninitialized variables or possible null dereferences by inspecting source code without executing it. This summary describes the pipeline used to build a static analyzer for a tiny language and then for real‑world C programs using MoonBit’s MCIL (MoonBit C Intermediate Language).

Example Language

var x
var y = input()
if y > 0 {
    x = y + 1
}
return x

The program is buggy because x is only assigned in the if branch; the else path returns an uninitialized x. The analyzer will flag this.

Lexical and Syntactic Analysis

Lexer – converts the character stream into a token stream (e.g., var x = 0Var, Ident("x"), Eq, IntLit(0)).

Parser – builds an abstract syntax tree (AST) from the token stream using recursive‑descent parsing.

Compiler vs. Static Analyzer

Both share the front‑end pipeline source → lexer → parser → AST → IR. A compiler proceeds to code generation, while a static analyzer stops at the IR and emits warnings.

CIL Core Idea: Flatten Control Flow

CIL replaces nested control structures with basic blocks —linear sequences of instructions ending with an explicit jump. Example transformation:

if cond {          // → Block0: if cond goto Block1 else Block2
    A               // Block1: A; goto Block3
} else {           // Block2: B; goto Block3
    B               // Block3: ...
}

Loops are similarly expressed with back‑edges. Expressions are lowered to three‑address code (TAC):

t1 = z * w
t2 = y + t1
x = t2

IR definitions (MoonBit syntax):

enum Instr {
    BinOp(Operand, BinOp, Operand, Operand) // dest = left op right
    Copy(Operand, Operand)                // dest = src
    Call(Operand, String, Array[Operand]) // dest = func(args...)
}

enum Terminator {
    CondBr(Operand, Int, Int) // if cond goto block1 else block2
    Goto(Int)                // goto block
    Return(Operand)           // return value
}

struct Block {
    id    : Int
    instrs: Array[Instr]
    term  : Terminator
    preds : Array[Int]
    succs : Array[Int]
}

Data‑Flow Analysis on the CFG

Using a control‑flow graph (CFG) of basic blocks, we perform reaching‑definitions analysis to detect uninitialized variables:

Each assignment creates a new definition and kills previous definitions of the same variable.

At join points we take the union of definition sets.

If a variable is used with no reaching definition (and it is not a parameter), emit a warning.

for block in fun_ir.blocks {
    let mut defs = reaching.defs_in[block.id]
    for instr in block.instrs {
        for var in get_uses(instr) {
            if not(defs.has_def(var)) && not(is_param(var)) {
                warn("Variable may be uninitialized: " + var)
            }
        }
        match get_def(instr) {
            Some(var) => defs = defs.update(var, current_def)
            None => {}
        }
    }
}

Running this on the example yields:

Warning: Variable may be uninitialized: x (at Block 3, Return)

Forward vs. Backward Analyses

Forward analyses (e.g., reaching definitions) propagate from entry to exit; backward analyses (e.g., liveness) propagate from exit to entry. The framework is symmetric; only the direction of the transfer and merge functions changes.

struct ForwardConfig[T] {
    init    : T
    transfer: (Block, T) -> T   // entry → exit
    merge   : (T, T) -> T        // merge predecessors
    equal   : (T, T) -> Bool
    copy    : (T) -> T
}

struct BackwardConfig[T] {
    init    : T                     // exit initial state
    transfer: (Block, T) -> T       // exit → entry
    merge   : (T, T) -> T           // merge successors
    equal   : (T, T) -> Bool
    copy    : (T) -> T
}

Uninitialized‑Variable Detection Implementation

The concrete implementation follows the algorithm above, iterating over blocks until a fixed point is reached and emitting warnings for uses without reaching definitions.

MCIL – Full‑Featured C Analyzer

MCIL is a complete MoonBit implementation of CIL capable of handling real C programs. It includes:

Full token set ( sizeof, typedef, struct, ->, etc.).

Parser for the entire C grammar (function pointers, designated initializers, GCC extensions). cabs2cil conversion that normalizes types, inserts explicit casts, resolves scopes, and performs constant folding.

The same analysis framework as MiniCIL, enabling reuse of data‑flow engines for C.

Challenges in C‑Language Analysis

Pointer aliasing – determining which variables a dereferenced pointer may affect.

Inter‑procedural analysis – tracking data flow across function calls (e.g., detecting use‑after‑free).

Complex type system – handling array‑to‑pointer decay, struct layout, unions, and implicit conversions.

Undefined behavior – detecting overflow, null dereference, out‑of‑bounds access while keeping false positives low.

Conclusion

The end‑to‑end workflow consists of lexical and syntactic parsing to build an AST, conversion to a basic‑block CFG, and a generic data‑flow engine that iterates to a fixed point. Using this engine we implemented both liveness analysis and uninitialized‑variable detection, demonstrating the reusability of the framework. While CIL remains a valuable teaching tool, modern compilers now favor SSA‑based IRs such as LLVM/Clang for greater extensibility.

References

CIL original paper: "CIL: Intermediate Language and Tools for Analysis and Transformation of C Programs"

MiniCIL source: https://github.com/Lampese/MiniCIL

MCIL source: https://github.com/Lampese/MCIL

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

CompilerMoonBitdata flowuninitialized variableintermediate representationCIL
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.