Building a Simple LLVM-Based Compiler: Lexer, Parser, SSA, Optimizer, JIT and Mutable Variables
The tutorial walks through building a complete LLVM‑based Kaleidoscope compiler—from a tokenizing lexer and recursive‑descent parser, through AST construction, LLVM IR generation with SSA and phi nodes, optimization passes, JIT execution, and mutable variable handling via stack allocation—providing full C++ source examples.
This article presents a comprehensive step‑by‑step tutorial on constructing a simple compiler using LLVM, following the Kaleidoscope tutorial series. It starts with the goal of implementing a compiler for the Kaleidoscope language and proceeds through each compilation phase, providing detailed C++ code examples.
Lexical Analysis (Lexer)
#include <llvm/ADT/APFloat.h>
// ... (other includes)
enum Token {
TOKEN_EOF = -1,
TOKEN_DEF = -2,
TOKEN_EXTERN = -3,
TOKEN_IDENTIFIER = -4,
TOKEN_NUMBER = -5,
TOKEN_IF = -6,
TOKEN_THEN = -7,
TOKEN_ELSE = -8,
TOKEN_FOR = -9,
TOKEN_IN = -10,
TOKEN_BINARY = -11
};
// GetToken implementation with handling of identifiers, numbers, comments and keywordsThe lexer converts the input source into a stream of tokens such as identifiers, numbers, and the newly added keywords ( if , then , else , for , in , binary ).
Parsing and AST Construction
// AST base class
class ExprAST { public: virtual ~ExprAST() {} virtual llvm::Value* CodeGen() = 0; };
// Number, Variable, Binary, Call, Prototype, Function, If, For AST nodes
// Example of parsing a number literal
std::unique_ptr
ParseNumberExpr() { auto result = std::make_unique
(g_number_val); GetNextToken(); return result; }
// Parsing an if‑expression
std::unique_ptr
ParseIfExpr() { GetNextToken(); // eat if
auto cond = ParseExpression();
GetNextToken(); // eat then
auto thenExpr = ParseExpression();
GetNextToken(); // eat else
auto elseExpr = ParseExpression();
return std::make_unique
(std::move(cond), std::move(thenExpr), std::move(elseExpr)); }
// Parsing a for‑expression
std::unique_ptr
ParseForExpr() { GetNextToken(); // eat for
std::string varName = g_identifier_str; GetNextToken(); // eat var
GetNextToken(); // eat =
auto start = ParseExpression();
GetNextToken(); // eat ,
auto end = ParseExpression();
GetNextToken(); // eat ,
auto step = ParseExpression();
GetNextToken(); // eat in
auto body = ParseExpression();
return std::make_unique
(varName, std::move(start), std::move(end), std::move(step), std::move(body)); }The parser builds an Abstract Syntax Tree (AST) for expressions, function prototypes, definitions, extern declarations, and top‑level expressions.
Code Generation (LLVM IR)
// CodeGen for binary operators
llvm::Value* BinaryExprAST::CodeGen() {
llvm::Value* L = LHS->CodeGen();
llvm::Value* R = RHS->CodeGen();
switch (Op) {
case '<': return g_ir_builder.CreateUIToFP(g_ir_builder.CreateFCmpULT(L, R, "cmptmp"), llvm::Type::getDoubleTy(g_llvm_context), "booltmp");
case '+': return g_ir_builder.CreateFAdd(L, R, "addtmp");
case '-': return g_ir_builder.CreateFSub(L, R, "subtmp");
case '*': return g_ir_builder.CreateFMul(L, R, "multmp");
default: {
llvm::Function* Callee = GetFunction(std::string("binary") + Op);
llvm::Value* Args[] = {L, R};
return g_ir_builder.CreateCall(Callee, Args, "binop");
}
}
}
// CodeGen for if‑expression using phi node
llvm::Value* IfExprAST::CodeGen() {
llvm::Value* CondV = Cond->CodeGen();
CondV = g_ir_builder.CreateFCmpONE(CondV, llvm::ConstantFP::get(g_llvm_context, llvm::APFloat(0.0)), "ifcond");
llvm::Function* TheFunction = g_ir_builder.GetInsertBlock()->getParent();
llvm::BasicBlock* ThenBB = llvm::BasicBlock::Create(g_llvm_context, "then", TheFunction);
llvm::BasicBlock* ElseBB = llvm::BasicBlock::Create(g_llvm_context, "else");
llvm::BasicBlock* MergeBB = llvm::BasicBlock::Create(g_llvm_context, "ifcont");
g_ir_builder.CreateCondBr(CondV, ThenBB, ElseBB);
// then block
g_ir_builder.SetInsertPoint(ThenBB);
llvm::Value* ThenV = Then->CodeGen();
g_ir_builder.CreateBr(MergeBB);
ThenBB = g_ir_builder.GetInsertBlock();
// else block
TheFunction->getBasicBlockList().push_back(ElseBB);
g_ir_builder.SetInsertPoint(ElseBB);
llvm::Value* ElseV = Else->CodeGen();
g_ir_builder.CreateBr(MergeBB);
ElseBB = g_ir_builder.GetInsertBlock();
// merge block
TheFunction->getBasicBlockList().push_back(MergeBB);
g_ir_builder.SetInsertPoint(MergeBB);
llvm::PHINode* PN = g_ir_builder.CreatePHI(llvm::Type::getDoubleTy(g_llvm_context), 2, "iftmp");
PN->addIncoming(ThenV, ThenBB);
PN->addIncoming(ElseV, ElseBB);
return PN;
}The generated IR is printed for each step, showing how high‑level constructs map to LLVM instructions.
Optimization Passes
llvm::legacy::FunctionPassManager g_fpm(&g_module);
g_fpm.add(llvm::createInstructionCombiningPass());
g_fpm.add(llvm::createReassociatePass());
g_fpm.add(llvm::createGVNPass());
g_fpm.add(llvm::createCFGSimplificationPass());
g_fpm.doInitialization();
// In FunctionAST::CodeGen after creating the function:
g_fpm->run(*Func);These passes perform constant folding, instruction combining, dead‑code elimination, and control‑flow simplification.
Just‑In‑Time (JIT) Execution
#include "KaleidoscopeJIT.h"
std::unique_ptr
g_jit;
int main() {
llvm::InitializeNativeTarget();
llvm::InitializeNativeTargetAsmPrinter();
llvm::InitializeNativeTargetAsmParser();
g_jit.reset(new llvm::orc::KaleidoscopeJIT);
g_module->setDataLayout(g_jit->getTargetMachine().createDataLayout());
// REPL loop: parse, codegen, add module, find symbol, execute
}The JIT compiles functions on the fly, looks up symbols (including external C functions like sin and cos ), and executes top‑level expressions.
Static Single Assignment (SSA) and Phi Nodes
// Example of SSA transformation for a variable used in multiple branches
%y1 = fadd double %a, %b
%y2 = fadd double %c, %d
%y = phi double [ %y1, %bb1 ], [ %y2, %bb2 ]The tutorial explains why LLVM uses SSA form for registers while memory objects remain in stack slots, and how the mem2reg pass converts stack allocations into SSA registers automatically.
Mutable Variables via Stack Allocation
// Allocate a mutable variable in the entry block
llvm::AllocaInst* CreateEntryBlockAlloca(llvm::Function* F, const std::string& VarName) {
llvm::IRBuilder<> TmpB(&F->getEntryBlock(), F->getEntryBlock().begin());
return TmpB.CreateAlloca(llvm::Type::getDoubleTy(g_llvm_context), 0, VarName.c_str());
}
// Variable expression now loads from the alloca
llvm::Value* VariableExprAST::CodeGen() {
llvm::AllocaInst* A = g_named_values.at(Name);
return g_ir_builder.CreateLoad(A, Name.c_str());
}
// Assignment (e.g., in a for‑loop) stores to the alloca
llvm::Value* ForExprAST::CodeGen() {
llvm::AllocaInst* VarAlloca = CreateEntryBlockAlloca(Func, VarName);
llvm::Value* StartVal = Start->CodeGen();
g_ir_builder.CreateStore(StartVal, VarAlloca);
// loop body uses loads and stores; after the loop the alloca is removed from the map
}By allocating parameters and loop variables on the stack and then running the mem2reg (promote memory to register) pass, the compiler automatically generates phi nodes where needed, achieving mutable semantics without manual SSA construction.
Additional Features
User‑defined binary operators ( binary > , binary | , binary = ) with custom precedence.
Support for external functions (e.g., printd ) and linking to host symbols.
Full REPL loop that parses definitions, externs, and top‑level expressions, compiles them, and executes them immediately.
The article concludes with a complete source code listing, references to further reading on SSA, LLVM passes, and JIT compilation, and author information.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.