Fundamentals 47 min read

Building a Simple LLVM-Based Compiler: Lexer, Parser, SSA, Optimizer, JIT and Mutable Variables

The tutorial walks through building a complete LLVM‑based Kaleidoscope compiler—from a tokenizing lexer and recursive‑descent parser, through AST construction, LLVM IR generation with SSA and phi nodes, optimization passes, JIT execution, and mutable variable handling via stack allocation—providing full C++ source examples.

Tencent Cloud Developer
Tencent Cloud Developer
Tencent Cloud Developer
Building a Simple LLVM-Based Compiler: Lexer, Parser, SSA, Optimizer, JIT and Mutable Variables

This article presents a comprehensive step‑by‑step tutorial on constructing a simple compiler using LLVM, following the Kaleidoscope tutorial series. It starts with the goal of implementing a compiler for the Kaleidoscope language and proceeds through each compilation phase, providing detailed C++ code examples.

Lexical Analysis (Lexer)

#include <llvm/ADT/APFloat.h>
// ... (other includes)
enum Token {
  TOKEN_EOF = -1,
  TOKEN_DEF = -2,
  TOKEN_EXTERN = -3,
  TOKEN_IDENTIFIER = -4,
  TOKEN_NUMBER = -5,
  TOKEN_IF = -6,
  TOKEN_THEN = -7,
  TOKEN_ELSE = -8,
  TOKEN_FOR = -9,
  TOKEN_IN = -10,
  TOKEN_BINARY = -11
};
// GetToken implementation with handling of identifiers, numbers, comments and keywords

The lexer converts the input source into a stream of tokens such as identifiers, numbers, and the newly added keywords ( if , then , else , for , in , binary ).

Parsing and AST Construction

// AST base class
class ExprAST { public: virtual ~ExprAST() {} virtual llvm::Value* CodeGen() = 0; };
// Number, Variable, Binary, Call, Prototype, Function, If, For AST nodes
// Example of parsing a number literal
std::unique_ptr
ParseNumberExpr() { auto result = std::make_unique
(g_number_val); GetNextToken(); return result; }
// Parsing an if‑expression
std::unique_ptr
ParseIfExpr() { GetNextToken(); // eat if
  auto cond = ParseExpression();
  GetNextToken(); // eat then
  auto thenExpr = ParseExpression();
  GetNextToken(); // eat else
  auto elseExpr = ParseExpression();
  return std::make_unique
(std::move(cond), std::move(thenExpr), std::move(elseExpr)); }
// Parsing a for‑expression
std::unique_ptr
ParseForExpr() { GetNextToken(); // eat for
  std::string varName = g_identifier_str; GetNextToken(); // eat var
  GetNextToken(); // eat =
  auto start = ParseExpression();
  GetNextToken(); // eat ,
  auto end = ParseExpression();
  GetNextToken(); // eat ,
  auto step = ParseExpression();
  GetNextToken(); // eat in
  auto body = ParseExpression();
  return std::make_unique
(varName, std::move(start), std::move(end), std::move(step), std::move(body)); }

The parser builds an Abstract Syntax Tree (AST) for expressions, function prototypes, definitions, extern declarations, and top‑level expressions.

Code Generation (LLVM IR)

// CodeGen for binary operators
llvm::Value* BinaryExprAST::CodeGen() {
  llvm::Value* L = LHS->CodeGen();
  llvm::Value* R = RHS->CodeGen();
  switch (Op) {
    case '<': return g_ir_builder.CreateUIToFP(g_ir_builder.CreateFCmpULT(L, R, "cmptmp"), llvm::Type::getDoubleTy(g_llvm_context), "booltmp");
    case '+': return g_ir_builder.CreateFAdd(L, R, "addtmp");
    case '-': return g_ir_builder.CreateFSub(L, R, "subtmp");
    case '*': return g_ir_builder.CreateFMul(L, R, "multmp");
    default: {
      llvm::Function* Callee = GetFunction(std::string("binary") + Op);
      llvm::Value* Args[] = {L, R};
      return g_ir_builder.CreateCall(Callee, Args, "binop");
    }
  }
}
// CodeGen for if‑expression using phi node
llvm::Value* IfExprAST::CodeGen() {
  llvm::Value* CondV = Cond->CodeGen();
  CondV = g_ir_builder.CreateFCmpONE(CondV, llvm::ConstantFP::get(g_llvm_context, llvm::APFloat(0.0)), "ifcond");
  llvm::Function* TheFunction = g_ir_builder.GetInsertBlock()->getParent();
  llvm::BasicBlock* ThenBB = llvm::BasicBlock::Create(g_llvm_context, "then", TheFunction);
  llvm::BasicBlock* ElseBB = llvm::BasicBlock::Create(g_llvm_context, "else");
  llvm::BasicBlock* MergeBB = llvm::BasicBlock::Create(g_llvm_context, "ifcont");
  g_ir_builder.CreateCondBr(CondV, ThenBB, ElseBB);
  // then block
  g_ir_builder.SetInsertPoint(ThenBB);
  llvm::Value* ThenV = Then->CodeGen();
  g_ir_builder.CreateBr(MergeBB);
  ThenBB = g_ir_builder.GetInsertBlock();
  // else block
  TheFunction->getBasicBlockList().push_back(ElseBB);
  g_ir_builder.SetInsertPoint(ElseBB);
  llvm::Value* ElseV = Else->CodeGen();
  g_ir_builder.CreateBr(MergeBB);
  ElseBB = g_ir_builder.GetInsertBlock();
  // merge block
  TheFunction->getBasicBlockList().push_back(MergeBB);
  g_ir_builder.SetInsertPoint(MergeBB);
  llvm::PHINode* PN = g_ir_builder.CreatePHI(llvm::Type::getDoubleTy(g_llvm_context), 2, "iftmp");
  PN->addIncoming(ThenV, ThenBB);
  PN->addIncoming(ElseV, ElseBB);
  return PN;
}

The generated IR is printed for each step, showing how high‑level constructs map to LLVM instructions.

Optimization Passes

llvm::legacy::FunctionPassManager g_fpm(&g_module);
g_fpm.add(llvm::createInstructionCombiningPass());
g_fpm.add(llvm::createReassociatePass());
g_fpm.add(llvm::createGVNPass());
g_fpm.add(llvm::createCFGSimplificationPass());
g_fpm.doInitialization();
// In FunctionAST::CodeGen after creating the function:
g_fpm->run(*Func);

These passes perform constant folding, instruction combining, dead‑code elimination, and control‑flow simplification.

Just‑In‑Time (JIT) Execution

#include "KaleidoscopeJIT.h"
std::unique_ptr
g_jit;
int main() {
  llvm::InitializeNativeTarget();
  llvm::InitializeNativeTargetAsmPrinter();
  llvm::InitializeNativeTargetAsmParser();
  g_jit.reset(new llvm::orc::KaleidoscopeJIT);
  g_module->setDataLayout(g_jit->getTargetMachine().createDataLayout());
  // REPL loop: parse, codegen, add module, find symbol, execute
}

The JIT compiles functions on the fly, looks up symbols (including external C functions like sin and cos ), and executes top‑level expressions.

Static Single Assignment (SSA) and Phi Nodes

// Example of SSA transformation for a variable used in multiple branches
%y1 = fadd double %a, %b
%y2 = fadd double %c, %d
%y = phi double [ %y1, %bb1 ], [ %y2, %bb2 ]

The tutorial explains why LLVM uses SSA form for registers while memory objects remain in stack slots, and how the mem2reg pass converts stack allocations into SSA registers automatically.

Mutable Variables via Stack Allocation

// Allocate a mutable variable in the entry block
llvm::AllocaInst* CreateEntryBlockAlloca(llvm::Function* F, const std::string& VarName) {
  llvm::IRBuilder<> TmpB(&F->getEntryBlock(), F->getEntryBlock().begin());
  return TmpB.CreateAlloca(llvm::Type::getDoubleTy(g_llvm_context), 0, VarName.c_str());
}
// Variable expression now loads from the alloca
llvm::Value* VariableExprAST::CodeGen() {
  llvm::AllocaInst* A = g_named_values.at(Name);
  return g_ir_builder.CreateLoad(A, Name.c_str());
}
// Assignment (e.g., in a for‑loop) stores to the alloca
llvm::Value* ForExprAST::CodeGen() {
  llvm::AllocaInst* VarAlloca = CreateEntryBlockAlloca(Func, VarName);
  llvm::Value* StartVal = Start->CodeGen();
  g_ir_builder.CreateStore(StartVal, VarAlloca);
  // loop body uses loads and stores; after the loop the alloca is removed from the map
}

By allocating parameters and loop variables on the stack and then running the mem2reg (promote memory to register) pass, the compiler automatically generates phi nodes where needed, achieving mutable semantics without manual SSA construction.

Additional Features

User‑defined binary operators ( binary > , binary | , binary = ) with custom precedence.

Support for external functions (e.g., printd ) and linking to host symbols.

Full REPL loop that parses definitions, externs, and top‑level expressions, compiles them, and executes them immediately.

The article concludes with a complete source code listing, references to further reading on SSA, LLVM passes, and JIT compilation, and author information.

code generationCompilerJITLLVMKaleidoscopeoptimizerSSA
Tencent Cloud Developer
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.

0 followers
Reader feedback

How this landed with the community

login Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.