How a 500‑Line C Compiler Self‑Compiles: Inside the C4 Project
This article demystifies the C4 compiler—a minimalist C compiler written in just 528 lines and four functions—by explaining its architecture, showing a Hello World build, demonstrating self‑hosting, and detailing the supported language subset, bytecode format, and unique implementation tricks.
Overview of C4
C4 (C in four functions) is an ultra‑compact C compiler written in 528 lines of C and organized into just four functions: next (the lexer), expr (expression parsing and code generation), stmt (statement parsing), and main (setup, declaration parsing, and the virtual machine loop). Despite its size, it implements a full compilation pipeline: lexical analysis, parsing, semantic checking, code generation, and a tiny stack‑based virtual machine that executes the generated bytecode.
Hello World Example
First, compile the C4 source with GCC:
    gcc c4.c -o c4
Then use the newly built C4 compiler to compile and run a simple program, hello.c:
    ./c4 hello.c
The output shows the program's printed text followed by a line like exit(0) cycle = 9. That line reports the program's exit code and the number of bytecode instructions the VM executed (nine here), which is not the same as the number of instructions generated.
Self‑Hosting Demonstration
Self‑hosting means the compiler can compile its own source code. Call the GCC‑built binary A. The command
    ./c4 c4.c hello.c
has A compile c4.c to bytecode and run it in A's VM as a second compiler, B, which in turn compiles and runs hello.c. Adding one more level,
    ./c4 c4.c c4.c hello.c
makes B compile c4.c again to produce a third compiler, C, which finally compiles and runs hello.c. Each extra level executes more bytecode, because every compiler below the first is itself being interpreted, so the run takes noticeably longer.
Supported Subset of the C Language
C4 deliberately implements only the minimal subset required for self‑hosting:
Data types: char, int, pointers (including array‑style indexing with []), enum, and string literals (no array declarations, struct, typedef, or union).
Statements: if‑else, while, return, function definitions (no do‑while, switch, for, goto, etc.).
Operators: most arithmetic, relational, logical, bitwise, and assignment operators, plus the ternary ?: operator. Compound assignments such as +=, %=, <<=, and &= are not supported.
Built‑in library functions used by the compiler: open, read, close, printf, malloc, free, memset, memcmp, exit.
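Preprocessor directives (#include, #define, etc.) and multi‑line comments (/* … */) are not supported; only single‑line // comments are recognized. Lines beginning with # are simply skipped, which is how the #include lines at the top of c4.c pass through when C4 compiles itself.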
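To make the subset concrete, here is a minimal sketch of a function written to stay inside the constraints listed above. The names sum_squares and LIMIT are hypothetical, and the habit of declaring locals at the top of the function without initializers is an assumption about what C4 accepts, not something stated in this article. The code is also ordinary C, so GCC compiles it unchanged.

```c
#include <stdlib.h>

// A constant defined with enum, since #define is not available in the subset.
enum { LIMIT = 10 };

// Sum the squares of 1..n (n is clamped to LIMIT) using only int and
// pointer types, while, if, return, and plain assignment.
int sum_squares(int n)
{
  int i;       // assumption: locals are declared up front, no initializers
  int total;
  int *buf;    // a heap buffer stands in for a local array

  if (n > LIMIT) n = LIMIT;
  buf = malloc(n * sizeof(int));
  if (!buf) return -1;

  i = 0;
  while (i < n) {              // while replaces the missing for loop
    buf[i] = (i + 1) * (i + 1);
    i = i + 1;                 // written out because += is unavailable
  }

  total = 0;
  i = 0;
  while (i < n) {
    total = total + buf[i];
    i = i + 1;
  }

  free(buf);
  return total;
}
```

Calling sum_squares(3) adds 1 + 4 + 9. The restrictions cost a few extra lines, but nothing here needs a feature outside the list.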
Typical Compiler Pipeline (Reference)
Traditional compilers follow these stages, often implemented as separate passes over the source or over an intermediate representation:
Lexical analysis
Syntax analysis
Semantic analysis
Intermediate code generation
Optimization passes
Machine code generation
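C4 merges lexical analysis, parsing, semantic checks, and code generation into a single pass and skips the intermediate‑representation and optimization stages entirely. Lua's compiler takes a similar single‑pass approach, and it is one reason both compile quickly.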
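The single‑pass idea can be sketched in a few lines: the parser below emits stack‑machine code the moment it recognizes each construct, without building a syntax tree first. This is an illustration under simplifying assumptions, not C4's actual code. It handles only non‑negative integers with + and *, and the OP_ names and the compile function are hypothetical.

```c
#include <ctype.h>

// Hypothetical opcodes for a tiny accumulator-plus-stack machine.
enum { OP_IMM, OP_PSH, OP_ADD, OP_MUL, OP_EXIT };

static const char *src;  // cursor into the source text
static int *out;         // cursor into the emitted code

// Lex a number and emit "IMM value" in the same step.
static void factor(void)
{
  int v = 0;
  while (isdigit((unsigned char)*src)) v = v * 10 + (*src++ - '0');
  *out++ = OP_IMM;
  *out++ = v;
}

// term := factor ('*' factor)*
static void term(void)
{
  factor();
  while (*src == '*') {
    src++;
    *out++ = OP_PSH;   // save the left operand on the stack
    factor();
    *out++ = OP_MUL;
  }
}

// expr := term ('+' term)*
static void expr(void)
{
  term();
  while (*src == '+') {
    src++;
    *out++ = OP_PSH;
    term();
    *out++ = OP_ADD;
  }
}

// Compile text into code[] and return the number of ints emitted.
// The caller must supply a large enough buffer; there are no bounds checks.
int compile(const char *text, int *code)
{
  src = text;
  out = code;
  expr();
  *out++ = OP_EXIT;
  return (int)(out - code);
}
```

For the input 1+2*3 this emits IMM 1, PSH, IMM 2, PSH, IMM 3, MUL, ADD, EXIT. Operator precedence falls out of the call structure, so no separate tree walk is needed.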
Unique Features of C4
Instead of emitting native machine code, C4 translates C source into a custom 39‑instruction bytecode executed by a tiny stack‑based virtual machine, akin to Java or Lua. The VM includes special instructions for the built‑in library calls.
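A stack‑based VM of this kind is little more than a loop that fetches an opcode and updates an accumulator and a stack. The sketch below uses a hypothetical five‑opcode subset (the real instruction set has 39) and a hypothetical run function. It also counts executed instructions, which is the kind of figure the cycle = 9 line in the Hello World output reports.

```c
// Hypothetical five-opcode subset, named after common stack-machine operations.
enum { IMM, PSH, ADD, MUL, EXIT };

// Execute code[] until EXIT and return the accumulator.
// If cycles is not a null pointer, it receives the number of instructions executed.
// Sketch only: there are no stack bounds checks.
int run(const int *code, int *cycles)
{
  int stack[64];
  int *sp = stack + 64;   // the stack grows downward
  const int *pc = code;   // program counter
  int a = 0;              // accumulator
  int n = 0;              // executed-instruction counter

  while (1) {
    int op = *pc++;
    n++;
    if      (op == IMM) a = *pc++;       // load an immediate value
    else if (op == PSH) *--sp = a;       // push the accumulator
    else if (op == ADD) a = *sp++ + a;   // pop and add
    else if (op == MUL) a = *sp++ * a;   // pop and multiply
    else break;                          // EXIT, or an unknown opcode
  }
  if (cycles) *cycles = n;
  return a;
}
```

Running the program IMM 1, PSH, IMM 2, PSH, IMM 3, MUL, ADD, EXIT leaves 7 in the accumulator after 8 executed instructions, with EXIT counted as one of them.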
The compiler scans the source only once and the VM is a simple dispatch loop, so compilation and startup are fast. Performance is described as comparable to Lua's interpreter, although no measurements are given here.
Debugging is supported via two command‑line flags: -s prints the source alongside the generated bytecode, and -d traces each instruction as the VM executes it.
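One compact way to print instruction names in such a trace is to pack all mnemonics into a single string of fixed‑width entries and index into it, a trick in the spirit of C4's minimalism. The sketch below is an illustration, not actual C4 output or code: op_name is a hypothetical helper, and only the first six opcodes are included.

```c
#include <stdio.h>

enum { LEA, IMM, JMP, JSR, BZ, BNZ };

// Write the 4-character, space-padded mnemonic for op into buf.
// buf must hold at least 5 bytes (4 characters plus the terminator).
void op_name(int op, char *buf)
{
  // Every entry is exactly 5 characters: a 4-character name plus a comma,
  // so entry k starts at offset k * 5.
  const char *names = "LEA ,IMM ,JMP ,JSR ,BZ  ,BNZ ,";

  if (op < LEA || op > BNZ) {
    snprintf(buf, 5, "??? ");              // out-of-range opcode
    return;
  }
  snprintf(buf, 5, "%.4s", names + op * 5);
}
```

A trace loop could call op_name once per executed instruction and print the result next to the cycle counter. The fixed width avoids a separate array of string pointers.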
Conclusion
C4’s extreme minimalism makes it an excellent learning project for anyone interested in compiler construction. With only four functions and about 500 lines of code, it is approachable for beginners who have a basic understanding of compiler theory. The source is publicly available on GitHub for further exploration.
Liangxu Linux
Liangxu, a self‑taught IT professional now working as a Linux development engineer at a Fortune 500 multinational, shares extensive Linux knowledge: fundamentals, applications, tools, plus Git, databases, Raspberry Pi, etc.
