
How a 500‑Line C Compiler Self‑Compiles: Inside the C4 Project

This article demystifies the C4 compiler—a minimalist C compiler written in just 528 lines and four functions—by explaining its architecture, showing a Hello World build, demonstrating self‑hosting, and detailing the supported language subset, bytecode format, and unique implementation tricks.

Liangxu Linux

Overview of C4

C4 (C in four functions) is an ultra-compact C compiler: 528 lines of C organized into just four functions (next, expr, stmt, and main). Despite its size, it implements a full compilation pipeline: lexical analysis, parsing, semantic checking, code generation, and a tiny stack-based virtual machine that executes the generated bytecode.

Hello World Example

First build C4 itself with GCC:

    gcc c4.c -o c4

Then use the newly built C4 to compile and run a simple program, hello.c:

    ./c4 hello.c

The output shows the program's printed text followed by a line like exit(0) cycle = 9. The value in parentheses is the program's exit code, and the cycle count is the number of bytecode instructions the virtual machine executed, not the number that were generated.

Self‑Hosting Demonstration

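Self-hosting means the compiler can compile its own source code. C4 never writes an executable to disk: it compiles a source file to bytecode in memory, runs it immediately, and passes the remaining command-line arguments to the compiled program. So when the GCC-built c4 (call it A) is given c4.c, the result is a second compiler B running inside A's virtual machine, and B in turn compiles and runs hello.c:

    ./c4 c4.c hello.c

Adding one more level (./c4 c4.c c4.c hello.c) makes B compile c4.c into a third compiler C, which finally compiles and runs hello.c. Each compiler is interpreted by the one before it, so every additional level executes far more VM instructions and takes longer to run.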

Supported Subset of the C Language

C4 deliberately implements only the minimal subset required for self‑hosting:

Data types: char, int, pointers, enum constants, and string literals; array-style indexing with [] works on pointers (no struct, typedef, union).

Statements: if‑else, while, return, function definitions (no do‑while, switch, for, goto, etc.).

Operators: almost all arithmetic, relational, logical, bitwise, assignment, and the ternary ?: operator (except compound assignments like +=, %=, <<=, &=).

Built‑in library functions used by the compiler: open, read, close, printf, malloc, free, memset, memcmp, exit.

Preprocessor directives (#include, #define, etc.) are not processed: the lexer simply skips any line that begins with #. Multi-line comments (/* … */) are not supported; only single-line // comments are recognized.
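To make the subset concrete, here is a small sketch that stays within the constructs listed above: only int, char, pointers, an enum constant, while loops, and plain assignment. The function names count_char and sum_to are hypothetical and do not come from c4.c.

```c
// Sketch restricted to the C4 subset: no for, no +=, no struct, no block comments.
// count_char and sum_to are illustrative names, not part of c4.c.

enum { NUL = 0 };   // an enum constant can stand in for a #define constant

// Count how many times c occurs in the NUL-terminated string s.
int count_char(char *s, int c)
{
  int n;            // locals are declared at the top of the function
  n = 0;
  while (*s != NUL) {
    if (*s == c) n = n + 1;   // spelled out because += is unavailable
    s = s + 1;
  }
  return n;
}

// Sum the integers 1..limit, using while in place of for.
int sum_to(int limit)
{
  int i; int total;
  i = 1;
  total = 0;
  while (i <= limit) {
    total = total + i;
    i = i + 1;
  }
  return total;
}
```

Because the subset is plain C, the same file also builds unchanged with GCC; calling sum_to(10) from a main function yields 55 either way.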

Typical Compiler Pipeline (Reference)

Traditional compilers are organized as a sequence of stages, each typically consuming the output of the previous one:

Lexical analysis

Syntax analysis

Semantic analysis

Intermediate code generation

Optimization passes

Machine code generation

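C4 merges lexical analysis, parsing, semantic checks, and code generation into a single pass and has no separate intermediate-code or optimization stage. Lua's bytecode compiler works in a similar single-pass fashion, and the approach keeps compilation fast.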
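The single-pass idea can be sketched with a toy expression compiler that reads characters, decides on structure, and emits stack-machine code in the same recursive walk, without building a syntax tree. This is not C4's code: the opcode names borrow C4's mnemonics, but the numbering, the compile_expr entry point, and the helper names are assumptions made for illustration, and there is no error checking.

```c
// Toy single-pass compiler: lexing, parsing, and code emission interleaved.
// Handles non-negative integers, + - * /, and parentheses. No error checking.

enum { IMM, PSH, ADD, SUB, MUL, DIV, EXIT };  // illustrative numbering

static const char *src;  // current position in the source text
static int *out;         // next free slot in the output buffer

static void skip_ws(void) { while (*src == ' ') src++; }

// Binding strength of a binary operator; 0 means "not an operator".
static int prec(char c)
{
  if (c == '+' || c == '-') return 1;
  if (c == '*' || c == '/') return 2;
  return 0;
}

static void expr(int min_prec);

// A number or a parenthesized expression; leaves its value in the accumulator.
static void primary(void)
{
  int v;
  skip_ws();
  if (*src == '(') {
    src++;
    expr(1);
    skip_ws();
    if (*src == ')') src++;
    return;
  }
  v = 0;
  while (*src >= '0' && *src <= '9') v = v * 10 + (*src++ - '0');
  *out++ = IMM; *out++ = v;
}

// Precedence climbing: code is emitted as soon as each operator is seen.
static void expr(int min_prec)
{
  char op;
  primary();
  skip_ws();
  while (prec(*src) >= min_prec) {
    op = *src++;
    *out++ = PSH;            // save the left operand on the stack
    expr(prec(op) + 1);      // right operand ends up in the accumulator
    *out++ = op == '+' ? ADD : op == '-' ? SUB : op == '*' ? MUL : DIV;
    skip_ws();
  }
}

// Compiles text into code (caller supplies a large enough buffer)
// and returns the number of ints written, including the final EXIT.
int compile_expr(const char *text, int *code)
{
  src = text;
  out = code;
  expr(1);
  *out++ = EXIT;
  return (int)(out - code);
}
```

For the input "2 + 3 * 4" this emits IMM 2, PSH, IMM 3, PSH, IMM 4, MUL, ADD, EXIT. The multiplication is emitted before the addition purely because of how the recursion unwinds, which is the property that lets a compiler of this style avoid a separate tree-building pass.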

Unique Features of C4

Instead of emitting native machine code, C4 translates C source into a custom 39-instruction bytecode executed by a tiny stack-based virtual machine, much as Java and Lua compile to bytecode for their own virtual machines. Nine of the 39 instructions (OPEN, READ, CLOS, PRTF, MALC, FREE, MSET, MCMP, EXIT) map directly onto the built-in library calls.
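A minimal sketch of such a virtual machine is shown below, assuming a seven-opcode toy instruction set rather than C4's full 39. It follows the accumulator-plus-stack style that C4 uses (an operand is loaded into an accumulator, pushed, and later combined with the next value), but the run function, its signature, and the opcode numbering are hypothetical.

```c
// Toy accumulator-plus-stack VM in the style of C4's interpreter loop.
// Opcode numbering and the run() signature are illustrative assumptions.

enum { IMM, PSH, ADD, SUB, MUL, DIV, EXIT };

// Executes code until EXIT (or an unknown opcode) and returns the accumulator.
// *cycles receives the number of instructions executed.
int run(const int *pc, int *cycles)
{
  int stack[64];
  int *sp;
  int a;
  int op;

  sp = stack + 64;   // the stack grows downward; no overflow check in this sketch
  a = 0;
  *cycles = 0;
  while (1) {
    op = *pc++;
    ++*cycles;
    if      (op == IMM) a = *pc++;        // load immediate into the accumulator
    else if (op == PSH) *--sp = a;        // push the accumulator
    else if (op == ADD) a = *sp++ + a;    // pop the left operand, combine
    else if (op == SUB) a = *sp++ - a;
    else if (op == MUL) a = *sp++ * a;
    else if (op == DIV) a = *sp++ / a;
    else return a;                        // EXIT or anything unrecognized
  }
}
```

Feeding it the sequence IMM 2, PSH, IMM 3, PSH, IMM 4, MUL, ADD, EXIT returns 14 after 8 executed instructions. The counter is similar in spirit to the cycle figure in C4's exit line.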

The compiler scans the source only once, which, combined with the simple VM design, yields performance comparable to Lua’s interpreter.

Debugging is supported through two command-line flags: -s prints each source line together with the bytecode generated for it, and -d traces every instruction as the virtual machine executes it.

[Figure: C4 bytecode dump]
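C4's own source prints instruction names by indexing into one packed string of fixed-width mnemonics instead of keeping an array of strings. The sketch below reproduces that trick for a toy seven-opcode set; the mnemonic and dump functions and the listing layout are assumptions for illustration, not C4's actual output format.

```c
#include <stdio.h>

// Illustrative opcode set; only IMM carries an operand here.
enum { IMM, PSH, ADD, SUB, MUL, DIV, EXIT };

// Every entry occupies exactly 5 bytes: a 4-character name plus a comma,
// so opcode n starts at offset n * 5.
static const char names[] = "IMM ,PSH ,ADD ,SUB ,MUL ,DIV ,EXIT,";

// Returns a pointer to the 4-character mnemonic (not NUL-terminated on its own).
const char *mnemonic(int op)
{
  return &names[op * 5];
}

// Prints one line per instruction and returns how many were listed.
int dump(const int *code, int len)
{
  int i; int count;
  i = 0;
  count = 0;
  while (i < len) {
    printf("%4d: %.4s", i, mnemonic(code[i]));   // %.4s stops before the comma
    if (code[i] == IMM && i + 1 < len) {
      i = i + 1;
      printf(" %d", code[i]);                    // operand follows the opcode
    }
    printf("\n");
    i = i + 1;
    count = count + 1;
  }
  return count;
}
```

The packed string avoids any array-of-pointers declaration, which fits a language subset limited to char, int, and pointers.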

Conclusion

C4’s extreme minimalism makes it an excellent learning project for anyone interested in compiler construction. With only four functions and about 500 lines of code, it is approachable for beginners who have a basic understanding of compiler theory. The source is publicly available on GitHub for further exploration.
