How to Build a JavaScript Lexer for Arithmetic Expressions Using a Finite State Machine
This article explains how to implement a lexical analyzer in JavaScript that tokenizes simple arithmetic expressions by using a finite state machine, covering the conversion from infix notation to an abstract syntax tree, token definitions, state transitions, and complete source code examples.
0 Introduction
In the previous article the author introduced the process of obtaining an Abstract Syntax Tree (AST) – code → lexical analysis → syntax analysis → AST. This follow‑up dives into the lexical analysis stage and shows how to build a simple lexer that converts a four‑operator arithmetic expression into an AST.
1 Human and Computer Perspectives on Expressions
Humans prefer the infix form a + b or (a + b) * (c + d) because it is intuitive, but it requires many parentheses to express precedence. For computers, infix notation is complex, so we need to transform it into a tree structure (AST) for easier processing.
2 JavaScript and Abstract Syntax Trees (AST)
Almost every language generates a tree‑like intermediate representation during compilation or interpretation. While some languages expose this structure directly (e.g., Lisp, Elixir, Python), JavaScript does not, but we can implement the same process in JavaScript itself.
3 Lexical Analysis (Lexer)
Lexical analysis is similar to word segmentation: it scans the input string and produces a stream of meaningful tokens that are then fed to the parser.
3.1 Finite State Machine
Most language lexers are implemented with a finite state machine (FSM). The diagram below illustrates the overall FSM used in this lexer.
3.2 Start State
The initial state is start. When a digit (0‑9) is read, the machine transitions to inInt. When a dot (.) is read, it moves to inFloat. When an operator (+, -, *, /) is read, a SIGN token is emitted and the machine stays in start. When EOF is encountered, an EOF token is emitted and the machine returns to start.
3.3 In Integer (inInt) State
In inInt, digits keep the machine in the same state, a dot moves it to inFloat, and any non‑digit/non‑dot character causes a NUMBER token to be emitted and a transition back to start. The diagram below shows the inInt transitions.
3.4 In Float (inFloat) State
In inFloat, digits keep the machine in inFloat. A second dot triggers an error ("cannot have `..`"). Any non‑digit character causes a NUMBER token to be emitted, the buffer is cleared, and the machine returns to start. The diagram below illustrates the inFloat transitions.
3.5 Token Types and Definitions
The lexer produces four kinds of tokens: NUMBER (covers both integer and float literals), SIGN (operators), EOF, and the internal representation of a token:
interface Token {
type: String,
value: String,
}3.6 Complete Lexer Implementation
const EOF = Symbol('EOF');
class Lexer {
constructor(){
this.token = [];
this.tokens = [];
this.state = this.start;
}
start(char) {
if ("0123456789".includes(char)) { this.token.push(char); return this.inInt; }
if (char === ".") { this.token.push(char); return this.inFloat; }
if ("+-*/".includes(char)) { this.emmitToken("SIGN", char); return this.start; }
if (char === EOF) { this.emmitToken("EOF", EOF); return this.start; }
}
inInt(char) {
if ("0123456789".includes(char)) { this.token.push(char); return this.inInt; }
else if (char === ".") { this.token.push(char); return this.inFloat; }
else { this.emmitToken("NUMBER", this.token.join("")); this.token = []; return this.start(char); }
}
inFloat(char) {
if ("0123456789".includes(char)) { this.token.push(char); return this.inFloat; }
else if (char === ".") { throw new Error("cannot have `..`"); }
else {
if (this.token.length === 1 && this.token[0] === ".") throw new Error("cannot have solitary `.`");
this.emmitToken("NUMBER", this.token.join("")); this.token = []; return this.start(char);
}
}
emmitToken(type, value) { this.tokens.push({type, value}); }
push(char) { this.state = this.state(char); return this.check(); }
end() { this.state(EOF); return this.check(); }
check() { const _token = [...this.tokens]; this.tokens = []; return _token; }
clear() { this.token = []; this.tokens = []; this.state = this.start; }
}
const lexer = new Lexer();
const input = `1 + 2.3`;
let tokens = [];
for (let c of input.split('')) { tokens = [...tokens, ...lexer.push(c)]; }
tokens = [...tokens, ...lexer.end()];3.7 Summary
Using a finite state machine we have successfully implemented lexical analysis for simple arithmetic expressions; the next step will be syntax analysis to complete the AST construction.
WecTeam
WecTeam (维C团) is the front‑end technology team of JD.com’s Jingxi business unit, focusing on front‑end engineering, web performance optimization, mini‑program and app development, serverless, multi‑platform reuse, and visual building.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.
