Fundamentals 9 min read

How to Build a JavaScript Lexer for Arithmetic Expressions Using a Finite State Machine

This article explains how to implement a lexical analyzer in JavaScript that tokenizes simple arithmetic expressions by using a finite state machine, covering the conversion from infix notation to an abstract syntax tree, token definitions, state transitions, and complete source code examples.

WecTeam

Oct 24, 2019

How to Build a JavaScript Lexer for Arithmetic Expressions Using a Finite State Machine

0 Introduction

In the previous article the author introduced the process of obtaining an Abstract Syntax Tree (AST) – code → lexical analysis → syntax analysis → AST. This follow‑up dives into the lexical analysis stage and shows how to build a simple lexer that converts a four‑operator arithmetic expression into an AST.

1 Human and Computer Perspectives on Expressions

Humans prefer the infix form a + b or (a + b) * (c + d) because it is intuitive, but it requires many parentheses to express precedence. For computers, infix notation is complex, so we need to transform it into a tree structure (AST) for easier processing.

2 JavaScript and Abstract Syntax Trees (AST)

Almost every language generates a tree‑like intermediate representation during compilation or interpretation. While some languages expose this structure directly (e.g., Lisp, Elixir, Python), JavaScript does not, but we can implement the same process in JavaScript itself.

3 Lexical Analysis (Lexer)

Lexical analysis is similar to word segmentation: it scans the input string and produces a stream of meaningful tokens that are then fed to the parser.

3.1 Finite State Machine

Most language lexers are implemented with a finite state machine (FSM). The diagram below illustrates the overall FSM used in this lexer.

3.2 Start State

The initial state is start. When a digit (0‑9) is read, the machine transitions to inInt. When a dot (.) is read, it moves to inFloat. When an operator (+, -, *, /) is read, a SIGN token is emitted and the machine stays in start. When EOF is encountered, an EOF token is emitted and the machine returns to start.

3.3 In Integer (inInt) State

In inInt, digits keep the machine in the same state, a dot moves it to inFloat, and any non‑digit/non‑dot character causes a NUMBER token to be emitted and a transition back to start. The diagram below shows the inInt transitions.

3.4 In Float (inFloat) State

In inFloat, digits keep the machine in inFloat. A second dot triggers an error ("cannot have `..`"). Any non‑digit character causes a NUMBER token to be emitted, the buffer is cleared, and the machine returns to start. The diagram below illustrates the inFloat transitions.

3.5 Token Types and Definitions

The lexer produces four kinds of tokens: NUMBER (covers both integer and float literals), SIGN (operators), EOF, and the internal representation of a token:

interface Token {
  type: String,
  value: String,
}

3.6 Complete Lexer Implementation

const EOF = Symbol('EOF');

class Lexer {
  constructor(){
    this.token = [];
    this.tokens = [];
    this.state = this.start;
  }
  start(char) {
    if ("0123456789".includes(char)) { this.token.push(char); return this.inInt; }
    if (char === ".") { this.token.push(char); return this.inFloat; }
    if ("+-*/".includes(char)) { this.emmitToken("SIGN", char); return this.start; }
    if (char === EOF) { this.emmitToken("EOF", EOF); return this.start; }
  }
  inInt(char) {
    if ("0123456789".includes(char)) { this.token.push(char); return this.inInt; }
    else if (char === ".") { this.token.push(char); return this.inFloat; }
    else { this.emmitToken("NUMBER", this.token.join("")); this.token = []; return this.start(char); }
  }
  inFloat(char) {
    if ("0123456789".includes(char)) { this.token.push(char); return this.inFloat; }
    else if (char === ".") { throw new Error("cannot have `..`"); }
    else {
      if (this.token.length === 1 && this.token[0] === ".") throw new Error("cannot have solitary `.`");
      this.emmitToken("NUMBER", this.token.join("")); this.token = []; return this.start(char);
    }
  }
  emmitToken(type, value) { this.tokens.push({type, value}); }
  push(char) { this.state = this.state(char); return this.check(); }
  end() { this.state(EOF); return this.check(); }
  check() { const _token = [...this.tokens]; this.tokens = []; return _token; }
  clear() { this.token = []; this.tokens = []; this.state = this.start; }
}

const lexer = new Lexer();
const input = `1 + 2.3`;
let tokens = [];
for (let c of input.split('')) { tokens = [...tokens, ...lexer.push(c)]; }
tokens = [...tokens, ...lexer.end()];

3.7 Summary

Using a finite state machine we have successfully implemented lexical analysis for simple arithmetic expressions; the next step will be syntax analysis to complete the AST construction.

JavaScript AST Parsing tokenization Finite State Machine

Written by

WecTeam

WecTeam (维C团) is the front‑end technology team of JD.com’s Jingxi business unit, focusing on front‑end engineering, web performance optimization, mini‑program and app development, serverless, multi‑platform reuse, and visual building.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.