
Understanding MySQL 8.0.20 Lexical Analysis: State Machine, Debugging Process, and Optimizations

This article explains MySQL 8.0.20's lexical analysis, describing the lexer’s role, the state‑machine implementation, step‑by‑step debugging with LLDB on macOS, the init_state_maps function, key source code snippets, and the optimizations introduced in this version.

Xueersi Online School Tech Team

This note is part of a series on the MySQL 8.0.20 source code. It focuses on lexical analysis (the lexer) and touches on related components such as the parser and the optimizer.

What is lexical analysis?

Lexical analysis converts a character stream into a token stream. The lexer (or scanner) is a function called by the parser, often generated by tools like Lex. It is the first and essential phase of compilation, reading the source program character by character and recognizing tokens according to lexical rules.

Lexical analysis state machine

The state machine runs during the scanning phase. Images (omitted) illustrate how tokens are processed. The state machine decides what to do with each character. For example, the MY_LEX_IDENT state loops to consume an identifier and then returns the corresponding token.
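
The loop described above can be sketched in a few lines of C++. This is a simplified illustration under stated assumptions, not MySQL's code: the State enum and the classify and scan functions are hypothetical names, input is assumed to be ASCII, and only identifiers and one-character tokens are recognized (digits, strings, and comments are left out).

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical, reduced set of states; MySQL's my_lex_states has many more.
enum class State { Start, Ident, Char, End };

// Pick the next state from the first character of a token (ASCII only).
static State classify(unsigned char c) {
  if (c == '\0') return State::End;
  if (std::isalpha(c) || c == '_') return State::Ident;
  return State::Char;
}

// Scan the whole input and return the token texts in order.
std::vector<std::string> scan(const std::string &input) {
  std::vector<std::string> tokens;
  std::size_t pos = 0;
  std::size_t token_start = 0;
  State state = State::Start;
  for (;;) {
    switch (state) {
      case State::Start:
        // Skip whitespace, then let the first character choose the state.
        while (pos < input.size() &&
               std::isspace(static_cast<unsigned char>(input[pos])))
          ++pos;
        token_start = pos;
        state = pos < input.size()
                    ? classify(static_cast<unsigned char>(input[pos]))
                    : State::End;
        break;
      case State::Ident:
        // Loop while the current character can continue an identifier.
        while (pos < input.size() &&
               (std::isalnum(static_cast<unsigned char>(input[pos])) ||
                input[pos] == '_'))
          ++pos;
        tokens.push_back(input.substr(token_start, pos - token_start));
        state = State::Start;
        break;
      case State::Char:
        // Any other character becomes a one-character token.
        tokens.push_back(std::string(1, input[pos++]));
        state = State::Start;
        break;
      case State::End:
        return tokens;
    }
  }
}
```

Calling scan("SELECT * FROM t1;") with this sketch yields five tokens: SELECT, *, FROM, t1, and the semicolon.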

Debugging the lexer

Using LLDB on macOS, the author launches two terminals: one for MySQL commands and another for LLDB. By setting breakpoints in MYSQLlex and lex_one_token, the execution flow is observed: the lexer starts with MY_LEX_START, calls yyPeek to fetch characters, skips whitespace, and transitions to appropriate states (e.g., MY_LEX_IDENT, MY_LEX_CHAR). Screenshots (omitted) show the call chain and state transitions.

state_map implementation

The core of the state machine is the init_state_maps function defined in mysys/sql_chars.cc. It initializes state_map and ident_map for all 256 possible byte values, assigning states such as MY_LEX_IDENT, MY_LEX_NUMBER_IDENT, MY_LEX_SKIP, MY_LEX_CHAR, and special symbols (e.g., '_', '$', '\'', '.', '>', '=', '!', '<', '&', '|', '#', ';', ':', '/', '*', '@', '`', '"'). The second map (ident_map) speeds up identifier detection.

bool init_state_maps(CHARSET_INFO *cs) {
  uint i;
  uchar *ident_map;
  enum my_lex_states *state_map = nullptr;
  lex_state_maps_st *lex_state_maps = (lex_state_maps_st *)my_once_alloc(
      sizeof(lex_state_maps_st), MYF(MY_WME));
  if (lex_state_maps == nullptr) return true; // OOM
  cs->state_maps = lex_state_maps;
  state_map = lex_state_maps->main_map;
  if (!(cs->ident_map = ident_map = (uchar *)my_once_alloc(256, MYF(MY_WME))))
    return true; // OOM
  hint_lex_init_maps(cs, lex_state_maps->hint_map);
  for (i = 0; i < 256; i++) {
    if (my_isalpha(cs, i))
      state_map[i] = MY_LEX_IDENT;
    else if (my_isdigit(cs, i))
      state_map[i] = MY_LEX_NUMBER_IDENT;
    else if (my_ismb1st(cs, i))
      state_map[i] = MY_LEX_IDENT;
    else if (my_isspace(cs, i))
      state_map[i] = MY_LEX_SKIP;
    else
      state_map[i] = MY_LEX_CHAR;
  }
  state_map[(uchar)'_'] = state_map[(uchar)'$'] = MY_LEX_IDENT;
  state_map[(uchar)'\''] = MY_LEX_STRING;
  state_map[(uchar)'.'] = MY_LEX_REAL_OR_POINT;
  state_map[(uchar)'>'] = state_map[(uchar)'='] = state_map[(uchar)'!'] = MY_LEX_CMP_OP;
  state_map[(uchar)'<'] = MY_LEX_LONG_CMP_OP;
  state_map[(uchar)'&'] = state_map[(uchar)'|'] = MY_LEX_BOOL;
  state_map[(uchar)'#'] = MY_LEX_COMMENT;
  state_map[(uchar)';'] = MY_LEX_SEMICOLON;
  state_map[(uchar)':'] = MY_LEX_SET_VAR;
  state_map[0] = MY_LEX_EOL;
  state_map[(uchar)'/'] = MY_LEX_LONG_COMMENT;
  state_map[(uchar)'*'] = MY_LEX_END_LONG_COMMENT;
  state_map[(uchar)'@'] = MY_LEX_USER_END;
  state_map[(uchar)'`'] = MY_LEX_USER_VARIABLE_DELIMITER;
  state_map[(uchar)'"'] = MY_LEX_STRING_OR_DELIMITER;
  for (i = 0; i < 256; i++) {
    ident_map[i] = (uchar)(state_map[i] == MY_LEX_IDENT ||
                           state_map[i] == MY_LEX_NUMBER_IDENT);
  }
  return false;  // success
}
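
To show why the two tables make classification cheap, here is a minimal standalone sketch of the same idea. It is an assumption-laden reduction, not the MySQL implementation: LexState, LexMaps, build_lex_maps, and ident_length are hypothetical names, the standard C character functions stand in for the charset-aware my_isalpha family, and only a handful of states are modelled.

```cpp
#include <array>
#include <cctype>
#include <string>

// Hypothetical, reduced state set for illustration only.
enum class LexState : unsigned char { Char, Ident, NumberIdent, Skip, Eol };

struct LexMaps {
  std::array<LexState, 256> state_map{};       // byte value -> first state
  std::array<unsigned char, 256> ident_map{};  // 1 if byte may be in an identifier
};

// Build both tables once; afterwards classifying a byte is one array lookup.
LexMaps build_lex_maps() {
  LexMaps maps;
  for (int i = 0; i < 256; ++i) {
    if (std::isalpha(i))
      maps.state_map[i] = LexState::Ident;
    else if (std::isdigit(i))
      maps.state_map[i] = LexState::NumberIdent;
    else if (std::isspace(i))
      maps.state_map[i] = LexState::Skip;
    else
      maps.state_map[i] = LexState::Char;
  }
  // Overrides for bytes that need special treatment.
  maps.state_map['_'] = maps.state_map['$'] = LexState::Ident;
  maps.state_map[0] = LexState::Eol;
  // Second table: answers "can this byte be part of an identifier?" directly.
  for (int i = 0; i < 256; ++i) {
    maps.ident_map[i] = (maps.state_map[i] == LexState::Ident ||
                         maps.state_map[i] == LexState::NumberIdent);
  }
  return maps;
}

// Length of the identifier that starts at `pos`, using only ident_map.
std::size_t ident_length(const LexMaps &maps, const std::string &s,
                         std::size_t pos = 0) {
  std::size_t n = 0;
  while (pos + n < s.size() &&
         maps.ident_map[static_cast<unsigned char>(s[pos + n])])
    ++n;
  return n;
}
```

The design point is that the per-character branching happens once, while the tables are built. The scanning loop itself only indexes an array, which is what keeps identifier matching fast.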

Key lexer function: lex_one_token

This function reads characters from the input stream, uses state_map to drive the state machine, handles whitespace, identifiers, numbers, strings, and comments, and returns the appropriate token value. Important snippets are shown below.

static int lex_one_token(Lexer_yystype *yylval, THD *thd) {
  uchar c = 0;
  bool comment_closed;
  int tokval, result_state;
  uint length;
  enum my_lex_states state;
  Lex_input_stream *lip = &thd->m_parser_state->m_lip;
  const CHARSET_INFO *cs = thd->charset();
  const my_lex_states *state_map = cs->state_maps->main_map;
  const uchar *ident_map = cs->ident_map;
  lip->yylval = yylval;
  lip->start_token();
  state = lip->next_state;
  lip->next_state = MY_LEX_START;
  for (;;) {
    switch (state) {
      case MY_LEX_START:
        while (state_map[c = lip->yyPeek()] == MY_LEX_SKIP) {
          if (c == '\n') lip->yylineno++;
          lip->yySkip();
        }
        lip->restart_token();
        c = lip->yyGet();
        state = state_map[c];
        break;
      case MY_LEX_IDENT:
        // identifier handling (omitted for brevity)
        break;
      case MY_LEX_EOL:
        if (lip->eof()) {
          lip->yyUnget();
          lip->set_echo(false);
          lip->yySkip();
          lip->set_echo(true);
          if (lip->in_comment != NO_COMMENT) return ABORT_SYM;
          lip->next_state = MY_LEX_END;
          return END_OF_INPUT;
        }
        break;
      // other cases omitted
    }
  }
}
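
The MY_LEX_START branch above relies on three stream operations: peek at the next byte, skip it, or consume it. The following sketch imitates that pattern on a toy stream. MiniStream and skip_to_token are hypothetical names introduced for illustration, std::isspace stands in for the state_map lookup, and the real Lex_input_stream does considerably more (echoing, token boundaries, comment tracking).

```cpp
#include <cctype>
#include <string>
#include <utility>

// Hypothetical stand-in for Lex_input_stream; models only peek/skip/get.
class MiniStream {
 public:
  explicit MiniStream(std::string text) : text_(std::move(text)) {}

  // Look at the next byte without consuming it; returns 0 at end of input.
  unsigned char peek() const {
    return pos_ < text_.size() ? static_cast<unsigned char>(text_[pos_]) : 0;
  }
  // Discard the next byte, if there is one.
  void skip() {
    if (pos_ < text_.size()) ++pos_;
  }
  // Consume and return the next byte; returns 0 at end of input.
  unsigned char get() {
    unsigned char c = peek();
    skip();
    return c;
  }

  int line = 1;  // current line number, updated by the caller

 private:
  std::string text_;
  std::size_t pos_ = 0;
};

// Mirrors the START branch: skip whitespace while counting newlines, then
// consume and return the first significant byte (0 if input is exhausted).
unsigned char skip_to_token(MiniStream &in) {
  unsigned char c;
  while (std::isspace(c = in.peek())) {
    if (c == '\n') ++in.line;
    in.skip();
  }
  return in.get();
}
```

Peeking before skipping matters here: the loop can stop on the first non-whitespace byte without having consumed it, so the token start can be recorded before that byte is read.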

Optimizations in MySQL 8.0.20

Compared with MySQL 5.x, the lexer no longer performs an extra MY_LEX_OPERATOR_OR_IDENT step, reducing the number of state transitions during tokenization of statements like SELECT * FROM t1;. A diagram (omitted) illustrates the streamlined flow.

Summary

The lexer states (enum my_lex_states) are defined in /mysql-8.0.20/include/sql_chars.h.

Fast character matching comes from the lookup tables (state_map and ident_map) that init_state_maps fills in ahead of time.

Token definitions can be found in /mysql-8.0.20/sql/sql_yacc.h.

Some characters, such as '*', are returned directly as their ASCII values.
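
The last point can be illustrated with a small sketch. The token codes below are hypothetical placeholders; the real values live in the generated sql_yacc.h. The only property assumed here is the usual yacc-style convention that named tokens are numbered above 255, so a single byte can be returned as its own token value without colliding with them. Keyword matching is case-sensitive purely for brevity.

```cpp
#include <cctype>
#include <string>

// Hypothetical token codes, chosen above 255 so they never clash with a byte.
enum Token : int { TOK_SELECT = 258, TOK_FROM = 259, TOK_IDENT = 260 };

// Map a token text to the integer handed to the parser.
int token_value(const std::string &text) {
  if (text == "SELECT") return TOK_SELECT;
  if (text == "FROM") return TOK_FROM;
  // A lone punctuation character is returned as its own character code.
  if (text.size() == 1 &&
      !std::isalnum(static_cast<unsigned char>(text[0])))
    return static_cast<unsigned char>(text[0]);
  return TOK_IDENT;
}
```

With these placeholder codes, token_value("*") is 42, the ASCII code of the asterisk, while token_value("SELECT") is a value above 255.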
