Databases 23 min read

Inside MySQL InnoDB Full-Text Index: Architecture, Operations, and Optimization

This article examines MySQL’s InnoDB full‑text indexing from MySQL 5.6 onward, detailing supported search modes, the structure of auxiliary index files, the lifecycle of DML operations, transaction handling, cache synchronization, optimization procedures, background threads, monitoring tables, stop‑word configuration, and the built‑in n‑gram parser.

21CTO
21CTO
21CTO
Inside MySQL InnoDB Full-Text Index: Architecture, Operations, and Optimization
Abstract: Since MySQL 5.6, InnoDB supports full‑text indexes, largely compatible with MyISAM syntax. Full‑text index uses an inverted index to quickly match documents.

Preface

MySQL 5.6 introduced InnoDB full‑text indexing. The syntax is mostly compatible with the previous MyISAM full‑text mode. Full‑text indexing builds an inverted index for fast document matching. MySQL supports three full‑text search modes:

1. Natural language mode (IN NATURAL LANGUAGE MODE) – use MATCH AGAINST with a specific string.

2. Boolean mode (IN BOOLEAN MODE) – allows operators such as "+" (must contain), "-" (must not contain), "*" (wildcard). The syntax can be inspected with the variable ft_boolean_syntax:

mysql> show variables like '%ft_boolean_syntax%';
+-------------------+----------------+
| Variable_name     | Value          |
+-------------------+----------------+
| ft_boolean_syntax | + -<>()~*:""&| |
+-------------------+----------------+
1 row in set (0.00 sec)

3. Query expansion mode (WITH QUERY EXPANSION) – a variant of natural language mode that runs two searches: first with the given phrase, then with rows that have high relevance.

Full‑text indexes can be defined on CHAR, VARCHAR, and TEXT columns. This article provides a brief analysis of the code modules involved and highlights new features in MySQL 5.7, based on source version 5.7.8‑rc.

Create Full‑Text Index

Example of creating a table with a full‑text index:

create table t1 (a int auto_increment primary key, b text, fulltext(b));

On disk the following files are generated:

$ls -lh /u01/my57/data/test/
 total 1.3M
 FTS_000000000000010b_0000000000000154_INDEX_1.ibd
 FTS_000000000000010b_0000000000000154_INDEX_2.ibd
 FTS_000000000000010b_0000000000000154_INDEX_3.ibd
 FTS_000000000000010b_0000000000000154_INDEX_4.ibd
 FTS_000000000000010b_0000000000000154_INDEX_5.ibd
 FTS_000000000000010b_0000000000000154_INDEX_6.ibd
 FTS_000000000000010b_BEING_DELETED_CACHE.ibd
 FTS_000000000000010b_BEING_DELETED.ibd
 FTS_000000000000010b_CONFIG.ibd
 FTS_000000000000010b_DELETED_CACHE.ibd
 FTS_000000000000010b_DELETED.ibd
 t1.frm
 t1.ibd

a) Files FTS_..._INDEX_1~6.ibd store the inverted index (tokens, positions, document IDs) partitioned by the first character of the token.

File naming rule: FTS_{TABLE_ID}_{INDEX_ID}_INDEX_{N}.ibd b) FTS_..._DELETED.ibd holds document IDs that have been deleted but not yet removed from the index; FTS_..._DELETED_CACHE.ibd is its memory cache.

c) FTS_..._BEING_DELETED_CACHE.ibd and FTS_..._BEING_DELETED.ibd contain document IDs that are being removed; the former is the in‑memory version.

d) FTS_..._CONFIG.ibd stores internal index information, most importantly FTS_SYNCED_DOC_ID, which indicates the highest document ID flushed to disk.

Full‑text index auxiliary functions (simplified):

ha_innobase::create
    |---> create_table_info_t::create_table
            |---> fts_create_common_tables

When creating a full‑text index on an existing table, InnoDB uses multiple threads for concurrent construction, controlled by the parameter innodb_ft_sort_pll_degree. During restore, it is recommended to create the table, load data, then add the full‑text index.

Reference functions: row_merge_read_clustered_indexrow_fts_start_psort; thread callback is fts_parallel_tokenization.

When a table has a full‑text index, an implicit column FTS_DOC_ID is created with a unique index to identify token occurrences. Users may create an explicit FTS_DOC_ID column, but its type must match the implicit one.

The full‑text module defines many classes for management (illustrated below):

FTS architecture diagram
FTS architecture diagram

Regular DML and Query Operations

Insert

Insert a row and inspect the token cache:

mysql> insert into t1 values (NULL, 'hello, welcome to mysql world');
Query OK, 1 row affected (1.87 sec)

mysql> set global innodb_ft_aux_table = 'test/t1';
Query OK, 0 rows affected (0.00 sec)

mysql> select * from INNODB_FT_INDEX_CACHE;
+------+----------------+-----------+----------+--------+----------+
| WORD | FIRST_DOC_ID   | LAST_DOC_ID| DOC_COUNT| DOC_ID | POSITION |
+------+----------------+-----------+----------+--------+----------+
| hello| 2              | 2         | 1        | 2      | 0        |
| mysql| 2              | 2         | 1        | 2      | 18       |
| welcome|2            | 2         | 1        | 2      | 7        |
| world| 2              | 2         | 1        | 2      | 24       |
+------+----------------+-----------+----------+--------+----------+

Insertion stack:

row_insert_for_mysql
    |---> row_insert_for_mysql_using_ins_graph
            |---> fts_trx_add_op  // state = FTS_INSERT

After inserting a row, InnoDB checks for a full‑text index (DICT_TF2_FTS). If present, it extracts the document ID via fts_get_doc_id_from_row and caches it in the transaction object.

Delete

Delete operations do not immediately remove entries from the full‑text index, so tokens remain visible in INNODB_FT_INDEX_CACHE:

ha_innobase::delete_row
    |- -> row_update_for_mysql
            |- -> row_update_for_mysql_using_upd_graph
                |- -> row_fts_update_or_delete
                    |- -> fts_trx_add_op // state = FTS_DELETE

Update

Updating a non‑indexed column does not affect FTS_DOC_ID. Updating an indexed column causes the old document to be deleted and a new one inserted:

ha_innobase::update_row
    |---> row_update_for_mysql
            |---> row_update_for_mysql_using_upd_graph
                |---> row_fts_update_or_delete
                    |---> row_fts_do_update
                        |---> fts_trx_add_op // state = FTS_DELETE
                        |---> fts_trx_add_op // state = FTS_INSERT

All DML operations funnel through fts_trx_add_op, which records either FTS_INSERT or FTS_DELETE. The affected document IDs are stored in trx->fts_trx->savepoints for global handling and in trx->fts_trx->last_stmt for the current statement; the former is processed at transaction end, the latter cleared after the statement.

Query

Full‑text queries use a new interface in two steps:

1. Collect matching document IDs based on the search terms.

JOIN::optimize
    |- -> init_ftfuncs
        |- -> Item_func_match::init_search
            |- -> ha_innobase::ft_init_ext
                |- -> fts_query

During collection, IDs from the DELETED table are read first and later used as filters.

2. Retrieve the rows using the hidden unique index on FTS_DOC_ID:

sub_select
    |---> join_ft_read_first
            |---> ha_innobase::ft_init
            |---> ha_innobase::ft_read
    |---> join_ft_read_next
            |---> ha_innobase::ft_read

Results are ranked using a BM25/TF‑IDF algorithm similar to Sphinx. Ranking formula:

IDF = log10(total_records / matching_records)
TF = term frequency in a document
rank = TF * IDF * IDF

Functions involved: fts_query_calculate_idf and fts_query_calculate_ranking. When multiple terms match, their ranks are summed.

Transaction Operations

Rollback within a transaction

innobase_rollback --> trx_rollback_last_sql_stat_for_mysql
    |---> fts_savepoint_rollback_last_stmt
        |---> fts_undo_last_stmt
    |---> trx_mark_sql_stat_end
        |---> fts_savepoint_laststmt_refresh

Rollback to a savepoint

innobase_rollback_to_savepoint  |---> fts_savepoint_rollback

Transaction commit

trx_commit_low
    |- -> fts_commit  // handle trx->fts_trx->savepoints
        |- -> fts_commit_table
            |- -> fts_add
            |- -> fts_add_doc_by_id
            |- -> fts_delete
    |- -> trx_commit_in_memory
        |- -> trx_finalize_for_fts
            |- -> trx_finalize_for_fts_table

During commit, fts_add adds new index entries and fts_delete removes obsolete ones.

To reduce write amplification, InnoDB caches insert operations in memory (size controlled by innodb_ft_cache_size). When the cache exceeds the limit or the global limit ( innodb_ft_total_cache_size) is reached, a sync flushes data to disk.

Avoid duplicate storage of identical words

Cache size is controlled by innodb_ft_cache_size Queries merge cache data with on‑disk data

Steps performed by fts_add_doc_by_id:

Locate the record using its doc_id via the hidden FTS_DOC_ID index.

Iterate over the clustered index, build an fts_doc_t, and tokenize the document (using fts_tokenize_document or fts_tokenize_document_next).

Add the resulting tokens to the cache via fts_cache_add_doc.

If the cache exceeds innodb_ft_cache_size (or the global limit), invoke fts_sync to write the cache to the full‑text files and then clear the cache.

Delete operations similarly create entries in the DELETED auxiliary table via fts_delete.

Key transaction‑related classes (illustrated):

Transaction FTS classes
Transaction FTS classes

Sync Cache

Full‑text indexes sync to disk under several conditions:

Cache memory exceeds the configured limit.

The background thread fts_optimize_thread shuts down, triggering a sync for all tables. ha_innobase::optimize is called (OPTIMIZE TABLE). row_merge_read_clustered_index creates a temporary table and initiates a sync.

The sync entry point is fts_sync, which processes each index via fts_sync_index. The index file is written using fts_write_node, and the node structure is shown below:

FTS node structure
FTS node structure

After syncing, fts_sync_commit updates the CONFIG table with the highest synced doc ID, moves any pending deletions from the cache to the DELETED_CACHE table, and clears the cache.

Optimize Table

When configuration changes (e.g., token size) require rebuilding the full‑text index, OPTIMIZE TABLE can be used. The parameter innodb_optimize_fulltext_only limits the operation to the full‑text index only.

ha_innobase::optimize
    |---> fts_sync_table
    |---> fts_optimize_table

During fts_optimize_table the following steps occur:

If the BEING_DELETED table is empty, snapshot the DELETED table into BEING_DELETED and BEING_DELETED_CACHE using SQL like:

static const char* fts_init_delete_sql = "BEGIN
"
"
"
"INSERT INTO $BEING_DELETED
"
"SELECT doc_id FROM $DELETED;
"
"
"
"INSERT INTO $BEING_DELETED_CACHE
"
"SELECT doc_id FROM $DELETED_CACHE;
";

Read the snapshot of deleted doc IDs from BEING_DELETED / BEING_DELETED_CACHE (function fts_optimize_read_deleted_doc_id_snapshot).

Optimize each index ( fts_optimize_index) which reads words, removes deleted doc IDs, and rewrites the index.

After all indexes are processed, purge the snapshot tables using SQL such as:

static const char* fts_delete_doc_ids_sql = "BEGIN
"
"
"
"DELETE FROM $DELETED WHERE doc_id = :doc_id1;
"
"DELETE FROM $DELETED_CACHE WHERE doc_id = :doc_id2;
";
static const char* fts_end_delete_sql = "BEGIN
"
"
"
"DELETE FROM $BEING_DELETED;
"
"DELETE FROM $BEING_DELETED_CACHE;
";

Background Thread

InnoDB starts a background thread fts_optimize_thread with work queue fts_optimize_wq. It adds entries to the queue when a table with a full‑text index is created ( FTS_MSG_ADD_TABLE) or when a table is dropped/altered ( FTS_MSG_DEL_TABLE). When the number of deleted rows exceeds a threshold (default 10 million) and at least 300 seconds have passed, the thread triggers an automatic OPTIMIZE TABLE.

Monitoring

Full‑text index status can be monitored via INFORMATION_SCHEMA tables:

mysql> show tables like '%ft%';
+-------------------------------------+
| Tables_in_information_schema (%ft%) |
+-------------------------------------+
| INNODB_FT_CONFIG                     |
| INNODB_FT_BEING_DELETED              |
| INNODB_FT_DELETED                   |
| INNODB_FT_DEFAULT_STOPWORD          |
| INNODB_FT_INDEX_TABLE                |
| INNODB_FT_INDEX_CACHE                |
+-------------------------------------+

Set the variable innodb_ft_aux_table to "dbname/tablename" before querying these tables.

Full‑Text Stop Words

Stop‑word lists are stored in INNODB_FT_DEFAULT_STOPWORD. Users can create a custom stop‑word table (identical schema) and set innodb_ft_server_stopword_table to point to it. Session‑level overrides are possible with innodb_ft_user_stopword_table. Token length limits are controlled by innodb_ft_min_token_size and innodb_ft_max_token_size; when using the n‑gram parser, ngram_token_size applies.

InnoDB Full‑Text Index Plugins

Since MySQL 5.7.3, InnoDB supports full‑text index plugins. Plugins can replace the built‑in parser or act as collaborators, allowing custom tokenization rules for specialized domains.

InnoDB N‑gram Parser

MySQL 5.7.6 introduced a built‑in n‑gram parser for CJK character sets. It can be enabled with WITH PARSER ngram:

mysql> create table ft_test(id int, content text, fulltext (content) with parser ngram);
Query OK, 0 rows affected (0.26 sec)

Example tokenization of the string "abcd" with ngram_token_size=2:

N=1: 'a', 'b', 'c', 'd';
N=2: 'ab', 'bc', 'cd';
N=3: 'abc', 'bcd';
N=4: 'abcd';

Unlike the built‑in parser, the n‑gram parser treats any token containing a stop word as non‑indexable, and spaces are always considered stop words.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Sign in to view source
Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactadmin@besthub.devand we will review it promptly.

optimizationInnoDBmysqlFull‑Text SearchDatabase InternalsN-gram Parser
21CTO
Written by

21CTO

21CTO (21CTO.com) offers developers community, training, and services, making it your go‑to learning and service platform.

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.