Master MySQL Full‑Text Search: Index Creation, Modes, and Internals
This tutorial explains how MySQL implements full‑text search, covering the creation of full‑text indexes (including Chinese ngram support), the three query modes (natural language, boolean, and query expansion), relevance ranking, underlying inverted‑index structures, cache handling, and common DML operations.
This article introduces how to implement a search function using MySQL, covering practical steps and underlying principles.
Full‑Text Index Introduction
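InnoDB has supported full‑text indexes since MySQL 5.6; before that, they were available only with MyISAM. The built‑in ngram parser, added in MySQL 5.7.6, extends full‑text search to Chinese, Japanese, and Korean. The default parser splits text on whitespace and punctuation, so it cannot tokenize languages that do not separate words with delimiters. Full‑text indexes can be created on CHAR, VARCHAR, and TEXT columns.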
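To make the ngram idea concrete, here is a minimal Python sketch of bigram tokenization. It is a toy model, not MySQL's parser: the function name ngram_tokens is hypothetical, the window size of 2 is assumed to match the usual ngram_token_size default, and stopword and punctuation handling are left out.

```python
def ngram_tokens(text: str, n: int = 2) -> list[str]:
    """Split text into overlapping n-character tokens (toy model of an ngram parser)."""
    tokens = []
    # Whitespace separates runs of characters; a token never spans a space.
    for run in text.split():
        # Slide a window of n characters across the run.
        # Runs shorter than n produce no tokens at all.
        for i in range(len(run) - n + 1):
            tokens.append(run[i:i + n])
    return tokens


print(ngram_tokens("数据库管理"))  # ['数据', '据库', '库管', '管理']
print(ngram_tokens("ab cd"))       # ['ab', 'cd']
```

Because every adjacent pair of characters becomes a token, a search for 数据 can match 数据库管理 without the parser knowing where Chinese words begin or end.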
Full‑Text Index Usage
MySQL provides three full‑text search modes:
Natural language mode : use MATCH ... AGAINST with a plain string.
Boolean mode : supports operators such as + (must contain), - (must not contain), > (increase relevance), < (decrease relevance), * (trailing wildcard), ~ (negative relevance contribution), and double‑quoted phrases. The current operator set can be viewed with:
mysql> show variables like '%ft_boolean_syntax%';
+-------------------+----------------+
| Variable_name     | Value          |
+-------------------+----------------+
| ft_boolean_syntax | + -><()~*:""&| |
+-------------------+----------------+
Query expansion mode : useful for very short keywords; the engine runs a second search using terms that were highly relevant in the first pass.
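Returning to boolean mode, the effect of the + and - operators can be modeled with a few lines of Python. This is a simplified sketch under stated assumptions: boolean_match is a hypothetical helper, documents are represented as plain sets of tokens, and the relevance operators (>, <, ~), wildcards, and phrases are ignored.

```python
def boolean_match(query: str, doc_tokens: set[str]) -> bool:
    """Return True if a document's tokens satisfy a +term / -term / term query."""
    required, excluded, optional = [], [], []
    for term in query.split():
        if term.startswith("+"):
            required.append(term[1:])   # must be present
        elif term.startswith("-"):
            excluded.append(term[1:])   # must be absent
        else:
            optional.append(term)       # may be present

    # Any excluded term disqualifies the document.
    if any(t in doc_tokens for t in excluded):
        return False
    # If there are required terms, all of them must appear.
    if required:
        return all(t in doc_tokens for t in required)
    # Otherwise at least one optional term must appear.
    return any(t in doc_tokens for t in optional)


doc_a = {"数据", "据库", "库管", "管理"}   # tokens of a row mentioning 数据库管理
doc_b = {"数据", "据库"}                   # tokens of a row mentioning 数据库
print(boolean_match("+数据 -管理", doc_a))  # False
print(boolean_match("+数据 -管理", doc_b))  # True
```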
Example table creation with a full‑text index using the ngram parser:
mysql> create table articles(
    id int auto_increment primary key,
    title varchar(200),
    body text,
    fulltext(title, body) with parser ngram
);
Inserting Chinese text may fail with a character‑set error (e.g., ERROR 1366 (HY000): Incorrect string value) if the table's character set cannot store it. The fix is to convert the table to UTF‑8 (utf8mb4 covers the full Unicode range; the older utf8 alias is limited to 3‑byte characters):
alter table articles convert to charset utf8mb4;
Sample queries for each mode:
-- Natural language mode
select * from articles where match(title, body) against('MySQL数据库' in natural language mode);
-- Boolean mode
select * from articles where match(title, body) against('+数据 -管理' in boolean mode);
-- Query expansion mode
select * from articles where match(title, body) against('database' with query expansion);
Relevance is calculated from four factors: whether the search term is present, the term frequency, the number of indexed columns containing the term, and the number of documents containing the term.
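As a rough illustration of how term frequency and document count combine, the MySQL reference manual describes InnoDB's ranking for a single search word as TF × IDF × IDF, with IDF = log10(total rows / matching rows). The Python sketch below reproduces only that arithmetic; innodb_style_rank is a hypothetical name, and real scores for multi-word or multi-column queries may differ.

```python
import math


def innodb_style_rank(tf: int, total_docs: int, matching_docs: int) -> float:
    """Single-term relevance: term frequency times inverse document frequency squared."""
    # IDF grows as the term becomes rarer across the table.
    idf = math.log10(total_docs / matching_docs)
    return tf * idf * idf


# A term appearing 6 times in the only matching row out of 8 rows.
print(round(innodb_style_rank(6, 8, 1), 2))  # 4.89
# A term that appears in every row contributes nothing.
print(innodb_style_rank(3, 8, 8))            # 0.0
```

The second call shows why a word that appears in every row does little to rank results: its IDF is zero.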
Underlying Implementation
The full‑text index is implemented as an inverted index. When a table has a full‑text index, InnoDB implicitly adds a hidden FTS_DOC_ID column (unless the table already defines one) that gives each row a document ID. The inverted index maps each token to the document IDs in which it appears, which enables fast lookup of documents by keyword.
The inverted index stores each word together with a list of (doc_id:position) pairs. The index data is kept in auxiliary tables on disk and in an in‑memory FTS Index Cache (a red‑black tree sorted by (word, ilist)).
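The word-to-postings structure described above can be sketched in Python as follows. This is an illustration only: build_inverted_index and the sample sentence are hypothetical, a sorted dict stands in for the red‑black tree, and the minimum token length of 3 is an assumption mirroring the usual default for the built‑in parser.

```python
import re
from collections import defaultdict

MIN_TOKEN_SIZE = 3  # assumption: words shorter than this are not indexed


def build_inverted_index(docs: dict[int, str]) -> dict[str, list[tuple[int, int]]]:
    """Map each word to a list of (doc_id, position) pairs, sorted by word."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Each match is one word; m.start() is its character offset in the text.
        for m in re.finditer(r"\w+", text.lower()):
            word = m.group()
            if len(word) >= MIN_TOKEN_SIZE:
                index[word].append((doc_id, m.start()))
    # Sort by word to mimic an ordered tree.
    return dict(sorted(index.items()))


index = build_inverted_index({2: "hello, welcome to mysql world"})
for word, postings in index.items():
    print(word, postings)
# hello [(2, 0)]
# mysql [(2, 18)]
# welcome [(2, 7)]
# world [(2, 24)]
```

Looking up a keyword is then a single dictionary access that returns every document and offset where the word occurs.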
FTS Index Cache
During transaction commit, tokens are written to the cache.
The cache is flushed to auxiliary tables in batches to improve performance.
Each table's cache has a size limit set by innodb_ft_cache_size (default 8,000,000 bytes); when the cache fills, its contents are flushed to the auxiliary tables.
On shutdown, the cache is synchronized to disk.
After a crash, tokens that were in the cache but not yet flushed are recovered by re‑tokenizing the affected documents the next time the full‑text index is used.
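The cache behavior listed above can be modeled with a small Python class. It is a sketch under assumptions, not InnoDB's implementation: FtsCacheSketch and its methods are hypothetical names, the size limit counts postings rather than bytes, and plain dicts stand in for the cache and the on‑disk auxiliary tables.

```python
from collections import defaultdict


class FtsCacheSketch:
    """Toy model: tokens enter a cache at commit and are flushed in batches."""

    def __init__(self, max_postings: int = 4):
        self.max_postings = max_postings   # stand-in for a byte-size limit
        self.cache = defaultdict(list)     # word -> [(doc_id, position)]
        self.aux = defaultdict(list)       # stand-in for on-disk auxiliary tables
        self.flushes = 0

    def commit(self, doc_id: int, tokens: list[tuple[str, int]]) -> None:
        # Tokens are added to the cache when the transaction commits.
        for word, pos in tokens:
            self.cache[word].append((doc_id, pos))
        # A full cache triggers a batch flush.
        if sum(len(p) for p in self.cache.values()) >= self.max_postings:
            self.sync()

    def sync(self) -> None:
        # Move everything from the cache to the auxiliary store in one batch.
        for word, postings in self.cache.items():
            self.aux[word].extend(postings)
        self.cache.clear()
        self.flushes += 1

    def lookup(self, word: str) -> list[tuple[int, int]]:
        # A search has to merge flushed entries with those still in the cache.
        return self.aux.get(word, []) + self.cache.get(word, [])


fts = FtsCacheSketch(max_postings=3)
fts.commit(1, [("hello", 0), ("world", 6)])  # 2 postings: stays in the cache
fts.commit(2, [("hello", 0)])                # 3 postings: triggers a flush
print(fts.flushes, fts.lookup("hello"))      # 1 [(1, 0), (2, 0)]
```

Batching means many small inserts cost one write to the auxiliary store instead of one write per row.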
Auxiliary Tables
Auxiliary tables store the inverted index on disk. Files such as FTS_000000000000005e_0000000000000087_INDEX_1.ibd … INDEX_6.ibd correspond to the six auxiliary tables that hold the word‑to‑document mappings.
DML Operations
Insert : tokens are added to the cache and later flushed to auxiliary tables.
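Delete : tokens are removed from the cache, but the corresponding entries remain in the auxiliary tables. The deleted FTS_DOC_ID is recorded in a separate DELETED auxiliary table, and the stale entries are purged only when OPTIMIZE TABLE is run.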
Update : treated as a delete followed by an insert.
Search : first collects matching FTS_DOC_ID values, filters out deleted IDs, then retrieves rows ordered by relevance.
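The four operations above can be sketched together in Python. The class and method names are hypothetical, positions and relevance ordering are omitted, and the update method assumes the updated row receives a new document ID; the point is only to show deletion by marking and cleanup by a separate optimize step.

```python
class FtsDmlSketch:
    """Toy model of lazy deletion in a full-text index."""

    def __init__(self):
        self.postings = {}    # word -> set of doc_ids (stand-in for the inverted index)
        self.deleted = set()  # stand-in for the list of deleted document IDs

    def insert(self, doc_id: int, words: list[str]) -> None:
        for w in words:
            self.postings.setdefault(w, set()).add(doc_id)

    def delete(self, doc_id: int) -> None:
        # Index entries stay in place; the ID is only marked as deleted.
        self.deleted.add(doc_id)

    def update(self, old_id: int, new_id: int, words: list[str]) -> None:
        # Assumption: an update is a delete of the old ID plus an insert of a new one.
        self.delete(old_id)
        self.insert(new_id, words)

    def search(self, word: str) -> list[int]:
        # Collect matching IDs, then filter out the deleted ones.
        return sorted(self.postings.get(word, set()) - self.deleted)

    def optimize(self) -> None:
        # Physically remove stale entries, then clear the deleted list.
        for w in list(self.postings):
            self.postings[w] -= self.deleted
            if not self.postings[w]:
                del self.postings[w]
        self.deleted.clear()


fts = FtsDmlSketch()
fts.insert(1, ["mysql", "index"])
fts.insert(2, ["mysql"])
fts.delete(1)
print(fts.search("mysql"))       # [2]
print("index" in fts.postings)   # True (stale entry still stored)
fts.optimize()
print("index" in fts.postings)   # False
```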
Viewing Tokenization
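After inserting a row, the token cache can be inspected through the INNODB_FT_INDEX_CACHE table in information_schema. The innodb_ft_aux_table variable must first be set to the target table, in database/table form, or the query returns no rows.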
mysql> select * from information_schema.INNODB_FT_INDEX_CACHE;
+---------+--------------+-------------+-----------+--------+----------+
| WORD    | FIRST_DOC_ID | LAST_DOC_ID | DOC_COUNT | DOC_ID | POSITION |
+---------+--------------+-------------+-----------+--------+----------+
| hello   |            2 |           2 |         1 |      2 |        0 |
| mysql   |            2 |           2 |         1 |      2 |       18 |
| welcome |            2 |           2 |         1 |      2 |        7 |
| world   |            2 |           2 |         1 |      2 |       24 |
+---------+--------------+-------------+-----------+--------+----------+
References include MySQL technical books and online articles.
21CTO
