Databases 12 min read

Using MySQL Ngram Plugin to Enable Accurate Full‑Text Search for Chinese Text

This article explains why MySQL's default full‑text index struggles with Chinese, demonstrates how to configure token size parameters, activate the ngram parser plugin, and adjust queries (including Boolean mode) to achieve reliable Chinese full‑text search results.

Aikesheng Open Source Community

Jun 23, 2021

Using MySQL Ngram Plugin to Enable Accurate Full‑Text Search for Chinese Text

MySQL's default full‑text index processes text as a single‑byte stream, splitting tokens on spaces or punctuation, which works well for English but fails for Chinese because Chinese words are not separated by spaces. As a result, queries like match(s1) against ('我是') return empty sets even when matching rows exist.

The issue often stems from the token size limits defined by innodb_ft_min_token_size (default 3) and innodb_ft_max_token_size (default 84). When the search phrase does not meet these boundaries, MySQL does not generate index entries.

From MySQL 5.7 onward, the built‑in ngram full‑text parser plugin can handle Chinese by tokenizing fixed‑length character n‑grams. After confirming the plugin is active:

mysql> select * from information_schema.plugins where plugin_name = 'ngram'\G
*************************** 1. row ***************************
PLUGIN_NAME: ngram
PLUGIN_STATUS: ACTIVE
PLUGIN_TYPE: FTPARSER
LOAD_OPTION: ON

The ngram parser uses the system variable ngram_token_size (default 2) to define the length of each token. To switch a table to use ngram, alter the full‑text index with the WITH PARSER ngram clause:

mysql> alter table ft_ch drop key ft_s1, add fulltext ft_s1_n(s1) with parser ngram;

After re‑inserting data or rebuilding the index, searches for Chinese words return the expected rows. For example:

mysql> select * from ft_ch where match(s1) against ('中国');
+----+--------------------------------------+
| id | s1                                   |
+----+--------------------------------------+
| 1  | 我是中国人你是哪里人？                |
| 2  | 我是中国人，你是哪里人？              |
| 3  | 我是中国人 你是哪里人？               |
| 4  | 我是中国人你是哪里人                  |
+----+--------------------------------------+

When using the default natural‑language mode, MySQL splits the search term into its constituent n‑grams and combines results with OR logic, which can return unrelated rows. Switching to Boolean mode forces exact token matching:

mysql> select * from ft_ch where match(s1) against('中国人' in boolean mode);
+----+-----------------------------+
| id | s1                          |
+----+-----------------------------+
| 1  | 我是中国人，你呢？           |
+----+-----------------------------+

If a single‑character search is needed, the n‑gram size must be reduced (e.g., ngram_token_size=1) and the index rebuilt, after which queries like against('国' in boolean mode) succeed.

Overall, the ngram plugin provides a practical solution for Chinese full‑text search in MySQL, eliminating the need for manual tokenization or external search engines.

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

mysql tokenization Full‑Text Search Chinese Boolean Mode database indexing NGram

Written by

Aikesheng Open Source Community

The Aikesheng Open Source Community provides stable, enterprise‑grade MySQL open‑source tools and services, releases a premium open‑source component each year (1024), and continuously operates and maintains them.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.