How to Install and Use the IK Chinese Analyzer Plugin in Elasticsearch
This article explains why Elasticsearch's built‑in tokenizers struggle with Chinese text, introduces the IK analyzer plugin, provides step‑by‑step Docker and file‑based installation methods, shows how to configure custom dictionaries via Nginx, and demonstrates smart and max‑word tokenization queries.
Elasticsearch's built‑in tokenizers do not handle Chinese well, so searching Chinese terms such as “悟空哥” fails.
1. Tokenizer principles in Elasticsearch
1.1 Tokenizer concept
A tokenizer receives a character stream and splits it into individual tokens; combined with character filters and token filters, it forms an analyzer.
1.2 Standard tokenizer
The standard tokenizer splits text on word boundaries as defined by the Unicode Text Segmentation algorithm, removes most punctuation, and records each token's position and offsets, which is useful for phrase queries and highlighting.
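For example, calling the _analyze API on a short phrase returns each token together with its offsets and position (abridged response; values shown are illustrative of the standard analyzer's output):

```
POST _analyze
{
  "analyzer": "standard",
  "text": "study ELK"
}
```

```
{
  "tokens": [
    { "token": "study", "start_offset": 0, "end_offset": 5, "type": "<ALPHANUM>", "position": 0 },
    { "token": "elk",   "start_offset": 6, "end_offset": 9, "type": "<ALPHANUM>", "position": 1 }
  ]
}
```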
1.3 English and punctuation example
POST _analyze
{
  "analyzer": "standard",
  "text": "Do you know why I want to study ELK? 2 3 33..."
}

Result: do, you, know, why, i, want, to, study, elk, 2, 3, 33

1.4 Chinese tokenization example
POST _analyze
{
  "analyzer": "standard",
  "text": "悟空聊架构"
}

The standard tokenizer splits each Chinese character into its own token, producing 悟, 空, 聊, 架, 构 instead of the desired words.
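For intuition, the standard analyzer's handling of the English sentence above — lowercase everything, then break on runs of non-alphanumeric characters — can be roughly imitated with ordinary shell tools. This is an approximation only: the real tokenizer follows Unicode word-boundary rules, which is exactly why it degrades to single characters on Chinese, where word boundaries are not marked.

```shell
# Rough approximation of the standard analyzer on ASCII text:
# lowercase, then break on any run of non-alphanumeric characters.
echo "Do you know why I want to study ELK? 2 3 33..." \
  | tr '[:upper:]' '[:lower:]' \
  | tr -cs '[:alnum:]' '\n' \
  | sed '/^$/d'
# prints: do you know why i want to study elk 2 3 33 (one token per line)
```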
2. Installing the IK Chinese analyzer plugin
2.1 Plugin source
Download the plugin from:

https://github.com/medcl/elasticsearch-analysis-ik/releases

Match the plugin version to the Elasticsearch version (e.g., 7.4.2). Querying the cluster root endpoint (GET /) shows the running version:

{
  "name" : "8448ec5f3312",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "xC72O3nKSjWavYZ-EPt9Gw",
  "version" : {
    "number" : "7.4.2",
    "build_flavor" : "default",
    "build_type" : "docker",
    "build_hash" : "2f90bbf7b93631e52bafb59b3b049cb44ec25e96",
    "build_date" : "2019-10-28T20:40:44.881551Z",
    "build_snapshot" : false,
    "lucene_version" : "8.2.0",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

2.2 Installation methods
2.2.1 Inside the Elasticsearch container
Enter the container:

docker exec -it elasticsearch /bin/bash

Download the plugin zip:

wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip

Unzip into a dedicated folder and clean up:

unzip elasticsearch-analysis-ik-7.4.2.zip -d ./ik
chmod -R 777 ik/
rm -rf *.zip

2.2.2 Via a mapped directory
Copy the zip into the mapped plugins folder on the host:

cd /mydata/elasticsearch/plugins
wget https://github.com/medcl/elasticsearch-analysis-ik/releases/download/v7.4.2/elasticsearch-analysis-ik-7.4.2.zip
unzip elasticsearch-analysis-ik-7.4.2.zip -d ./ik
rm -rf *.zip

2.2.3 Upload with Xftp
Use XShell/Xftp to copy the zip into the container, then unzip as above.
3. Verifying the installation
docker exec -it elasticsearch /bin/bash
elasticsearch-plugin list

The command should output ik, confirming the plugin is installed. Exit and restart the container:

exit
docker restart elasticsearch

4. Using the IK analyzer
The plugin provides two analysis modes: ik_smart, which performs coarse-grained "smart" segmentation, and ik_max_word, which exhaustively produces the finest-grained segmentation.
Smart mode example
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "一颗小星星"
}

Result: “一颗”, “小星星”.
Max‑word mode example
POST _analyze
{
  "analyzer": "ik_max_word",
  "text": "一颗小星星"
}

Result: “一颗”, “一”, “颗”, “小星星”, “小星”, “星星”.
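A common pattern is to combine the two modes: index with ik_max_word so every plausible sub-word is searchable, and search with ik_smart so queries stay precise. A sketch, using a hypothetical index my_index and field content:

```
PUT /my_index
{
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "ik_max_word",
        "search_analyzer": "ik_smart"
      }
    }
  }
}
```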
5. Custom dictionary
To keep terms like “悟空哥” intact, add them to a custom dictionary and reference it in IKAnalyzer.cfg.xml (path: /usr/share/elasticsearch/plugins/ik/config/IKAnalyzer.cfg.xml ).
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
  <comment>IK Analyzer 扩展配置</comment>
  <!-- local extension dictionaries -->
  <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
  <!-- local extension stopword dictionary -->
  <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
  <!-- remote extension dictionary -->
  <entry key="remote_ext_dict">location</entry>
  <!-- remote extension stopword dictionary -->
  <entry key="remote_ext_stopwords">http://xxx.com/xxx.dic</entry>
</properties>

Place a file (e.g., ik.txt) on a remote Nginx server and set remote_ext_dict to its URL.
Deploying Nginx for remote dictionary
First start a temporary Nginx container and copy its default configuration to the host:

docker run -p 80:80 --name nginx -d nginx:1.10
docker container cp nginx:/etc/nginx ./conf
mkdir nginx
mv conf nginx/
docker stop nginx
docker rm nginx

Then start Nginx again with the html, logs, and conf directories mapped from the host:

docker run -p 80:80 --name nginx \
  -v /mydata/nginx/html:/usr/share/nginx/html \
  -v /mydata/nginx/logs:/var/log/nginx \
  -v /mydata/nginx/conf:/etc/nginx \
  -d nginx:1.10

Create ik.txt containing “悟空哥” under the mapped html directory so that it is accessible at http://192.168.56.10/ik/ik.txt. After updating IKAnalyzer.cfg.xml to point remote_ext_dict to this URL, restart Elasticsearch:
docker restart elasticsearch
docker update elasticsearch --restart=always

Now a query for “悟空哥聊架构” yields the three tokens “悟空哥”, “聊”, “架构”.
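Once IK has re-fetched the remote dictionary (it polls the URL periodically, so the new word may take a moment to appear), the result can be reproduced with an _analyze call — shown here with ik_smart:

```
POST _analyze
{
  "analyzer": "ik_smart",
  "text": "悟空哥聊架构"
}
```

Result: “悟空哥”, “聊”, “架构”.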
- END -
Wukong Talks Architecture
Explaining distributed systems and architecture through stories. Author of the "JVM Performance Tuning in Practice" column, open-source author of "Spring Cloud in Practice PassJava", and independently developed a PMP practice quiz mini-program.