How to Build a Custom HanLP Analyzer Plugin for Elasticsearch with Nginx
This guide walks through setting up a Java GraalVM 17 environment, installing Nginx to serve static dictionary files, configuring a HanLP‑based Elasticsearch analyzer plugin, packaging and deploying it, and testing the analyzer with JUnit5 and curl commands.
Preface
Project configuration
Java GraalVM 17
Elasticsearch 8.3.3
JUnit 5 5.9.0
Lombok 1.18.24
Logback 1.2.11
HanLP Chinese NLP toolkit 1.8.3
How to Use
Obtain HanLP corpus
Download data.zip directly from http://nlp.hankcs.com/download.php?file=data . The data will be used later.
Set up a local Nginx site to serve static content
Quickly install Nginx
On macOS, install Nginx with Homebrew:
brew install nginx
Then locate the nginx.conf file at the path shown in the installation output.
Configure Nginx
Edit nginx.conf. The following configuration can be used as a baseline:
worker_processes 1;
events {
    worker_connections 1024;
}
http {
    include mime.types;
    default_type application/octet-stream;
    sendfile on;
    keepalive_timeout 65;
    server {
        listen 8080;
        server_name local.wujunshen.com;
        location / {
            root /usr/local/var/www;
            index index.html index.htm;
        }
        error_page 500 502 503 504 /50x.html;
        location = /50x.html {
            root html;
        }
    }
    include servers/*;
}
Note the listen 8080; and server_name local.wujunshen.com; settings, which the following steps rely on.
Bind the domain to 127.0.0.1 by adding this line to /etc/hosts (root privileges are required):
127.0.0.1 local.wujunshen.com
The change takes effect as soon as the file is saved; no reload command is needed.
Restart Nginx:
brew services restart nginx
Verify the static site by opening http://local.wujunshen.com:8080/ in a browser; you should see the Nginx welcome page.
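Then unzip the data.zip downloaded earlier so that its data directory sits under the Nginx root (/usr/local/var/www in the configuration above), and access the dictionary file at
http://local.wujunshen.com:8080/data/dictionary/custom/CustomDictionary.txt
to confirm that static content is served.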
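When adding your own words to the served file, it helps to know the line layout. HanLP 1.x text dictionaries put one word per line, optionally followed by pairs of part-of-speech tag and frequency. The sketch below parses one such line. The class CustomDictionaryLine is a hypothetical helper written for illustration and is not part of the plugin; it also assumes that words contain no spaces.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** One parsed dictionary entry (hypothetical helper, not part of the plugin). */
final class CustomDictionaryLine {
    final String word;
    /** Part-of-speech tag -> frequency, in file order; empty when the line holds only the word. */
    final Map<String, Integer> natureFrequencies;

    private CustomDictionaryLine(String word, Map<String, Integer> natureFrequencies) {
        this.word = word;
        this.natureFrequencies = Collections.unmodifiableMap(natureFrequencies);
    }

    /** Parses "word [nature frequency]..." and rejects blank or unbalanced lines. */
    static CustomDictionaryLine parse(String line) {
        String[] parts = line.trim().split("\\s+");
        if (parts[0].isEmpty()) {
            throw new IllegalArgumentException("blank dictionary line");
        }
        if (parts.length % 2 == 0) {
            // An even count means a tag is missing its frequency.
            throw new IllegalArgumentException("nature without frequency: " + line);
        }
        Map<String, Integer> natures = new LinkedHashMap<>();
        for (int i = 1; i < parts.length; i += 2) {
            natures.put(parts[i], Integer.parseInt(parts[i + 1]));
        }
        return new CustomDictionaryLine(parts[0], natures);
    }
}
```

For example, CustomDictionaryLine.parse("攻城狮 nz 1024") yields the word 攻城狮 with the tag nz and frequency 1024, while a line holding only a word leaves the tag map empty so that the dictionary's default tag applies.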
Plugin Installation
Package the code:
mvn clean package
Note the hanlp-hot-update.cfg.xml configuration file inside the package:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>HanLP extension configuration</comment>
    <!-- User can configure remote extension dictionary here -->
    <entry key="remote_ext_dict">http://local.wujunshen.com:8080/data/dictionary/custom/CustomDictionary.txt</entry>
    <!-- User can configure remote stopword dictionary here -->
    <entry key="remote_ext_stopwords">http://local.wujunshen.com:8080/data/dictionary/stopwords.txt</entry>
    <!-- User can configure remote synonym dictionary here -->
    <entry key="remote_ext_synonyms">http://local.wujunshen.com:8080/data/dictionary/synonym/CoreSynonym.txt</entry>
</properties>
These entries point to the static files served by Nginx.
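The article does not show how the plugin detects that a remote dictionary has changed. One common way to do it, assumed here rather than taken from the plugin's source, is to send periodic HEAD requests and compare the Last-Modified and ETag response headers with the values seen on the previous poll. The class RemoteDictionaryMonitor below is a hypothetical sketch of that idea using only the JDK's java.net.http client.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Objects;

/** Remembers the validators of one remote dictionary (hypothetical, not the plugin's class). */
final class RemoteDictionaryMonitor {
    private String lastModified; // last seen Last-Modified header, null if none yet
    private String etag;         // last seen ETag header, null if none yet

    /** Returns true when either header differs from the previously seen value. */
    boolean hasChanged(String newLastModified, String newEtag) {
        boolean changed = !Objects.equals(lastModified, newLastModified)
                || !Objects.equals(etag, newEtag);
        lastModified = newLastModified;
        etag = newEtag;
        return changed;
    }

    /** Sends a HEAD request and reports whether the dictionary should be reloaded. */
    boolean poll(HttpClient client, URI dictionaryUri) throws IOException, InterruptedException {
        HttpRequest head = HttpRequest.newBuilder(dictionaryUri)
                .method("HEAD", HttpRequest.BodyPublishers.noBody())
                .build();
        HttpResponse<Void> response = client.send(head, HttpResponse.BodyHandlers.discarding());
        if (response.statusCode() != 200) {
            return false; // keep the current dictionary when the server is unavailable
        }
        return hasChanged(
                response.headers().firstValue("Last-Modified").orElse(null),
                response.headers().firstValue("ETag").orElse(null));
    }
}
```

A scheduler could call poll once per configured URL at a fixed interval and reload the dictionary whenever it returns true. Nginx sends both headers for static files by default, which is one reason a plain static site is enough for this setup.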
Publish the plugin
Unzip the generated .zip plugin package (found in /target/releases) into the plugins directory of Elasticsearch.
Run the plugin
Stop Elasticsearch if it is running, then start it again from its bin directory so that the new plugin is loaded.
Note: Do not start Elasticsearch with the root account. Create a non‑root user, grant necessary permissions, and run Elasticsearch under that user.
Check Plugin Execution Result
Use the Kibana "Dev Tools" console (or any HTTP client) to send an analyze request against an existing index (index-test here):
curl -H "Content-Type: application/json" -X POST -d '{
  "analyzer": "hanlp_synonym",
  "text": "英特纳雄耐尔"
}' "http://localhost:9200/index-test/_analyze?pretty=true"
The response shows the tokenization results.
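The same request can be built from Java with the JDK's java.net.http client. The sketch below assumes, like the curl command, that the node is reachable over plain HTTP without authentication. AnalyzeRequestSketch and its methods are hypothetical names, and the JSON escaping only covers backslashes and double quotes, so a real client should use a JSON library instead.

```java
import java.net.URI;
import java.net.http.HttpRequest;

/** Builds an _analyze request (hypothetical helper for illustration). */
final class AnalyzeRequestSketch {

    /** Escapes backslashes and double quotes only; control characters are not handled. */
    static String escapeJson(String value) {
        return value.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    /** Produces the JSON body with the analyzer name and the text to analyze. */
    static String body(String analyzer, String text) {
        return "{\"analyzer\":\"" + escapeJson(analyzer)
                + "\",\"text\":\"" + escapeJson(text) + "\"}";
    }

    /** Builds a POST request to <baseUrl>/<index>/_analyze?pretty=true. */
    static HttpRequest analyzeRequest(String baseUrl, String index, String analyzer, String text) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_analyze?pretty=true"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(analyzer, text)))
                .build();
    }
}
```

To send it, pass the result of analyzeRequest("http://localhost:9200", "index-test", "hanlp_synonym", "英特纳雄耐尔") to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) and print the response body.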
Development Overview
ES Analyzer Overview
Elasticsearch's built-in standard analyzer works well for English, but it breaks Chinese text into single characters, so Chinese needs a dedicated analyzer. A typical analyzer plugin consists of an Analyzer, a Tokenizer, and optional TokenFilters, all driven by an underlying tokenization algorithm.
This project uses HanLP, a widely used Chinese NLP toolkit, as that algorithm.
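The relationship between an analyzer, its tokenizer, and its token filters can be illustrated with a toy pipeline in plain Java. This is not the Lucene API and not the plugin's code: ToyAnalyzer is a hypothetical class that models a tokenizer as a function from text to tokens and each filter as a function from tokens to tokens, with a whitespace tokenizer standing in for HanLP.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.Function;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Toy model of analyzer = tokenizer + ordered token filters (not the Lucene API). */
final class ToyAnalyzer {
    private final Function<String, List<String>> tokenizer;
    private final List<UnaryOperator<List<String>>> filters;

    ToyAnalyzer(Function<String, List<String>> tokenizer,
                List<UnaryOperator<List<String>>> filters) {
        this.tokenizer = tokenizer;
        this.filters = filters;
    }

    /** Splits the text once, then applies every filter in order. */
    List<String> analyze(String text) {
        List<String> tokens = tokenizer.apply(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    /** Whitespace tokenizer followed by a lowercase filter and a stopword filter. */
    static ToyAnalyzer whitespaceLowercaseStop(Set<String> stopwords) {
        Function<String, List<String>> tokenizer = text ->
                Arrays.stream(text.trim().split("\\s+"))
                        .filter(t -> !t.isEmpty())
                        .collect(Collectors.toList());
        UnaryOperator<List<String>> lowercase = tokens ->
                tokens.stream().map(t -> t.toLowerCase(Locale.ROOT)).collect(Collectors.toList());
        UnaryOperator<List<String>> stop = tokens ->
                tokens.stream().filter(t -> !stopwords.contains(t)).collect(Collectors.toList());
        return new ToyAnalyzer(tokenizer, List.of(lowercase, stop));
    }
}
```

In the real plugin the tokenizer step is where HanLP segmentation runs, and steps such as stopword removal or synonym expansion fit naturally as filters after it.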
HanLP Brief Introduction
HanLP is a collection of models and algorithms for natural language processing, offering high performance, clear architecture, and extensibility. It supports dictionary‑based and model‑based tokenization, custom dictionaries, and hot‑update of dictionaries.
Project Code Structure
assemblies: plugin packaging configuration (plugin.xml)
com.wujunshen.core: core analyzer classes
com.wujunshen.dictionary: synonym dictionary classes
com.wujunshen.enumation: related enums
com.wujunshen.exception: custom exceptions
com.wujunshen.nature: natural segmentation attributes
com.wujunshen.plugin: plugin definition
com.wujunshen.update: hot‑update handling
com.wujunshen.utils: utility classes
resources: plugin resources, including HanLP configuration, Java security policy, logback settings, etc.
test: unit tests using JUnit5 (e.g., MyAnalyzerTest)
Unit Test Example
The private method analyze in the test class demonstrates how to use the analyzer:
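// Lucene and Jackson imports used by this method. Token, Tokens, MyAnalyzer,
// and SegmentationType are project classes.
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import com.fasterxml.jackson.databind.SerializationFeature;

private List<Token> analyze(SegmentationType segmentationType, String text) throws IOException {
    Tokens result = new Tokens();
    List<Token> resultList = new ArrayList<>();
    Analyzer analyzer = new MyAnalyzer(segmentationType);
    TokenStream tokenStream = analyzer.tokenStream("text", text);
    tokenStream.reset(); // must be called before the first incrementToken()
    while (tokenStream.incrementToken()) {
        CharTermAttribute charTermAttribute = tokenStream.getAttribute(CharTermAttribute.class);
        TypeAttribute typeAttribute = tokenStream.getAttribute(TypeAttribute.class);
        OffsetAttribute offsetAttribute = tokenStream.getAttribute(OffsetAttribute.class);
        PositionIncrementAttribute positionIncrementAttribute = tokenStream.getAttribute(PositionIncrementAttribute.class);
        Token token = new Token();
        token.setToken(charTermAttribute.toString());
        token.setStartOffset(offsetAttribute.startOffset());
        token.setEndOffset(offsetAttribute.endOffset());
        token.setType(typeAttribute.type());
        // Stores the increment relative to the previous token, not an absolute position.
        token.setPosition(positionIncrementAttribute.getPositionIncrement());
        resultList.add(token);
    }
    tokenStream.end();   // finalizes end-of-stream state, required before close()
    tokenStream.close();
    result.setTokens(resultList);
    objectMapper.enable(SerializationFeature.INDENT_OUTPUT);
    log.info("{}\n", objectMapper.writeValueAsString(result));
    return resultList;
}
The TokenStream workflow consists of instantiation, reset, repeated incrementToken, end, and close calls.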
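The test method stores the raw position increment of each token. If absolute positions are wanted instead, as the _analyze API reports them, they can be derived by accumulating the increments, following the Lucene convention that the position starts at -1 before the first token. TokenPositions below is a hypothetical helper that sketches this, independent of Lucene itself.

```java
import java.util.ArrayList;
import java.util.List;

/** Converts Lucene-style position increments into absolute positions (hypothetical helper). */
final class TokenPositions {

    static List<Integer> absolutePositions(List<Integer> increments) {
        List<Integer> positions = new ArrayList<>();
        int position = -1; // so that a first token with increment 1 lands on position 0
        for (int increment : increments) {
            position += increment;
            positions.add(position);
        }
        return positions;
    }
}
```

An increment of 0 places a token at the same position as the previous one, which is how synonyms are usually stacked, and an increment greater than 1 leaves a gap, for example where a stopword was removed. With the increments 1, 1, 0, 2 the resulting positions are 0, 1, 1, 3.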
Security Policy File
The plugin-security.policy file must be placed under resources. Without it, Elasticsearch may throw AccessControlException when loading remote dictionaries.
Additional Initialization Code
Static initialization in MyTokenizer.java sets up security permissions and loads HanLP segments with various recognizers:
static {
    // Elasticsearch convention: check for SpecialPermission before running privileged code.
    SecurityManager sm = System.getSecurityManager();
    if (sm != null) {
        sm.checkPermission(new SpecialPermission());
    }
    // Register a custom part-of-speech tag named "auxiliary".
    AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
        Nature.create("auxiliary");
        return null;
    });
    AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
        // NLP mode: part-of-speech tagging plus name, quantity, and organization recognizers.
        nlpSegment = HanLP.newSegment()
                .enablePartOfSpeechTagging(true)
                .enableOffset(true)
                .enableNameRecognize(true)
                .enableJapaneseNameRecognize(true)
                .enableNumberQuantifierRecognize(true)
                .enableOrganizationRecognize(true)
                .enableTranslatedNameRecognize(true);
        // Index mode: also emits the shorter words contained in long words.
        indexSegment = HanLP.newSegment()
                .enableIndexMode(true)
                .enablePartOfSpeechTagging(true)
                .enableOffset(true);
        // Segment a sample sentence once so that the dictionaries are loaded at startup.
        log.info(String.valueOf(nlpSegment.seg("HanLP中文分词工具包!")));
        log.info(String.valueOf(indexSegment.seg("HanLP中文分词工具包!")));
        return null;
    });
}
Summary
Three built‑in tokenization modes for different scenarios (indexing, NLP, synonym indexing).
Support for external dictionaries via a static Nginx site.
Custom dictionary support at the analyzer level.
Remote dictionary hot‑update capability.
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
