
How to Build a Custom HanLP Analyzer Plugin for Elasticsearch with Nginx

This guide walks through setting up a Java GraalVM 17 environment, installing Nginx to serve static dictionary files, configuring a HanLP‑based Elasticsearch analyzer plugin, packaging and deploying it, and testing the analyzer with JUnit5 and curl commands.

Programmer DD

Aug 30, 2022

Preface

Project configuration

Java GraalVM 17

Elasticsearch 8.3.3

JUnit 5.9.0

Lombok 1.18.24

Logback 1.2.11

HanLP Chinese NLP toolkit 1.8.3

How to Use

Obtain HanLP corpus

Download data.zip directly from http://nlp.hankcs.com/download.php?file=data . The data will be used later.

Set up a local Nginx site to serve static content

Quickly install Nginx

On macOS, install Nginx with Homebrew:

brew install nginx

Then locate the nginx.conf file at the path shown in the installation output.

Configure Nginx

Edit nginx.conf. The following configuration can be used as a baseline:

worker_processes  1;

events {
    worker_connections  1024;
}

http {
    include       mime.types;
    default_type  application/octet-stream;

    sendfile        on;
    keepalive_timeout  65;

    server {
        listen       8080;
        server_name  local.wujunshen.com;

        location / {
            root   /usr/local/var/www;
            index  index.html index.htm;
        }

        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }

    include servers/*;
}

Note the listen 8080; and server_name local.wujunshen.com settings.

Bind the domain to 127.0.0.1 by adding this line to /etc/hosts:

127.0.0.1 local.wujunshen.com

The change takes effect as soon as the file is saved. /etc/hosts is not a shell script, so it does not need to be sourced.

Restart Nginx:

brew services restart nginx

Verify the static site by opening http://local.wujunshen.com:8080/ in a browser; you should see the Nginx welcome page.

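Then, with data.zip extracted into the Nginx root directory (/usr/local/var/www in the configuration above), access the dictionary file at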

http://local.wujunshen.com:8080/data/dictionary/custom/CustomDictionary.txt

to confirm static content is served.

Plugin Installation

Package the code:

mvn clean package

Note the hanlp-hot-update.cfg.xml configuration file inside the package:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>HanLP extension configuration</comment>
    <!-- User can configure remote extension dictionary here -->
    <entry key="remote_ext_dict">http://local.wujunshen.com:8080/data/dictionary/custom/CustomDictionary.txt</entry>
    <!-- User can configure remote stopword dictionary here -->
    <entry key="remote_ext_stopwords">http://local.wujunshen.com:8080/data/dictionary/stopwords.txt</entry>
    <!-- User can configure remote synonym dictionary here -->
    <entry key="remote_ext_synonyms">http://local.wujunshen.com:8080/data/dictionary/synonym/CoreSynonym.txt</entry>
</properties>

These entries point to the static files served by Nginx.
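Because the file follows the standard Java XML properties format, its entries can be read with java.util.Properties. The sketch below is one way to do that and is not necessarily how the plugin itself loads the file; the class name HotUpdateConfig and its methods are assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper: reads hot-update entries from XML in the java.util.Properties format.
final class HotUpdateConfig {
    private final Properties properties = new Properties();

    HotUpdateConfig(InputStream xml) throws IOException {
        // loadFromXML expects the properties.dtd DOCTYPE declared in the file above.
        properties.loadFromXML(xml);
    }

    // Convenience factory for XML held in memory; wraps I/O failures as unchecked.
    static HotUpdateConfig fromString(String xml) {
        try {
            return new HotUpdateConfig(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Returns the trimmed value for the key, or null when the key is absent.
    String get(String key) {
        String value = properties.getProperty(key);
        return value == null ? null : value.trim();
    }
}
```

With the configuration shown above, get("remote_ext_dict") would return the CustomDictionary.txt URL served by Nginx.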

Publish the plugin

Unzip the generated .zip plugin package (found in /target/releases) into the plugins directory of Elasticsearch.

Run the plugin

Restart Elasticsearch by launching it from its bin directory.

Note: Do not start Elasticsearch with the root account. Create a non‑root user, grant necessary permissions, and run Elasticsearch under that user.

Check Plugin Execution Result

Use the Kibana "Dev Tools" console (or any HTTP client) to send an analyze request:

curl -H "Content-Type: application/json" -X POST -d '{
  "analyzer": "hanlp_synonym",
  "text": "英特纳雄耐尔"
}' "http://localhost:9200/index-test/_analyze?pretty=true"

The response lists the tokens produced by the analyzer, each with its offsets, type, and position.
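The same request can also be issued from Java with the JDK's built-in java.net.http client. This is a minimal sketch under several assumptions: the class name AnalyzeRequests is hypothetical, the JSON body is assembled by hand and assumes the analyzer name and text need no JSON escaping, and the node is assumed to be reachable over plain HTTP without authentication, as in the curl example.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical helper that mirrors the curl call to the _analyze endpoint.
final class AnalyzeRequests {
    // Builds a POST to <baseUrl>/<index>/_analyze?pretty=true with a hand-written JSON body.
    static HttpRequest build(String baseUrl, String index, String analyzer, String text) {
        String body = "{\"analyzer\":\"" + analyzer + "\",\"text\":\"" + text + "\"}";
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_analyze?pretty=true"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body, StandardCharsets.UTF_8))
                .build();
    }

    // Sends the request and returns the raw JSON response body.
    static String send(HttpRequest request) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        return client.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```

Against a running node, AnalyzeRequests.send(AnalyzeRequests.build("http://localhost:9200", "index-test", "hanlp_synonym", "英特纳雄耐尔")) should return the same JSON as the curl command.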

Development Overview

ES Analyzer Overview

Elasticsearch's built-in standard analyzer works well for English, but it breaks Chinese text into single characters, so Chinese needs a dedicated analyzer. In Lucene terms, an Analyzer combines a Tokenizer with zero or more TokenFilters, and the quality of the result depends on the underlying segmentation algorithm.
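The division of labor between a tokenizer and its token filters can be illustrated without Lucene. The following toy pipeline is only a conceptual sketch in plain Java: ToyAnalyzer, its whitespace tokenizer, and its two filters are made up for illustration and are not part of Lucene, Elasticsearch, or this plugin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;

// Toy model of an analyzer: one tokenizer followed by a chain of token filters.
final class ToyAnalyzer {
    private final List<UnaryOperator<List<String>>> filters = new ArrayList<>();

    // Registers a filter; filters run in the order they were added.
    ToyAnalyzer addFilter(UnaryOperator<List<String>> filter) {
        filters.add(filter);
        return this;
    }

    // Tokenizer stage: split raw text on whitespace.
    private static List<String> tokenize(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Runs the tokenizer, then each filter in turn.
    List<String> analyze(String text) {
        List<String> tokens = tokenize(text);
        for (UnaryOperator<List<String>> filter : filters) {
            tokens = filter.apply(tokens);
        }
        return tokens;
    }

    // Filter: lowercase every token.
    static UnaryOperator<List<String>> lowercase() {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                out.add(t.toLowerCase(Locale.ROOT));
            }
            return out;
        };
    }

    // Filter: drop tokens contained in the stopword set.
    static UnaryOperator<List<String>> stopwords(Set<String> stop) {
        return tokens -> {
            List<String> out = new ArrayList<>();
            for (String t : tokens) {
                if (!stop.contains(t)) {
                    out.add(t);
                }
            }
            return out;
        };
    }
}
```

For example, new ToyAnalyzer().addFilter(ToyAnalyzer.lowercase()).addFilter(ToyAnalyzer.stopwords(Set.of("the"))).analyze("The Quick fox") yields [quick, fox]. A Chinese analyzer keeps the same shape but replaces the whitespace tokenizer with a real segmenter.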

This project uses HanLP, a widely used Chinese NLP toolkit, for segmentation.

HanLP Brief Introduction

HanLP is a collection of models and algorithms for natural language processing, offering high performance, clear architecture, and extensibility. It supports dictionary‑based and model‑based tokenization, custom dictionaries, and hot‑update of dictionaries.
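To give a feel for what dictionary-based tokenization means, here is a forward maximum matching segmenter in plain Java. It is a deliberately simple stand-in and not HanLP's actual algorithm or API; the class name MaxMatchSegmenter and the sample dictionaries are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Forward maximum matching: at each position take the longest dictionary word,
// falling back to a single character when nothing matches.
final class MaxMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    MaxMatchSegmenter(Set<String> dictionary) {
        this.dictionary = dictionary;
        int max = 1;
        for (String word : dictionary) {
            max = Math.max(max, word.length());
        }
        this.maxWordLength = max;
    }

    // Splits text into tokens; lengths are counted in UTF-16 chars.
    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxWordLength);
            // Shrink the window until it matches a dictionary word or is one character long.
            while (end > start + 1 && !dictionary.contains(text.substring(start, end))) {
                end--;
            }
            tokens.add(text.substring(start, end));
            start = end;
        }
        return tokens;
    }
}
```

With the dictionary {英特纳雄耐尔, 一定, 实现}, the text 英特纳雄耐尔一定要实现 is split into 英特纳雄耐尔, 一定, 要, 实现. Remove 英特纳雄耐尔 from the dictionary and the same word falls apart into six single characters, which is why custom dictionaries matter for domain terms.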

Project Code Structure

assemblies: plugin packaging configuration (plugin.xml)

com.wujunshen.core: core analyzer classes

com.wujunshen.dictionary: synonym dictionary classes

com.wujunshen.enumation: related enums

com.wujunshen.exception: custom exceptions

com.wujunshen.nature: natural segmentation attributes

com.wujunshen.plugin: plugin definition

com.wujunshen.update: hot‑update handling

com.wujunshen.utils: utility classes

resources: plugin resources, including HanLP configuration, Java security policy, logback settings, etc.

test: unit tests using JUnit5 (e.g., MyAnalyzerTest)

Unit Test Example

The private method analyze in the test class demonstrates how to use the analyzer:

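import com.fasterxml.jackson.databind.SerializationFeature;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Token, Tokens, SegmentationType, and MyAnalyzer are project classes;
// objectMapper and log are fields of the test class.
private List<Token> analyze(SegmentationType segmentationType, String text) throws IOException {
    List<Token> resultList = new ArrayList<>();
    // Analyzer and TokenStream are both Closeable, so try-with-resources closes them.
    try (Analyzer analyzer = new MyAnalyzer(segmentationType);
         TokenStream tokenStream = analyzer.tokenStream("text", text)) {
        // Attribute instances are reused for every token, so obtain them once.
        CharTermAttribute charTermAttribute = tokenStream.addAttribute(CharTermAttribute.class);
        TypeAttribute typeAttribute = tokenStream.addAttribute(TypeAttribute.class);
        OffsetAttribute offsetAttribute = tokenStream.addAttribute(OffsetAttribute.class);
        PositionIncrementAttribute positionIncrementAttribute =
            tokenStream.addAttribute(PositionIncrementAttribute.class);

        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            Token token = new Token();
            token.setToken(charTermAttribute.toString());
            token.setStartOffset(offsetAttribute.startOffset());
            token.setEndOffset(offsetAttribute.endOffset());
            token.setType(typeAttribute.type());
            // This stores the position increment, not an absolute position.
            token.setPosition(positionIncrementAttribute.getPositionIncrement());
            resultList.add(token);
        }
        // Signal end of stream before it is closed.
        tokenStream.end();
    }
    Tokens result = new Tokens();
    result.setTokens(resultList);
    objectMapper.enable(SerializationFeature.INDENT_OUTPUT);
    log.info("{}\n", objectMapper.writeValueAsString(result));
    return resultList;
}

The TokenStream workflow consists of instantiation, obtaining the attributes, reset(), repeated incrementToken() calls until it returns false, end(), and close().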

Security Policy File

The plugin-security.policy file must be placed under resources. Without it, Elasticsearch may throw AccessControlException when loading remote dictionaries.

Additional Initialization Code

Static initialization in MyTokenizer.java checks the required security permission and then, inside privileged blocks, creates the HanLP segmenters with various recognizers enabled:

static {
    SecurityManager sm = System.getSecurityManager();
    if (sm != null) {
        sm.checkPermission(new SpecialPermission());
    }
    AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
        Nature.create("auxiliary");
        return null;
    });
    AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
        nlpSegment = HanLP.newSegment()
            .enablePartOfSpeechTagging(true)
            .enableOffset(true)
            .enableNameRecognize(true)
            .enableJapaneseNameRecognize(true)
            .enableNumberQuantifierRecognize(true)
            .enableOrganizationRecognize(true)
            .enableTranslatedNameRecognize(true);
        indexSegment = HanLP.newSegment()
            .enableIndexMode(true)
            .enablePartOfSpeechTagging(true)
            .enableOffset(true);
        log.info(String.valueOf(nlpSegment.seg("HanLP中文分词工具包！")));
        log.info(String.valueOf(indexSegment.seg("HanLP中文分词工具包！")));
        return null;
    });
}

Summary

Three built‑in tokenization modes for different scenarios (indexing, NLP, synonym indexing).

Support for external dictionaries via a static Nginx site.

Custom dictionary support at the analyzer level.

Remote dictionary hot‑update capability.
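Remote hot-update generally comes down to polling the dictionary URL and reloading only when the file has changed. The sketch below shows one common way to detect a change, by comparing the Last-Modified and ETag response headers between polls. Whether this plugin uses exactly this check is an assumption; RemoteDictionaryWatcher and its method names are hypothetical.

```java
import java.util.Objects;

// Hypothetical change detector for a remote dictionary, driven by HTTP validator headers.
final class RemoteDictionaryWatcher {
    private boolean polled;      // false until the first poll has been recorded
    private String lastModified; // Last-Modified header from the previous poll, may be null
    private String etag;         // ETag header from the previous poll, may be null

    // Records the headers of the latest poll and reports whether a reload is needed.
    // The first poll always reports a change so the dictionary is loaded once.
    boolean update(String newLastModified, String newEtag) {
        boolean changed = !polled
                || !Objects.equals(lastModified, newLastModified)
                || !Objects.equals(etag, newEtag);
        polled = true;
        lastModified = newLastModified;
        etag = newEtag;
        return changed;
    }
}
```

A scheduled task could issue a HEAD request to each configured URL, pass the two header values to update, and reload the dictionary only when it returns true. Nginx sends both headers for static files by default, so the static site set up earlier is sufficient for this kind of check.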
