Build a File Upload & Search System with Elasticsearch and IK Analyzer

This guide walks through creating a file upload service that indexes Word, PDF, and TXT files in Elasticsearch, uses an ingest‑attachment pipeline to extract text, configures Chinese IK analyzers for precise keyword search, and demonstrates Java code for indexing, querying, and highlighting results.

Programmer DD

Elasticsearch Overview

Elasticsearch is an open‑source search engine built on top of Apache Lucene. It exposes a REST API, allowing you to send keyword queries and receive matching documents.

Development Environment

Install Elasticsearch, the elasticsearch-head UI, and Kibana. All three run out of the box; just make sure the Kibana version matches the Elasticsearch version.

Core Requirements

Support file upload and download.

Search files by keyword, including the text inside Word, PDF, and TXT documents.

Why Elasticsearch

Elasticsearch provides fast full‑text search and a powerful ingest node that can preprocess documents before indexing.

Ingest Node & Attachment Processor

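Since Elasticsearch 5.0, ingest nodes can run pipelines that transform documents before they are indexed. The ingest-attachment plugin adds an attachment processor that extracts text from binary files (PDF, Word, etc.) during indexing. Install it on each node and restart the node afterwards: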

./bin/elasticsearch-plugin install ingest-attachment

Define the Attachment Pipeline

PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {"attachment": {"field": "content", "ignore_missing": true}},
    {"remove": {"field": "content", "ignore_missing": true}}
  ]
}
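
Before indexing real files, the pipeline can be checked with the ingest simulate API (POST /_ingest/pipeline/attachment/_simulate). The following is a minimal sketch of building that request with the JDK's java.net.http classes (Java 11+). The class name PipelineSimulation and the address http://localhost:9200 are assumptions for a local, unsecured development node; the code only builds the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.util.Base64;

final class PipelineSimulation {
    // Assumed address of a local development node; adjust to your setup.
    static final String ES_URL = "http://localhost:9200";

    // Wraps the Base64-encoded bytes in the "docs" list the simulate API expects.
    static String simulateBody(byte[] fileBytes) {
        String base64 = Base64.getEncoder().encodeToString(fileBytes);
        // Base64 output uses only letters, digits, '+', '/' and '=',
        // so it can be placed in a JSON string without escaping.
        return "{\"docs\":[{\"_source\":{\"content\":\"" + base64 + "\"}}]}";
    }

    // Builds (but does not send) the simulate request for the "attachment" pipeline.
    static HttpRequest simulateRequest(byte[] fileBytes) {
        return HttpRequest.newBuilder()
                .uri(URI.create(ES_URL + "/_ingest/pipeline/attachment/_simulate"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(simulateBody(fileBytes)))
                .build();
    }
}
```

Sending the request with HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) against a running node should return the document as the pipeline would store it, with the extracted text under attachment.content and the original content field removed.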

Document Mapping

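Create an index (e.g., docwrite) whose mapping includes an attachment.content field for the extracted text and a name field for the file name. Both are analyzed with IK analyzers: ik_max_word for name and ik_smart for attachment.content. These analyzers come from the IK analysis plugin (elasticsearch-analysis-ik), which must be installed before the index is created.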

PUT /docwrite
{
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},
      "name": {"type": "text", "analyzer": "ik_max_word"},
      "type": {"type": "keyword"},
      "attachment": {
        "properties": {
          "content": {"type": "text", "analyzer": "ik_smart"}
        }
      }
    }
  }
}

Encoding Files

Read a file, convert its bytes to Base64, and store the result in a FileObj Java class.

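import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Base64;

public class FileObj {
    private String id;      // file id
    private String name;    // file name
    private String type;    // file extension, e.g. pdf, docx, txt
    private String content; // Base64-encoded file content
    // Accessors: readFile calls the setters, and JSON serializers typically read the getters
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getType() { return type; }
    public void setType(String type) { this.type = type; }
    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }
}
// Belongs in the upload service class, not in FileObj
public FileObj readFile(String path) throws IOException {
    File file = new File(path);
    FileObj obj = new FileObj();
    obj.setName(file.getName());
    obj.setType(file.getName().substring(file.getName().lastIndexOf('.') + 1));
    byte[] bytes = Files.readAllBytes(file.toPath());
    obj.setContent(Base64.getEncoder().encodeToString(bytes));
    return obj;
}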

Indexing Files

Use the high‑level REST client to send an IndexRequest with the JSON representation of FileObj and specify the attachment pipeline.

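// client is a RestHighLevelClient; JSON.toJSONString is presumably fastjson's com.alibaba.fastjson.JSON
public void upload(FileObj file) throws IOException {
    // Index into docwrite, the index whose mapping was created above
    IndexRequest req = new IndexRequest("docwrite");
    req.source(JSON.toJSONString(file), XContentType.JSON);
    // Run the attachment pipeline so the Base64 content becomes attachment.content
    req.setPipeline("attachment");
    IndexResponse resp = client.index(req, RequestOptions.DEFAULT);
    System.out.println(resp);
}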
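
For reference, the same indexing call can be expressed as a plain REST request: the pipeline is selected with the pipeline query parameter of the index API. The sketch below builds such a request with java.net.http (Java 11+). The class name RestIndexExample, the address http://localhost:9200, and the assumption that ids are URL-safe (for example UUIDs) are illustrative and not part of the original code.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class RestIndexExample {
    // Assumed address of a local development node; adjust to your setup.
    static final String ES_URL = "http://localhost:9200";

    // Builds (but does not send) PUT /{index}/_doc/{id}?pipeline=attachment.
    // The id is assumed to be URL-safe, so it is not percent-encoded here.
    static HttpRequest indexRequest(String index, String id, String jsonDocument) {
        URI uri = URI.create(ES_URL + "/" + index + "/_doc/" + id + "?pipeline=attachment");
        return HttpRequest.newBuilder()
                .uri(uri)
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(jsonDocument))
                .build();
    }
}
```

The jsonDocument argument is the serialized FileObj, including its Base64 content field. Seeing the raw request can help when debugging pipeline errors, because it can be replayed from Kibana's console or curl.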

Searching Files

Build a SearchRequest that queries attachment.content using the IK smart analyzer, and optionally add highlighting.

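// keyword is the user's search term; client is the RestHighLevelClient used for indexing
SearchRequest searchRequest = new SearchRequest("docwrite");
SearchSourceBuilder srb = new SearchSourceBuilder();
srb.query(QueryBuilders.matchQuery("attachment.content", keyword)
    .analyzer("ik_smart"));
// Wrap matched terms in <em> tags so the UI can highlight them
HighlightBuilder hb = new HighlightBuilder();
hb.field(new HighlightBuilder.Field("attachment.content"));
hb.preTags("<em>");
hb.postTags("</em>");
srb.highlighter(hb);
searchRequest.source(srb);
SearchResponse resp = client.search(searchRequest, RequestOptions.DEFAULT);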
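
The builder calls above correspond to a JSON search body with a match query and a highlight section. As a rough sketch of that correspondence, the following JDK-only helper assembles the body by hand. The class name SearchBodyExample and the minimal escape helper are assumptions for illustration; a real application would normally let a JSON library or the client build this.

```java
final class SearchBodyExample {
    // Minimal JSON string escaping for the user-supplied keyword:
    // quotes, backslashes, and control characters.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Match query on attachment.content with the ik_smart analyzer,
    // plus highlighting of the same field using <em> tags.
    static String searchBody(String keyword) {
        return "{"
            + "\"query\":{\"match\":{\"attachment.content\":{"
            + "\"query\":\"" + escape(keyword) + "\",\"analyzer\":\"ik_smart\"}}},"
            + "\"highlight\":{\"pre_tags\":[\"<em>\"],\"post_tags\":[\"</em>\"],"
            + "\"fields\":{\"attachment.content\":{}}}"
            + "}";
    }
}
```

Posting this body to the index's _search endpoint returns hits whose highlight section holds the matching fragments of attachment.content, with each matched term wrapped in the configured tags.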

Multi‑File Testing

Upload a folder containing various file types and verify the indexed documents via the elasticsearch-head UI.
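
One way to drive such a test is to walk the folder and keep only the file types the system is meant to index. The sketch below uses java.nio.file for that. The class name FolderScanner and the extension list (pdf, doc, docx, txt) are assumptions and can be adjusted.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FolderScanner {
    // Extensions treated as indexable; this list is an assumption.
    static final Set<String> SUPPORTED = Set.of("pdf", "doc", "docx", "txt");

    // Returns the lower-cased extension, or "" when the name has no dot.
    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Recursively collects regular files with a supported extension, in sorted order.
    static List<Path> findUploadable(Path folder) throws IOException {
        try (Stream<Path> paths = Files.walk(folder)) {
            return paths.filter(Files::isRegularFile)
                        .filter(p -> SUPPORTED.contains(extensionOf(p)))
                        .sorted()
                        .collect(Collectors.toList());
        }
    }
}
```

Each returned path can then be passed to readFile and upload in a loop, after which the document count shown in elasticsearch-head should match the number of files found.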

Remaining Issues

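File length: by default the attachment processor extracts only the first 100,000 characters of a document. This limit is its indexed_chars option; raise it in the pipeline definition, or set it to -1 for no limit, to index longer texts.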

Memory usage: Reading whole files into memory can exhaust resources for very large files; streaming approaches should be considered.
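
As one possible direction for the memory issue, the Base64 step itself can be streamed with the JDK's Base64.Encoder.wrap, so the raw bytes and the encoded string never have to sit in memory together. This is a minimal sketch (Java 9+ for transferTo); the class name StreamingBase64 and the idea of encoding to a temporary file first are assumptions, not something the original article does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

final class StreamingBase64 {
    // Copies source to target as Base64 through a small internal buffer
    // instead of loading the whole file with Files.readAllBytes.
    static void encodeFile(Path source, Path target) throws IOException {
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Base64.getEncoder().wrap(Files.newOutputStream(target))) {
            // Closing the wrapped stream writes the final padding and closes the file.
            in.transferTo(out);
        }
    }
}
```

This only reduces memory use on the application side. The encoded content still travels to Elasticsearch inside a single index request, so request size limits on the server still apply to very large files.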

Written by

Programmer DD

A tinkering programmer and author of "Spring Cloud Microservices in Action"
