Implementing File Upload and Keyword Search with Elasticsearch and Ingest Attachment Plugin
This article demonstrates how to use Elasticsearch, its ingest‑attachment plugin, and the IK analyzer to upload various file types, preprocess them, store them in an index, and perform accurate keyword searches with highlighting, providing complete Java code examples and configuration steps.
The author, a senior architect, explains why Elasticsearch is chosen for a project that requires file upload, download, and precise keyword search across Word, PDF, and TXT documents. Elasticsearch wraps Lucene, offers REST APIs, and supports plugins for advanced features.
Elasticsearch Overview
Elasticsearch is an open‑source search engine built on Lucene, providing distributed storage and RESTful APIs. Companion tools such as Kibana and elasticsearch‑head are used for visualization and management.
Development Environment
Install Elasticsearch, elasticsearch‑head, and Kibana. Ensure the Kibana version matches the Elasticsearch version. The default Elasticsearch port is 9200, and elasticsearch‑head runs on 9100.
Core Problems
The two main challenges are file upload and keyword query.
File Upload
Plain text files are straightforward, but PDF and Word files contain extra metadata that must be stripped. Elasticsearch 5.x+ provides an ingest node with pipelines to preprocess documents. The ingest‑attachment plugin extracts text from binary files.
./bin/elasticsearch-plugin install ingest-attachment

Define Ingest Pipeline
PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "content", "ignore_missing": true } },
    { "remove": { "field": "content" } }
  ]
}

Document Mapping
PUT /docwrite
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "name": { "type": "text", "analyzer": "ik_max_word" },
      "type": { "type": "keyword" },
      "attachment": {
        "properties": {
          "content": { "type": "text", "analyzer": "ik_smart" }
        }
      }
    }
  }
}

After defining the pipeline and mapping, files are converted to Base64 and indexed.
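As a minimal sketch of that step (the document ID and payload are illustrative; "aGVsbG8gd29ybGQ=" is "hello world" in Base64), a file can be pushed through the pipeline with the pipeline query parameter:

```
PUT /docwrite/_doc/1?pipeline=attachment
{
  "id": "1",
  "name": "hello.txt",
  "type": "txt",
  "content": "aGVsbG8gd29ybGQ="
}
```

The attachment processor decodes the Base64 field, extracts the text into attachment.content, and the remove processor then drops the raw Base64 to keep the stored document small.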
Keyword Query
Elasticsearch’s default analyzer splits Chinese text into individual characters, which is not ideal for search. The IK analyzer provides two modes: ik_max_word (finest‑grained segmentation) and ik_smart (coarser, smart segmentation). Using ik_smart yields whole‑word tokens such as “进口” (import) and “红酒” (red wine) instead of single characters.
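Segmentation is easy to check with the _analyze API (assuming the IK plugin from the next step is installed; the sample text is chosen to match the tokens above):

```
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "进口红酒"
}
```

Swapping the analyzer for ik_max_word in the same request shows the finer‑grained token set for comparison.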
Install IK Analyzer
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/...your_version...

Search Example
GET /docwrite/_search
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "实验一",
        "analyzer": "ik_smart"
      }
    }
  },
  "highlight": {
    "fields": { "attachment.content": {} },
    "pre_tags": ["<em>"],
    "post_tags": ["</em>"]
  }
}

The response includes highlighted snippets wherever the searched terms appear.
Encoding and Java Integration
Using IDEA + Maven, add the Elasticsearch high‑level REST client dependency matching the Elasticsearch version:
<dependency>
  <groupId>org.elasticsearch.client</groupId>
  <artifactId>elasticsearch-rest-high-level-client</artifactId>
  <version>7.9.1</version>
</dependency>

A FileObj class stores file metadata and Base64 content. The readFile method reads a file, encodes it, and returns a FileObj instance.
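The upload and search snippets below also assume a RestHighLevelClient pointed at the cluster; a minimal sketch (host and port are assumptions for a local single‑node setup):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

// Build a client against an assumed local node at localhost:9200.
RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http")));
```

The client is thread‑safe and should be closed with client.close() when the application shuts down.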
public class FileObj {
    String id;      // file id
    String name;    // file name
    String type;    // pdf, word, txt
    String content; // Base64-encoded file content
    // getters and setters omitted for brevity
}

public FileObj readFile(String path) throws IOException {
    File file = new File(path);
    FileObj fileObj = new FileObj();
    fileObj.setName(file.getName());
    fileObj.setType(file.getName().substring(file.getName().lastIndexOf(".") + 1));
    byte[] bytes = Files.readAllBytes(file.toPath()); // java.nio.file.Files
    fileObj.setContent(Base64.getEncoder().encodeToString(bytes));
    return fileObj;
}

Upload uses IndexRequest with the pipeline:
public void upload(FileObj file) throws IOException {
    IndexRequest indexRequest = new IndexRequest("fileindex");
    indexRequest.source(JSON.toJSONString(file), XContentType.JSON);
    indexRequest.setPipeline("attachment");
    IndexResponse response = client.index(indexRequest, RequestOptions.DEFAULT);
    System.out.println(response);
}

Search uses SearchRequest with the IK analyzer:
SearchSourceBuilder srb = new SearchSourceBuilder();
srb.query(QueryBuilders.matchQuery("attachment.content", keyword).analyzer("ik_smart"));
searchRequest.source(srb);

Multi‑File Testing
A demo folder containing various file types is uploaded and visualized via elasticsearch‑head. Search results demonstrate correct extraction and highlighting.
Remaining Issues
By default the ingest‑attachment processor indexes at most 100,000 characters per document (the indexed_chars setting), so longer content is silently truncated; handling very large documents needs further investigation.
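One way to lift the limit is to redefine the pipeline with the processor's indexed_chars option; setting it to -1 removes the cap, at the cost of more memory per document:

```
PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "content", "indexed_chars": -1, "ignore_missing": true } },
    { "remove": { "field": "content" } }
  ]
}
```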
Reading entire files into memory can cause OOM for very large files; streaming or chunked processing should be considered for production.
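For the memory issue, a pure‑JDK sketch of the chunked alternative (the helper name and chunk size are illustrative): Base64.getEncoder().wrap streams the encoding, so in real use the chunks could come from an InputStream instead of one big array.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class StreamingBase64 {
    // Feed the encoder in fixed-size chunks through a wrapping stream,
    // so the whole file never has to sit in memory at once.
    static String encodeChunked(byte[] data, int chunkSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (OutputStream b64 = Base64.getEncoder().wrap(out)) {
            for (int off = 0; off < data.length; off += chunkSize) {
                int len = Math.min(chunkSize, data.length - off);
                b64.write(data, off, len); // encoder buffers partial triplets across writes
            }
        } // closing flushes the final bytes and padding
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        String streamed = encodeChunked(data, 8192);
        String direct = Base64.getEncoder().encodeToString(data);
        System.out.println(streamed.equals(direct)); // true: identical to the one-shot encoding
    }
}
```

The same wrapping-stream idea applies when writing the encoded payload directly into an HTTP request body rather than a String.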
The article concludes with additional resources and links to related open‑source projects.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, as well as architecture evolution with internet technologies. Idea‑driven, sharing‑minded architects are welcome to exchange and learn together.