
Using Elasticsearch for File Upload, Indexing, and Keyword Search with Ingest Attachment Plugin

This article explains how to implement file upload, download, and precise keyword search for Word, PDF, and txt documents with Elasticsearch. It covers environment setup, ingest-attachment preprocessing, index mapping, Java code for uploading and querying, Chinese analysis with the IK analyzer, and highlighting of results.


The requirement is to support uploading and downloading files (Word, PDF, txt) and to enable precise keyword search within the file contents. Elasticsearch is chosen as the core search engine because it provides a simple REST API and powerful indexing capabilities.

Elasticsearch is an open-source search engine built on Apache Lucene. It wraps Lucene to offer distributed storage and RESTful APIs. Companion tools such as Kibana and elasticsearch-head provide visual interfaces for managing clusters.

Development environment: install Elasticsearch, elasticsearch-head, and Kibana. All three are "out-of-the-box" tools, but their versions must match (e.g., Elasticsearch 7.9.1 with Kibana 7.9.1).

The core problems are file upload and keyword query. Plain text files are straightforward, but PDF and Word files contain extra metadata that must be stripped before indexing.

Elasticsearch 5.x+ offers ingest nodes that can run a pipeline to preprocess documents before indexing. The ingest-attachment plugin (built on Apache Tika) extracts text from binary files such as PDF and Word documents. Install it with:

./bin/elasticsearch-plugin install ingest-attachment

Define an ingest pipeline named attachment:

PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "content", "ignore_missing": true } },
    { "remove": { "field": "content" } }
  ]
}
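Before wiring up any client code, the pipeline can be verified with Elasticsearch's simulate API. A sketch, where "aGVsbG8=" is the Base64 encoding of the ASCII text "hello":

```
POST /_ingest/pipeline/attachment/_simulate
{
  "docs": [
    { "_source": { "content": "aGVsbG8=" } }
  ]
}
```

The response shows the extracted attachment.content and metadata without indexing anything.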

Create an index with a mapping that includes the attachment field and uses the Chinese IK analyzer for full‑text search:

PUT /docwrite
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "name": { "type": "text", "analyzer": "ik_max_word" },
      "type": { "type": "keyword" },
      "attachment": {
        "properties": {
          "content": { "type": "text", "analyzer": "ik_smart" }
        }
      }
    }
  }
}
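As a quick sanity check, a document can be indexed through the pipeline directly from the Kibana console before writing any Java. A sketch, where the document id is arbitrary and "aGVsbG8=" is the Base64 encoding of "hello":

```
PUT /docwrite/_doc/1?pipeline=attachment
{
  "id": "1",
  "name": "hello.txt",
  "type": "txt",
  "content": "aGVsbG8="
}
```

Fetching the document afterwards (GET /docwrite/_doc/1) should show the extracted text under attachment.content, with the original content field removed by the pipeline.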

Before indexing, files must be Base64-encoded, because Elasticsearch documents are JSON and cannot carry raw binary data. Convert the file to Base64, place the encoded string in the content field, and index the document through the pipeline:

IndexRequest indexRequest = new IndexRequest("docwrite"); // index created with the mapping above
indexRequest.source(JSON.toJSONString(fileObj), XContentType.JSON);
indexRequest.setPipeline("attachment");
client.index(indexRequest, RequestOptions.DEFAULT);
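The encoding step itself uses only the JDK. A minimal, self-contained sketch, with an in-memory byte array standing in for real file contents:

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Base64Content {
    // Encode raw file bytes into the Base64 string the "content" field expects.
    public static String encode(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    public static void main(String[] args) {
        byte[] raw = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(encode(raw)); // prints "aGVsbG8="
    }
}
```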

Keyword search uses the IK analyzer to obtain meaningful tokens. The default Unicode tokenizer splits Chinese characters individually, which is not desired. Installing the IK analyzer plugin enables two modes:

ik_max_word – splits into the maximum number of tokens.

ik_smart – splits according to common usage (e.g., "进口红酒" becomes "进口" and "红酒").
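The difference between the two modes can be inspected with the analyze API; a sketch using the example phrase above:

```
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "进口红酒"
}
```

With ik_smart this should yield the tokens 进口 and 红酒, while switching the analyzer to ik_max_word produces a larger, more fine-grained token set.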

Install the IK analyzer:

./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/.../elasticsearch-analysis-ik-7.9.1.zip

Search example using ik_smart and highlighting:

SearchSourceBuilder srb = new SearchSourceBuilder();
srb.query(QueryBuilders.matchQuery("attachment.content", "实验一").analyzer("ik_smart"));
HighlightBuilder hb = new HighlightBuilder();
HighlightBuilder.Field hf = new HighlightBuilder.Field("attachment.content");
hb.field(hf);
hb.preTags("<em>");   // Elasticsearch's default highlight tags
hb.postTags("</em>");
srb.highlighter(hb);
searchRequest.source(srb);

Java helper classes for file handling:

public class FileObj {
  private String id;      // file id
  private String name;    // file name
  private String type;    // pdf, word, txt
  private String content; // Base64-encoded file content
  // getters and setters omitted for brevity
}

public FileObj readFile(String path) throws IOException {
  File file = new File(path);
  FileObj obj = new FileObj();
  obj.setName(file.getName());
  obj.setType(file.getName().substring(file.getName().lastIndexOf('.') + 1));
  byte[] bytes = Files.readAllBytes(file.toPath());
  obj.setContent(Base64.getEncoder().encodeToString(bytes));
  return obj;
}

public void upload(FileObj file) throws IOException {
  IndexRequest req = new IndexRequest("docwrite"); // index created with the mapping above
  req.source(JSON.toJSONString(file), XContentType.JSON);
  req.setPipeline("attachment");
  client.index(req, RequestOptions.DEFAULT);
}

Search code using the IK analyzer:

SearchSourceBuilder srb = new SearchSourceBuilder();
srb.query(QueryBuilders.matchQuery("attachment.content", keyword).analyzer("ik_smart"));
searchRequest.source(srb);
SearchResponse resp = client.search(searchRequest, RequestOptions.DEFAULT);
for (SearchHit hit : resp.getHits()) {
  // process hit
}

Remaining challenges include the attachment processor's extraction limit (by default only the first 100,000 characters of a document are extracted) and the high memory consumption of loading large files entirely into memory, which may require streaming or chunked processing in production.
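Assuming the truncation comes from the attachment processor's indexed_chars setting (100,000 characters by default), the limit can be raised or disabled when defining the pipeline. A sketch, where -1 removes the limit at the cost of more heap during extraction:

```
PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "content", "indexed_chars": -1, "ignore_missing": true } },
    { "remove": { "field": "content" } }
  ]
}
```

Note that raising the limit does not address the memory cost of Base64-encoding large files in the client.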

Tags: Java, Elasticsearch, File Upload, IK Analyzer, Ingest Attachment, Keyword Search
Written by

IT Architects Alliance

Discussion and exchange on system, internet, large‑scale distributed, high‑availability, and high‑performance architectures, as well as big data, machine learning, AI, and architecture adjustments with internet technologies. Includes real‑world large‑scale architecture case studies. Open to architects who have ideas and enjoy sharing.
