How to Implement Full-Text Search for Word, PDF, and TXT Files with Elasticsearch

This guide explains how to upload Word, PDF, and TXT files, extract their text with an Elasticsearch ingest pipeline, index the content with suitable analyzers, and run keyword searches with highlighting. It includes Java code examples and the required configuration steps.

Programmer DD

Elasticsearch Overview

Elasticsearch is an open‑source search engine built on Apache Lucene that exposes a REST API for indexing and querying documents. It wraps Lucene’s complexity and adds distributed storage, making it suitable for full‑text search across various file types.

Development Environment

Install Elasticsearch, Kibana, and elasticsearch‑head. Ensure Kibana’s version matches the Elasticsearch version.

Core Problems

The two main challenges are file upload (including preprocessing for PDF/Word) and keyword querying.

File Upload

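Plain text files can be indexed directly, but PDF and Word files are binary formats whose formatting and metadata must be stripped to get at the text. Since version 5.x, Elasticsearch supports ingest nodes, and the ingest-attachment plugin (which relies on Apache Tika) extracts text from these formats. Install the plugin on each node, then restart Elasticsearch: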

./bin/elasticsearch-plugin install ingest-attachment

Define Ingest Pipeline

PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "content", "ignore_missing": true } },
    { "remove": { "field": "content" } }
  ]
}

Define Index Mapping

PUT /docwrite
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "name": { "type": "text", "analyzer": "ik_max_word" },
      "type": { "type": "keyword" },
      "attachment": {
        "properties": {
          "content": { "type": "text", "analyzer": "ik_smart" }
        }
      }
    }
  }
}

Encoding Files in Java

Read the file, convert its bytes to Base64 (the attachment processor expects Base64-encoded input), and store the result in a FileObj object.

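import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Base64;

public class FileObj {
    private String id;      // file id
    private String name;    // file name
    private String type;    // pdf, word, or txt
    private String content; // Base64-encoded file content

    // readFile calls the setters; JSON serializers read the getters.
    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public String getType() { return type; }
    public void setType(String type) { this.type = type; }
    public String getContent() { return content; }
    public void setContent(String content) { this.content = content; }
}

public FileObj readFile(String path) throws IOException {
    File file = new File(path);
    FileObj fileObj = new FileObj();
    fileObj.setName(file.getName());
    // Everything after the last dot, e.g. "pdf" or "docx"
    fileObj.setType(file.getName().substring(file.getName().lastIndexOf(".") + 1));
    // Assumption: the article's getContent(file) helper is not shown;
    // Files.readAllBytes stands in for it here.
    byte[] bytes = Files.readAllBytes(file.toPath());
    fileObj.setContent(Base64.getEncoder().encodeToString(bytes));
    return fileObj;
}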
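
The FileObj comment lists the types as pdf, word, or txt, while the substring call above returns the raw extension (for example "docx"). A minimal sketch of one way to normalize the value is shown below. The class name FileTypes, the method typeOf, and the "unknown" fallback are hypothetical and not part of the original code.

```java
import java.util.Locale;

// Hypothetical helper (not from the original article): maps a file name
// to one of the type labels mentioned in FileObj's comment.
class FileTypes {
    static String typeOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        // No dot, or a trailing dot, means there is no usable extension.
        if (dot < 0 || dot == fileName.length() - 1) {
            return "unknown";
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "doc":
            case "docx":
                return "word"; // both Word extensions share one label
            case "pdf":
            case "txt":
                return ext;
            default:
                return "unknown"; // assumed fallback for unsupported files
        }
    }
}
```

With such a helper, readFile could call fileObj.setType(FileTypes.typeOf(file.getName())) so that the keyword field only ever holds a small, predictable set of values.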

Uploading to Elasticsearch

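public void upload(FileObj file) throws IOException {
    // The index name must match the index created in the mapping step
    IndexRequest indexRequest = new IndexRequest("docwrite");
    // Serialize the object to JSON; the Base64 payload travels in the "content" field
    indexRequest.source(JSON.toJSONString(file), XContentType.JSON);
    // Run the document through the "attachment" ingest pipeline before indexing
    indexRequest.setPipeline("attachment");
    IndexResponse response = client.index(indexRequest, RequestOptions.DEFAULT);
    System.out.println(response);
}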
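
At the REST level, the pipeline is selected with the pipeline query parameter on the index request. The following sketch builds (without sending) such a request using the JDK's java.net.http API, which requires Java 11 or later. The class name IndexRequestSketch, the base URL, and the sample body are assumptions for illustration only.

```java
import java.net.URI;
import java.net.http.HttpRequest;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: shows the shape of the HTTP request behind an
// index call that uses an ingest pipeline. Nothing is sent here.
class IndexRequestSketch {
    // Builds POST {baseUrl}/{index}/_doc?pipeline={pipeline} with a JSON body.
    static HttpRequest build(String baseUrl, String index, String pipeline, String jsonBody) {
        URI uri = URI.create(baseUrl + "/" + index + "/_doc?pipeline=" + pipeline);
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(jsonBody, StandardCharsets.UTF_8))
                .build();
    }
}
```

To actually send it, the request can be passed to HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString()) against a running cluster.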

Keyword Query

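Use the ik_smart analyzer to split Chinese text into meaningful tokens, and enable highlighting on the extracted content. Both ik_smart and ik_max_word come from the IK analysis plugin, which must be installed separately before the mapping above can be created.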

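// Assumption: search the index created in the mapping step
SearchRequest searchRequest = new SearchRequest("docwrite");
SearchSourceBuilder srb = new SearchSourceBuilder();
// Match the keyword against the extracted text, analyzed with ik_smart
srb.query(QueryBuilders.matchQuery("attachment.content", keyword).analyzer("ik_smart"));

// Wrap each matched term in <em>...</em> in the returned fragments
HighlightBuilder hb = new HighlightBuilder();
hb.field(new HighlightBuilder.Field("attachment.content"));
hb.preTags("<em>");
hb.postTags("</em>");
srb.highlighter(hb);

searchRequest.source(srb);
SearchResponse response = client.search(searchRequest, RequestOptions.DEFAULT);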
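
For reference, the builders above correspond to a JSON search body with a match query and a highlight section. The sketch below assembles that body by hand so it can be inspected or pasted into Kibana. The class name SearchBodySketch and its methods are hypothetical, and in real code a JSON library would be a safer choice than string concatenation.

```java
// Hypothetical sketch: produces the JSON body equivalent to the
// match query plus highlighter built with the Java client above.
class SearchBodySketch {
    // Escapes the characters that are not allowed raw inside a JSON string.
    static String escapeJson(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"') {
                sb.append("\\\"");
            } else if (c == '\\') {
                sb.append("\\\\");
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Body for POST /{index}/_search: match on attachment.content with <em> highlighting.
    static String searchBody(String keyword) {
        return "{"
            + "\"query\":{\"match\":{\"attachment.content\":{"
            + "\"query\":\"" + escapeJson(keyword) + "\",\"analyzer\":\"ik_smart\"}}},"
            + "\"highlight\":{\"pre_tags\":[\"<em>\"],\"post_tags\":[\"</em>\"],"
            + "\"fields\":{\"attachment.content\":{}}}"
            + "}";
    }
}
```

The highlighted fragments come back per hit under the highlight key, keyed by field name.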

Testing and Multi‑File Upload

Upload multiple files, view them in Kibana or elasticsearch‑head, and run search queries to verify that extracted text is searchable and highlighted correctly.

Remaining Issues

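By default, the attachment processor extracts at most 100 000 characters per document, so longer content is truncated. The limit comes from the processor's indexed_chars option (default 100000); setting it to -1 removes the limit, at the cost of more memory during extraction and larger indexed documents.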
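
The attachment processor accepts an indexed_chars option that controls how many characters are extracted, and -1 means no limit. A sketch of a pipeline body with that option set is shown below; the class name PipelineBodySketch and the method attachmentPipeline are hypothetical, and the remaining fields mirror the pipeline defined earlier.

```java
// Hypothetical sketch: generates the attachment pipeline definition
// with an explicit indexed_chars value.
class PipelineBodySketch {
    // indexedChars = -1 asks the processor to extract text without a character limit.
    static String attachmentPipeline(int indexedChars) {
        return "{"
            + "\"description\":\"Extract attachment information\","
            + "\"processors\":["
            + "{\"attachment\":{\"field\":\"content\",\"ignore_missing\":true,"
            + "\"indexed_chars\":" + indexedChars + "}},"
            + "{\"remove\":{\"field\":\"content\"}}"
            + "]}";
    }
}
```

The resulting string can be sent as the body of PUT /_ingest/pipeline/attachment to replace the existing definition.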

Reading entire files into memory can cause out‑of‑memory errors for very large files; streaming or chunked processing may be required in production.
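
One way to reduce memory pressure on the client side is to Base64-encode the file as a stream instead of holding both the raw bytes and the encoded string in memory. The sketch below writes the encoded form to a second file; the class name StreamingBase64 and the use of an intermediate file are assumptions, not something the original code does.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical sketch: encodes a file to Base64 without loading it fully into memory.
class StreamingBase64 {
    // Copies src to dst, Base64-encoding the bytes in small chunks on the way.
    static void encode(Path src, Path dst) throws IOException {
        try (InputStream in = Files.newInputStream(src);
             OutputStream out = Base64.getEncoder().wrap(Files.newOutputStream(dst))) {
            // transferTo (Java 9+) copies through an internal fixed-size buffer;
            // closing the wrapped stream writes the final Base64 padding.
            in.transferTo(out);
        }
    }
}
```

This only addresses the encoding step. The full Base64 payload still travels in a single index request, so it has to stay within the HTTP request size limit configured on the cluster (http.max_content_length).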

Written by Programmer DD, a tinkering programmer and author of "Spring Cloud Microservices in Action"

0 followers
Reader feedback

How this landed with the community

Sign in to like

Rate this article

Was this worth your time?

Sign in to rate
Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.