How to Implement Full-Text Search for Word, PDF, and TXT Files with Elasticsearch
This guide explains how to upload Word, PDF, and TXT files, extract their text with an Elasticsearch ingest pipeline, index the content with suitable analyzers, and run keyword searches with highlighting. It includes Java code examples and the required configuration steps.
Elasticsearch Overview
Elasticsearch is an open‑source search engine built on Apache Lucene that exposes a REST API for indexing and querying documents. It wraps Lucene’s complexity and adds distributed storage, making it suitable for full‑text search across various file types.
Development Environment
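Install Elasticsearch, Kibana, and elasticsearch-head, and make sure the Kibana version matches the Elasticsearch version. The index mapping in this guide uses the ik_max_word and ik_smart analyzers, which are provided by the IK analysis plugin (elasticsearch-analysis-ik), so install that plugin as well.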
Core Problems
The two main challenges are file upload (including preprocessing for PDF/Word) and keyword querying.
File Upload
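Plain text files can be indexed directly, but PDF and Word files are binary formats whose text must be extracted before it can be indexed. Since version 5.0, Elasticsearch has supported ingest nodes, and the ingest-attachment plugin (built on Apache Tika) adds an attachment processor that extracts text from these formats. Install the plugin on every node that runs ingest pipelines:
./bin/elasticsearch-plugin install ingest-attachment
Define Ingest Pipeline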
PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "content", "ignore_missing": true } },
    { "remove": { "field": "content" } }
  ]
}
Define Index Mapping
PUT /docwrite
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "name": { "type": "text", "analyzer": "ik_max_word" },
      "type": { "type": "keyword" },
      "attachment": {
        "properties": {
          "content": { "type": "text", "analyzer": "ik_smart" }
        }
      }
    }
  }
}
Encoding Files in Java
Read a file, convert its bytes to Base64, and store the result in a FileObj object.
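import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Base64;

public class FileObj {
    private String id;      // file id
    private String name;    // file name
    private String type;    // file extension, e.g. pdf, docx, or txt
    private String content; // Base64-encoded file content

    // Accessors used by readFile and by the JSON serializer during upload
    public String getId() { return id; }           public void setId(String id) { this.id = id; }
    public String getName() { return name; }       public void setName(String name) { this.name = name; }
    public String getType() { return type; }       public void setType(String type) { this.type = type; }
    public String getContent() { return content; } public void setContent(String content) { this.content = content; }
}

// Reads the file at path and returns its name, extension, and Base64-encoded bytes.
public FileObj readFile(String path) throws IOException {
    File file = new File(path);
    FileObj fileObj = new FileObj();
    fileObj.setName(file.getName());
    fileObj.setType(file.getName().substring(file.getName().lastIndexOf(".") + 1));
    // Loads the whole file into memory before encoding
    byte[] bytes = Files.readAllBytes(file.toPath());
    String base64 = Base64.getEncoder().encodeToString(bytes);
    fileObj.setContent(base64);
    return fileObj;
}
Uploading to Elasticsearch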
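// client is assumed to be an already-configured RestHighLevelClient;
// JSON.toJSONString matches the fastjson serializer API
public void upload(FileObj file) throws IOException {
    // Target the index whose mapping was defined above
    IndexRequest indexRequest = new IndexRequest("docwrite");
    indexRequest.source(JSON.toJSONString(file), XContentType.JSON);
    // Run the document through the "attachment" ingest pipeline before indexing
    indexRequest.setPipeline("attachment");
    IndexResponse response = client.index(indexRequest, RequestOptions.DEFAULT);
    System.out.println(response);
}
Keyword Query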
Use the IK smart analyzer to split Chinese text into meaningful tokens and enable highlighting.
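// searchRequest is assumed to be a SearchRequest created for the index that holds the uploaded documents
SearchSourceBuilder srb = new SearchSourceBuilder();
// Match the keyword against the extracted text, tokenizing the query with ik_smart
srb.query(QueryBuilders.matchQuery("attachment.content", keyword).analyzer("ik_smart"));
// Wrap each matching term in <em> tags in the returned fragments
HighlightBuilder hb = new HighlightBuilder();
HighlightBuilder.Field hf = new HighlightBuilder.Field("attachment.content");
hb.field(hf);
hb.preTags("<em>");
hb.postTags("</em>");
srb.highlighter(hb);
searchRequest.source(srb);
Testing and Multi‑File Upload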
Upload multiple files, view them in Kibana or elasticsearch‑head, and run search queries to verify that extracted text is searchable and highlighted correctly.
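For the multi-file case, one possible approach is to collect every supported file in a directory and hand each one to the upload code. The sketch below covers only the collection step and uses the standard library alone. The class name SupportedFileCollector and the extension list (pdf, doc, docx, txt) are assumptions made for illustration, not part of the original code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper: finds the files in a directory that are worth uploading.
final class SupportedFileCollector {

    /** Returns the regular files directly inside dir with a supported extension, sorted by path. */
    static List<Path> collect(Path dir) throws IOException {
        // Files.list holds an open directory handle, so close it with try-with-resources.
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                .filter(Files::isRegularFile)
                .filter(p -> isSupported(p.getFileName().toString()))
                .sorted()
                .collect(Collectors.toList());
        }
    }

    /** True when the file name ends in .pdf, .doc, .docx, or .txt (case-insensitive). */
    static boolean isSupported(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return false; // no extension at all
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return ext.equals("pdf") || ext.equals("doc") || ext.equals("docx") || ext.equals("txt");
    }
}
```

Each returned path could then be passed through the readFile and upload methods shown earlier, for example upload(readFile(path.toString())) inside a loop.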
Remaining Issues
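By default, the attachment processor extracts only the first 100 000 characters of each file, so longer content is truncated. The limit is set by the processor's indexed_chars option; a value of -1 removes it, which should be used with care because very large documents put more load on the ingest node.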
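The attachment processor accepts an indexed_chars option that controls how many characters it extracts (100 000 by default, -1 for no limit). As a minimal sketch, the helper below builds a request body for PUT /_ingest/pipeline/attachment with that option set. The class name AttachmentPipelineBody is hypothetical, and the body mirrors the pipeline defined earlier with only indexed_chars added.

```java
// Hypothetical helper: builds the JSON body for PUT /_ingest/pipeline/attachment.
final class AttachmentPipelineBody {

    /** indexedChars is passed to the attachment processor; -1 means "no limit". */
    static String build(int indexedChars) {
        return "{\n"
            + "  \"description\": \"Extract attachment information\",\n"
            + "  \"processors\": [\n"
            + "    { \"attachment\": { \"field\": \"content\", \"ignore_missing\": true, "
            + "\"indexed_chars\": " + indexedChars + " } },\n"
            + "    { \"remove\": { \"field\": \"content\" } }\n"
            + "  ]\n"
            + "}";
    }

    public static void main(String[] args) {
        // Print the body so it can be pasted into Kibana Dev Tools.
        System.out.println(build(-1));
    }
}
```

Removing the limit makes every character searchable but increases extraction time and index size, so a larger fixed value may be a safer choice than -1 for untrusted uploads.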
Reading entire files into memory can cause out‑of‑memory errors for very large files; streaming or chunked processing may be required in production.
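One way to avoid holding both the raw bytes and the Base64 string in memory is to encode while copying. The sketch below uses the encoding stream returned by java.util.Base64's Encoder.wrap and writes to any OutputStream, such as a temporary file or a request body. The class name StreamingBase64 and the 8 KB buffer size are assumptions for illustration. Elasticsearch still receives the whole document in a single request, so this only reduces memory use on the client side.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper: Base64-encodes a file without loading it fully into memory.
final class StreamingBase64 {

    /** Copies the file at source into sink as Base64 text, one buffer at a time. */
    static void encode(Path source, OutputStream sink) throws IOException {
        // The wrapped stream encodes bytes as they are written. Closing it writes
        // the final padding and also closes the underlying sink.
        try (InputStream in = Files.newInputStream(source);
             OutputStream encoder = Base64.getEncoder().wrap(sink)) {
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1) {
                encoder.write(buffer, 0, read);
            }
        }
    }
}
```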
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"