Build a File Upload & Search System with Elasticsearch and IK Analyzer
This guide walks through creating a file upload service that indexes Word, PDF, and TXT files in Elasticsearch, uses an ingest‑attachment pipeline to extract text, configures Chinese IK analyzers for precise keyword search, and demonstrates Java code for indexing, querying, and highlighting results.
Elasticsearch Overview
Elasticsearch is an open‑source search engine built on top of Apache Lucene. It exposes a REST API, allowing you to send keyword queries and receive matching documents.
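As a rough illustration of what that REST interaction looks like, the sketch below builds a keyword (match) query with the JDK's built-in java.net.http client (Java 11+). The address http://localhost:9200, the index name docwrite, and the name field are assumptions borrowed from later sections, and the class and method names are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class RestSearchSketch {

    // Builds the JSON body of a simple match query on one field.
    // Only backslashes and double quotes are escaped, which is enough for a sketch.
    static String matchQueryBody(String field, String keyword) {
        String escaped = keyword.replace("\\", "\\\\").replace("\"", "\\\"");
        return "{\"query\":{\"match\":{\"" + field + "\":\"" + escaped + "\"}}}";
    }

    // Wraps the body in a POST request against the index's _search endpoint.
    static HttpRequest searchRequest(String baseUrl, String index, String field, String keyword) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index + "/_search"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(matchQueryBody(field, keyword)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumes an unsecured local node; adjust the URL for your setup.
        HttpRequest request = searchRequest("http://localhost:9200", "docwrite", "name", "report");
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body()); // raw JSON returned by the node
    }
}
```

Running main requires a reachable Elasticsearch node; the two helper methods only build the request and can be inspected without one.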
Development Environment
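Install Elasticsearch, the elasticsearch-head UI, and Kibana. All three run out of the box, but the Kibana version must match the Elasticsearch version. The IK analyzers used in the mapping below come from the separate elasticsearch-analysis-ik plugin, which must also be installed in a version that matches Elasticsearch.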
Core Requirements
Support file upload and download.
Search files by keyword, including the text inside Word, PDF, and TXT documents.
Why Elasticsearch
Elasticsearch provides fast full‑text search and a powerful ingest node that can preprocess documents before indexing.
Ingest Node & Attachment Processor
From Elasticsearch 5.x onward, the ingest node can run pipelines. The ingest-attachment plugin extracts text from binary files (PDF, Word, etc.) during indexing.
./bin/elasticsearch-plugin install ingest-attachment
Define the Attachment Pipeline
PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    {"attachment": {"field": "content", "ignore_missing": true}},
    {"remove": {"field": "content"}}
  ]
}
Document Mapping
Create an index (e.g., docwrite) with a mapping that includes an attachment field for the extracted text and a name field analyzed by the IK analyzer.
PUT /docwrite
{
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},
      "name": {"type": "text", "analyzer": "ik_max_word"},
      "type": {"type": "keyword"},
      "attachment": {
        "properties": {
          "content": {"type": "text", "analyzer": "ik_smart"}
        }
      }
    }
  }
}
Encoding Files
Read a file, convert its bytes to Base64, and store the result in a FileObj Java class.
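public class FileObj {
    String id;      // file id
    String name;    // file name
    String type;    // file extension, e.g. pdf, docx, txt
    String content; // Base64‑encoded file content
    // Getters and setters for all four fields are omitted for brevity;
    // readFile below and JSON serialization rely on them.
}
// Requires java.io.File, java.io.IOException, java.nio.file.Files, java.util.Base64
public FileObj readFile(String path) throws IOException {
    File file = new File(path);
    FileObj obj = new FileObj();
    obj.setName(file.getName());
    // Extension after the last dot; a name without a dot yields the whole name
    obj.setType(file.getName().substring(file.getName().lastIndexOf('.') + 1));
    // Loads the entire file into memory before encoding
    byte[] bytes = Files.readAllBytes(file.toPath());
    obj.setContent(Base64.getEncoder().encodeToString(bytes));
    return obj;
}
Indexing Files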
Use the high‑level REST client to send an IndexRequest with the JSON representation of FileObj and specify the attachment pipeline.
public void upload(FileObj file) throws IOException {
    // Target the docwrite index created above so its IK mapping applies
    IndexRequest req = new IndexRequest("docwrite");
    req.source(JSON.toJSONString(file), XContentType.JSON);
    // Run the document through the attachment pipeline before it is indexed
    req.setPipeline("attachment");
    IndexResponse resp = client.index(req, RequestOptions.DEFAULT);
    System.out.println(resp);
}
Searching Files
Build a SearchRequest that queries attachment.content using the IK smart analyzer, and optionally add highlighting.
SearchSourceBuilder srb = new SearchSourceBuilder();
srb.query(QueryBuilders.matchQuery("attachment.content", keyword)
        .analyzer("ik_smart"));
// Wrap matched terms in attachment.content with <em> tags
HighlightBuilder hb = new HighlightBuilder();
HighlightBuilder.Field hf = new HighlightBuilder.Field("attachment.content");
hb.field(hf);
hb.preTags("<em>");
hb.postTags("</em>");
srb.highlighter(hb);
// searchRequest is the SearchRequest for the target index
searchRequest.source(srb);
Multi‑File Testing
Upload a folder containing various file types and verify the indexed documents via the elasticsearch-head UI.
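One way to collect the files for such a test is to walk the folder and keep only the supported types. The sketch below uses only the JDK; the class name, the method names, and the exact extension set are illustrative assumptions and not part of the original service.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;
import java.util.stream.Stream;

final class FolderScanSketch {

    // Extensions this guide deals with (assumed set); extend as needed.
    static final Set<String> SUPPORTED = Set.of("pdf", "doc", "docx", "txt");

    // Returns the lower-cased extension, or "" when the name has no dot.
    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot < 0 ? "" : name.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Recursively lists regular files under root whose extension is supported.
    static List<Path> findUploadCandidates(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> SUPPORTED.contains(extensionOf(p)))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Path.of(args.length > 0 ? args[0] : ".");
        for (Path p : findUploadCandidates(root)) {
            // In the service, each path would be passed to readFile(...) and then upload(...).
            System.out.println(p);
        }
    }
}
```

Unlike the substring call in readFile, extensionOf returns an empty string for names without a dot, so such files are skipped instead of being indexed with their whole name as the type.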
Remaining Issues
File length: by default the attachment processor extracts only the first 100,000 characters of a document (its indexed_chars setting); raise that limit in the pipeline, or set it to -1 for no limit, to index longer texts.
Memory usage: Reading whole files into memory can exhaust resources for very large files; streaming approaches should be considered.
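For the memory issue, the encoding step at least can be streamed. The following is a minimal sketch, assuming Java 9+ for InputStream.transferTo; the class name, the helper names, and the .b64 output path are hypothetical. It replaces Files.readAllBytes with a chunked copy through Base64.getEncoder().wrap(...).

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

final class StreamingBase64Sketch {

    // Copies the input to the sink as Base64 in chunks,
    // so the raw bytes are never held in memory all at once.
    static void encode(InputStream in, OutputStream sink) throws IOException {
        // Closing the wrapper writes the final padding; it also closes the sink.
        try (OutputStream b64 = Base64.getEncoder().wrap(sink)) {
            in.transferTo(b64);
        }
    }

    // Convenience for small inputs: same streaming path with an in-memory sink.
    static String encodeToString(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            encode(new ByteArrayInputStream(data), sink);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toString(); // Base64 output is plain ASCII
    }

    public static void main(String[] args) throws IOException {
        Path source = Path.of(args[0]);          // file to encode
        Path target = Path.of(args[0] + ".b64"); // hypothetical output location
        try (InputStream in = Files.newInputStream(source);
             OutputStream out = Files.newOutputStream(target)) {
            encode(in, out);
        }
    }
}
```

This only addresses the encoding step. The index request still has to carry the encoded content, so very large files may also need the request body itself to be streamed or the text to be extracted before it reaches Elasticsearch.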
Programmer DD
A tinkering programmer and author of "Spring Cloud Microservices in Action"
