Implementing File Upload and Keyword Search with Elasticsearch and Ingest Attachment Plugin
This article demonstrates how to use Elasticsearch, its ingest‑attachment plugin, and the IK analyzer to upload various file types, preprocess them, store them in an index, and perform accurate keyword searches with highlighting, providing complete Java code examples and configuration steps.
The author, a senior architect, explains why Elasticsearch is chosen for a project that requires file upload, download, and precise keyword search across Word, PDF, and TXT documents. Elasticsearch wraps Lucene, offers REST APIs, and supports plugins for advanced features.
Elasticsearch Overview
Elasticsearch is an open‑source search engine built on Lucene, providing distributed storage and RESTful APIs. Companion tools such as Kibana and elasticsearch‑head are used for visualization and management.
Development Environment
Install Elasticsearch, elasticsearch‑head, and Kibana. Ensure the Kibana version matches the Elasticsearch version. The default Elasticsearch port is 9200, and elasticsearch‑head runs on 9100.
Core Problems
The two main challenges are file upload and keyword query.
File Upload
Plain text files are straightforward, but PDF and Word files contain extra metadata that must be stripped. Elasticsearch 5.x+ provides an ingest node with pipelines to preprocess documents. The ingest‑attachment plugin extracts text from binary files.
./bin/elasticsearch-plugin install ingest-attachment

Define Ingest Pipeline
PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "content", "ignore_missing": true } },
    { "remove": { "field": "content" } }
  ]
}

Document Mapping
PUT /docwrite
{
  "mappings": {
    "properties": {
      "id": { "type": "keyword" },
      "name": { "type": "text", "analyzer": "ik_max_word" },
      "type": { "type": "keyword" },
      "attachment": {
        "properties": {
          "content": { "type": "text", "analyzer": "ik_smart" }
        }
      }
    }
  }
}

After defining the pipeline and mapping, files are converted to Base64 and indexed.
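As a minimal sketch of that step (the document ID and payload are illustrative; "aGVsbG8gd29ybGQ=" is "hello world" in Base64), a file can be pushed through the pipeline with the pipeline query parameter:

```
PUT /docwrite/_doc/1?pipeline=attachment
{
  "id": "1",
  "name": "hello.txt",
  "type": "txt",
  "content": "aGVsbG8gd29ybGQ="
}
```

The attachment processor decodes the Base64 field, extracts the text into attachment.content, and the remove processor then drops the raw Base64 to keep the stored document small.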
Keyword Query
Elasticsearch’s default analyzer splits Chinese text into individual characters, which is not ideal for search. The IK analyzer provides two modes: ik_max_word (finest‑grained segmentation) and ik_smart (coarser, smart segmentation). Using ik_smart yields whole‑word tokens such as “进口” (import) and “红酒” (red wine) instead of single characters.
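Segmentation is easy to check with the _analyze API (assuming the IK plugin from the next step is installed; the sample text is chosen to match the tokens above):

```
GET /_analyze
{
  "analyzer": "ik_smart",
  "text": "进口红酒"
}
```

Swapping the analyzer for ik_max_word in the same request shows the finer‑grained token set for comparison.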
Install IK Analyzer
./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/...your_version...

Search Example
GET /docwrite/_search
{
  "query": {
    "match": {
      "attachment.content": {
        "query": "实验一",
        "analyzer": "ik_smart"
      }
    }
  },
  "highlight": {
    "fields": { "attachment.content": {} },
    "pre_tags": ["<em>"],
    "post_tags": ["</em>"]
  }
}

The response includes highlighted snippets wherever the searched terms appear.
Encoding and Java Integration
Using IDEA + Maven, add the Elasticsearch high‑level REST client dependency matching the Elasticsearch version:
<dependency>
  <groupId>org.elasticsearch.client</groupId>
  <artifactId>elasticsearch-rest-high-level-client</artifactId>
  <version>7.9.1</version>
</dependency>

A FileObj class stores file metadata and Base64 content. The readFile method reads a file, encodes it, and returns a FileObj instance.
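The upload and search snippets below also assume a RestHighLevelClient pointed at the cluster; a minimal sketch (host and port are assumptions for a local single‑node setup):

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;

// Build a client against an assumed local node at localhost:9200.
RestHighLevelClient client = new RestHighLevelClient(
        RestClient.builder(new HttpHost("localhost", 9200, "http")));
```

The client is thread‑safe and should be closed with client.close() when the application shuts down.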
public class FileObj {
    String id;      // file id
    String name;    // file name
    String type;    // pdf, word, txt
    String content; // Base64-encoded file content
    // getters and setters omitted for brevity
}

public FileObj readFile(String path) throws IOException {
    File file = new File(path);
    FileObj fileObj = new FileObj();
    fileObj.setName(file.getName());
    fileObj.setType(file.getName().substring(file.getName().lastIndexOf(".") + 1));
    byte[] bytes = Files.readAllBytes(file.toPath()); // java.nio.file.Files
    fileObj.setContent(Base64.getEncoder().encodeToString(bytes));
    return fileObj;
}

Upload uses IndexRequest with the pipeline:
public void upload(FileObj file) throws IOException {
    IndexRequest indexRequest = new IndexRequest("fileindex");
    indexRequest.source(JSON.toJSONString(file), XContentType.JSON);
    indexRequest.setPipeline("attachment");
    IndexResponse response = client.index(indexRequest, RequestOptions.DEFAULT);
    System.out.println(response);
}

Search uses SearchRequest with the IK analyzer:
SearchSourceBuilder srb = new SearchSourceBuilder();
srb.query(QueryBuilders.matchQuery("attachment.content", keyword).analyzer("ik_smart"));
searchRequest.source(srb);

Multi‑File Testing
A demo folder containing various file types is uploaded and visualized via elasticsearch‑head. Search results demonstrate correct extraction and highlighting.
Remaining Issues
By default the ingest‑attachment processor indexes at most 100,000 characters per document (the indexed_chars setting), so longer content is silently truncated; handling very large documents needs further investigation.
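One way to lift the limit is to redefine the pipeline with the processor's indexed_chars option; setting it to -1 removes the cap, at the cost of more memory per document:

```
PUT /_ingest/pipeline/attachment
{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "content", "indexed_chars": -1, "ignore_missing": true } },
    { "remove": { "field": "content" } }
  ]
}
```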
Reading entire files into memory can cause OOM for very large files; streaming or chunked processing should be considered for production.
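For the memory issue, a pure‑JDK sketch of the chunked alternative (the helper name and chunk size are illustrative): Base64.getEncoder().wrap streams the encoding, so in real use the chunks could come from an InputStream instead of one big array.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class StreamingBase64 {
    // Feed the encoder in fixed-size chunks through a wrapping stream,
    // so the whole file never has to sit in memory at once.
    static String encodeChunked(byte[] data, int chunkSize) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (OutputStream b64 = Base64.getEncoder().wrap(out)) {
            for (int off = 0; off < data.length; off += chunkSize) {
                int len = Math.min(chunkSize, data.length - off);
                b64.write(data, off, len); // encoder buffers partial triplets across writes
            }
        } // closing flushes the final bytes and padding
        return new String(out.toByteArray(), StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = new byte[100_000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) i;
        String streamed = encodeChunked(data, 8192);
        String direct = Base64.getEncoder().encodeToString(data);
        System.out.println(streamed.equals(direct)); // true: identical to the one-shot encoding
    }
}
```

The same wrapping-stream idea applies when writing the encoded payload directly into an HTTP request body rather than a String.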
The article concludes with additional resources and links to related open‑source projects.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, as well as architecture evolution with internet technologies. Idea‑driven, sharing‑minded architects are welcome to exchange and learn together.