Using Elasticsearch for File Upload, Text Extraction, and Keyword Search with Ingest Pipelines and IK Analyzer
This tutorial explains how to leverage Elasticsearch to support file upload and download, preprocess PDF/Word/TXT files via ingest pipelines and the attachment processor, configure index mappings with Chinese IK analyzers, and perform accurate keyword searches with highlighting, all demonstrated with Java code examples.
The author, a senior architect, needs a system that can upload and download files and allow keyword searches inside Word, PDF, and TXT documents, and chooses Elasticsearch as the core technology.
Elasticsearch is an open‑source search engine built on Apache Lucene that provides a simple REST API for indexing and querying documents.
Two tools are used for visualization: Kibana for building and testing requests, and elasticsearch-head as a web UI for browsing the Elasticsearch API.
Development environment: install matching versions of Elasticsearch, Kibana, and elasticsearch-head; note that Kibana's version must correspond exactly to Elasticsearch's version.
Two core problems must be solved: (1) file upload with automatic content extraction, and (2) keyword queries with proper Chinese tokenization.
For file upload, the ingest-attachment plugin is installed (./bin/elasticsearch-plugin install ingest-attachment) and a pipeline is defined to extract text from attachments. The pipeline JSON is:
{
  "description": "Extract attachment information",
  "processors": [
    { "attachment": { "field": "content", "ignore_missing": true } },
    { "remove": { "field": "content" } }
  ]
}

An index mapping named docwrite is created with fields id, name (analyzed with ik_max_word), type, and an attachment object that stores the extracted text using the ik_smart analyzer.
Testing involves converting a file to Base64, creating a FileObj Java bean, and sending it to Elasticsearch via IndexRequest with setPipeline("attachment"). The document can then be retrieved with a GET request to verify successful extraction.
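Verification is a plain document GET in the console; the index name and document id below are illustrative:

```json
// Fetch the indexed document; if the pipeline ran, the response _source
// should contain an "attachment" object with the extracted "content"
// (and no raw Base64 "content" field, since the pipeline removed it).
GET fileindex/_doc/1
```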
For keyword search, the default standard analyzer splits Chinese text into individual characters, which is too fine-grained, so the Chinese IK analyzer is required. The IK plugin is installed (e.g., ./bin/elasticsearch-plugin install https://github.com/medcl/elasticsearch-analysis-ik/releases/download/.../elasticsearch-analysis-ik-7.9.1.zip) and offers two modes: ik_max_word (maximal segmentation) and ik_smart (smart segmentation). For the phrase "进口红酒" (imported red wine), ik_smart yields the tokens "进口" and "红酒".
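The two IK modes can be compared directly with the _analyze API; for "进口红酒", ik_smart should return the two tokens mentioned above, while ik_max_word typically returns a larger, overlapping set:

```json
GET _analyze
{
  "analyzer": "ik_smart",
  "text": "进口红酒"
}
```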
Search queries use SearchSourceBuilder with matchQuery("attachment.content", keyword).analyzer("ik_smart") . Highlighting is added via HighlightBuilder with pre‑tags <em> and post‑tags </em> to emphasize matched terms.
Key Java code snippets:
// Reads a file and packages its name, extension, and Base64-encoded bytes
// into a FileObj bean ready for indexing.
public FileObj readFile(String path) throws IOException {
    File file = new File(path);
    FileObj fileObj = new FileObj();
    fileObj.setName(file.getName());
    // Extension after the last dot, e.g. "pdf", "docx", "txt"
    fileObj.setType(file.getName().substring(file.getName().lastIndexOf(".") + 1));
    // getContent (helper not shown here) reads the file's bytes into memory
    byte[] bytes = getContent(file);
    String base64 = Base64.getEncoder().encodeToString(bytes);
    fileObj.setContent(base64);
    return fileObj;
}
// Serializes the bean to JSON and indexes it through the "attachment"
// ingest pipeline so the attachment processor extracts the text.
public void upload(FileObj file) throws IOException {
    IndexRequest indexRequest = new IndexRequest("fileindex");
    indexRequest.source(JSON.toJSONString(file), XContentType.JSON);
    indexRequest.setPipeline("attachment"); // run the ingest pipeline defined above
    IndexResponse response = client.index(indexRequest, RequestOptions.DEFAULT);
    System.out.println(response);
}
SearchSourceBuilder srb = new SearchSourceBuilder();
// Analyze the query keyword with ik_smart so its tokens match the indexed content
srb.query(QueryBuilders.matchQuery("attachment.content", keyword).analyzer("ik_smart"));
HighlightBuilder hb = new HighlightBuilder();
// Wrap matched terms in <em> tags so the front end can emphasize them
hb.field(new HighlightBuilder.Field("attachment.content")).preTags("<em>").postTags("</em>");
srb.highlighter(hb);

Remaining challenges include Elasticsearch's default limit of roughly 100 KB of extracted text per attachment and high memory consumption when reading large files entirely into memory, which may require streaming or chunked processing in production.
Top Architect
Top Architect focuses on sharing practical architecture knowledge, covering enterprise, system, website, large‑scale distributed, and high‑availability architectures, plus architecture adjustments using internet technologies. We welcome idea‑driven, sharing‑oriented architects to exchange and learn together.