How to Build a Spring Boot File Upload Service with Elasticsearch Text Extraction and Search

This guide walks through creating a Spring Boot backend that accepts PDF, Word, and TXT uploads, extracts their text using Elasticsearch's ingest‑attachment plugin, stores metadata in MySQL, and provides fuzzy search and highlighted results via Elasticsearch queries.

Java High-Performance Architecture

Requirement

The product needs a feature that lets users upload PDF, Word, or TXT files, run fuzzy searches against file names or file content, and view the files online.

Environment

Backend: Spring Boot + MyBatis‑Plus + MySQL + Elasticsearch

Search engine: Elasticsearch 7.9.3 with Kibana UI

Implementation Steps

1. Set up the environment

Elasticsearch and Kibana installation is omitted; ensure the Java Elasticsearch client version matches the server version.

2. File content recognition

Install the Ingest Attachment Processor Plugin so that Elasticsearch can extract text from attachments:

elasticsearch-plugin install ingest-attachment

When using Docker, install the plugin inside the container (here the container is named es):

docker exec -it es bash
cd bin/
elasticsearch-plugin install ingest-attachment

After installation, restart Elasticsearch.

3. Create an ingest pipeline

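The pipeline reads the Base64-encoded content field, writes the extracted text and metadata under attachment, and then removes the raw content field so the Base64 payload is not stored. The service code later refers to this pipeline by the name attachment.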

{
  "description": "Extract attachment information",
  "processors": [
    {"attachment": {"field": "content", "ignore_missing": true}},
    {"remove": {"field": "content"}}
  ]
}
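
One way to register this definition is to PUT it to the _ingest/pipeline/{name} endpoint. The sketch below does so with the JDK's java.net.http client (Java 11+). The class name PipelineSetup, the base URL http://localhost:9200, and the lack of authentication are assumptions for illustration; the pipeline name attachment matches the one the service code passes to setPipeline.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PipelineSetup {
    // Pipeline body from the step above, as a compact JSON string.
    static final String PIPELINE_JSON =
        "{"
        + "\"description\":\"Extract attachment information\","
        + "\"processors\":["
        + "{\"attachment\":{\"field\":\"content\",\"ignore_missing\":true}},"
        + "{\"remove\":{\"field\":\"content\"}}"
        + "]}";

    // Builds (but does not send) the PUT request that registers the pipeline.
    static HttpRequest buildRequest(String baseUrl, String pipelineName) {
        return HttpRequest.newBuilder()
            .uri(URI.create(baseUrl + "/_ingest/pipeline/" + pipelineName))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(PIPELINE_JSON))
            .build();
    }

    // Sends the request and returns the HTTP status code.
    // Requires a reachable cluster; add credentials if security is enabled.
    static int register(HttpClient client, String baseUrl, String pipelineName) throws Exception {
        HttpResponse<String> response =
            client.send(buildRequest(baseUrl, pipelineName), HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }

    public static void main(String[] args) {
        // Only prints the request line; call register(...) against a running cluster.
        HttpRequest request = buildRequest("http://localhost:9200", "attachment");
        System.out.println(request.method() + " " + request.uri());
    }
}
```

The same body can also be sent from the Kibana Dev Tools console; the Java form is shown only to stay in the language used by the rest of this guide.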

4. Define the index mapping

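The mapping specifies field types and a custom analyzer, my_ana, built on the jieba_index tokenizer for Chinese text. This assumes a Jieba analysis plugin is installed and that the stopword and synonym files referenced below exist under the Elasticsearch config directory. The Java code later uses fileinfo as the index name.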

{
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},
      "fileName": {"type": "text", "analyzer": "my_ana"},
      "contentType": {"type": "text", "analyzer": "my_ana"},
      "fileUrl": {"type": "text"},
      "attachment": {
        "properties": {
          "content": {"type": "text", "analyzer": "my_ana"}
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "jieba_stop": {"type": "stop", "stopwords_path": "stopword/stopwords.txt"},
        "jieba_synonym": {"type": "synonym", "synonyms_path": "synonym/synonyms.txt"}
      },
      "analyzer": {
        "my_ana": {
          "tokenizer": "jieba_index",
          "filter": ["lowercase", "jieba_stop", "jieba_synonym"]
        }
      }
    }
  }
}
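
Note: Content searches must target the attachment.content field, which is where the pipeline writes the extracted text, and that field must be mapped with an analyzer as shown above; otherwise the content will not be searchable.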

5. Test the pipeline

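Convert a file to Base64 and index it through the pipeline. The field names must match the mapping, and content must hold the Base64-encoded file bytes rather than plain text:

{
  "id": "1",
  "fileName": "进口红酒",
  "fileType": "pdf",
  "contentType": "文章",
  "content": "<Base64-encoded file content>"
}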

Use an online Base64 converter (e.g., https://www.zhangxinxu.com/sp/base64.html) for the file content.
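
For local files, the same conversion can be done with the JDK alone. The following is a minimal sketch; the class name Base64FileEncoder and the temporary sample file are illustrative assumptions, not part of the original project.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

public class Base64FileEncoder {
    // Reads the whole file into memory and returns its Base64 representation.
    // Suitable for small test files; large files need a streaming approach.
    static String encode(Path file) throws IOException {
        return Base64.getEncoder().encodeToString(Files.readAllBytes(file));
    }

    public static void main(String[] args) throws IOException {
        // Demo with a throwaway temp file so the sketch runs anywhere.
        Path sample = Files.createTempFile("sample", ".txt");
        try {
            Files.write(sample, "hello".getBytes(StandardCharsets.UTF_8));
            System.out.println(encode(sample)); // prints aGVsbG8=
        } finally {
            Files.deleteIfExists(sample);
        }
    }
}
```

The returned string is what goes into the content field of the test document.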

6. Query uploaded files

Search the indexed documents to verify the result; highlighting is applied later in the Java search service. An abridged response:

{
  "took": 861,
  "hits": {
    "total": {"value": 5, "relation": "eq"},
    "hits": [
      {
        "_source": {
          "fileName": "测试_20220809164145A002.docx",
          "attachment": {"content": "内容"},
          "fileUrl": "http://localhost:8092/fileInfo/profile/upload/fileInfo/2022/08/09/测试_20220809164145A002.docx",
          "contentType": "文章",
          "fileType": "docx"
        }
      }
      // ... other hits ...
    ]
  }
}
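
The request that produces such a response is not shown above. As a sketch, the body built below mirrors the bool query and highlight settings used later in the search service and would be sent to fileinfo/_search. The class name SearchBodySketch, the hand-rolled JSON escaping, and minimum_should_match set to 1 are assumptions for illustration.

```java
public class SearchBodySketch {
    // Minimal JSON string escaping: quotes, backslashes, and control characters.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c == '"' || c == '\\') {
                sb.append('\\').append(c);
            } else if (c < 0x20) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds a _search body that matches the keyword against the file name
    // or the extracted text and highlights both fields.
    static String searchBody(String keyword) {
        String k = escape(keyword);
        return "{"
            + "\"query\":{\"bool\":{"
            + "\"should\":["
            + "{\"match_bool_prefix\":{\"fileName\":\"" + k + "\"}},"
            + "{\"match_bool_prefix\":{\"attachment.content\":\"" + k + "\"}}"
            + "],"
            + "\"minimum_should_match\":1}},"
            + "\"highlight\":{"
            + "\"pre_tags\":[\"<span style='color:red'>\"],"
            + "\"post_tags\":[\"</span>\"],"
            + "\"fields\":{\"fileName\":{},\"attachment.content\":{}}}"
            + "}";
    }

    public static void main(String[] args) {
        System.out.println(searchBody("红酒"));
    }
}
```

With highlighting enabled, each hit carries a highlight section next to _source, which is what the Java search service reads through getHighlightFields().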

Code Overview

The following snippets illustrate the main components.

YAML configuration

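# Devtools and Elasticsearch client configuration
spring:
  devtools:
    restart:
      enabled: true
  elasticsearch:
    rest:
      # Custom property (host only), read by ElasticsearchConfig through @Value
      url: 127.0.0.1
      # Standard Spring Boot property (host:port)
      uris: 127.0.0.1:9200
      connection-timeout: 1000
      read-timeout: 3000
      # Example credentials; replace with your own
      username: elastic
      password: 123456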

Elasticsearch client bean

package com.yj.rselasticsearch.domain.config;

import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.time.Duration;

@Configuration
public class ElasticsearchConfig {
    @Value("${spring.elasticsearch.rest.url}")
    private String edUrl;
    @Value("${spring.elasticsearch.rest.username}")
    private String userName;
    @Value("${spring.elasticsearch.rest.password}")
    private String password;

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        BasicCredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(userName, password));
        RestHighLevelClient client = new RestHighLevelClient(RestClient.builder(
                new HttpHost(edUrl, 9200, "http"))
                .setHttpClientConfigCallback(httpClientBuilder -> {
                    httpClientBuilder.disableAuthCaching();
                    httpClientBuilder.setKeepAliveStrategy((response, context) -> Duration.ofMinutes(5).toMillis());
                    return httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider);
                })
        );
        return client;
    }
}

Entity class

package com.yj.common.core.domain.entity;

import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;

import java.util.Date;

@Document(indexName = "fileinfo", createIndex = false)
public class FileInfo {
    @Field(name = "id", type = FieldType.Integer)
    private Integer id;

    @Field(name = "fileName", type = FieldType.Text, analyzer = "jieba_index", searchAnalyzer = "jieba_index")
    private String fileName;

    @Field(name = "fileType", type = FieldType.Keyword)
    private String fileType;

    @Field(name = "contentType", type = FieldType.Text)
    private String contentType;

    @Field(name = "attachment.content", type = FieldType.Text, analyzer = "jieba_index", searchAnalyzer = "jieba_index")
    private String content;

    @Field(name = "fileUrl", type = FieldType.Text)
    private String fileUrl;

    private Date createTime;
    private Date updateTime;

    // Getters and setters omitted for brevity (the service code calls them).
}

Controller for file upload

package com.yj.rselasticsearch.controller;

import com.yj.common.core.controller.BaseController;
import com.yj.common.core.domain.AjaxResult;
import com.yj.common.core.domain.entity.FileInfo;
import com.yj.rselasticsearch.service.FileInfoService;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import javax.annotation.Resource;

@RestController
@RequestMapping("/fileInfo")
public class FileInfoController extends BaseController {
    @Resource
    private FileInfoService fileInfoService;

    @PutMapping("uploadFile")
    public AjaxResult uploadFile(String contentType, MultipartFile file) {
        return fileInfoService.uploadFileInfo(contentType, file);
    }
}

Service implementation (upload & index)

package com.yj.rselasticsearch.service.impl;

import com.alibaba.fastjson.JSON;
import com.yj.common.core.domain.AjaxResult;
import com.yj.common.utils.file.FileUploadUtils;
import com.yj.rselasticsearch.domain.entity.FileInfo;
import com.yj.rselasticsearch.mapper.FileInfoMapper;
import com.yj.rselasticsearch.service.FileInfoService;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

import javax.annotation.Resource;
import java.util.Base64;

@Service
public class FileInfoServiceImpl implements FileInfoService {
    @Autowired
    @Qualifier("restHighLevelClient")
    private RestHighLevelClient client;

    @Resource
    private FileInfoMapper fileInfoMapper;

    @Override
    public AjaxResult uploadFileInfo(String contentType, MultipartFile file) {
        // Validate parameters
        if (contentType == null || file == null) {
            return AjaxResult.error("Request parameters must not be empty");
        }
        try {
            // Read the bytes before storing the file: some MultipartFile
            // implementations cannot be read again after a transfer.
            byte[] bytes = file.getBytes();

            String filePath = "/upload/fileInfo"; // simplified path
            String fileName = FileUploadUtils.upload(filePath, file);
            // Extension without the dot, e.g. "pdf"
            String extension = fileName.substring(fileName.lastIndexOf('.') + 1);
            String url = "http://localhost:8092" + "/fileInfo/" + fileName;

            // Store the file metadata in MySQL
            FileInfo fileInfo = new FileInfo();
            fileInfo.setFileName(fileName);
            fileInfo.setFileType(extension);
            fileInfo.setFileUrl(url);
            fileInfo.setContentType(contentType);
            fileInfoMapper.insertSelective(fileInfo);

            // Base64-encode the content so the ingest pipeline can extract text from it
            fileInfo.setContent(Base64.getEncoder().encodeToString(bytes));

            // Index through the "attachment" pipeline, which replaces "content"
            // with the extracted "attachment.content"
            IndexRequest request = new IndexRequest("fileinfo");
            request.source(JSON.toJSONString(fileInfo), XContentType.JSON);
            request.setPipeline("attachment");
            IndexResponse response = client.index(request, RequestOptions.DEFAULT);
            return AjaxResult.success(fileInfo);
        } catch (Exception e) {
            return AjaxResult.error(e.getMessage());
        }
    }
}
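
The requirement limits uploads to PDF, Word, and TXT, but the service above does not check the file type. A minimal sketch of an extension allow-list follows; the class name UploadTypeCheck and the exact set of extensions are assumptions, and an extension check alone does not verify the actual file content.

```java
import java.util.Locale;
import java.util.Set;

public class UploadTypeCheck {
    // Assumed allow-list derived from the stated requirement (PDF, Word, TXT).
    static final Set<String> ALLOWED = Set.of("pdf", "doc", "docx", "txt");

    // Returns the lower-cased extension without the dot, or "" if there is none.
    static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    static boolean isAllowed(String fileName) {
        return ALLOWED.contains(extensionOf(fileName));
    }

    public static void main(String[] args) {
        System.out.println(isAllowed("report.PDF"));  // true
        System.out.println(isAllowed("archive.zip")); // false
        System.out.println(isAllowed("README"));      // false
    }
}
```

In the upload method, such a check would run on file.getOriginalFilename() right after the null checks, before anything is written to disk or indexed.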

Search service (highlight & suggestion)

package com.yj.rselasticsearch.service.impl;

import com.baomidou.mybatisplus.core.metadata.IPage;
import com.yj.rselasticsearch.domain.dto.WarningInfoDto;
import com.yj.rselasticsearch.domain.entity.FileInfo;
import org.elasticsearch.index.query.BoolQueryBuilder;
import org.elasticsearch.index.query.QueryBuilders;
import org.elasticsearch.search.fetch.subphase.highlight.HighlightBuilder;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.data.domain.PageRequest;
import org.springframework.data.domain.Pageable;
import org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate;
import org.springframework.data.elasticsearch.core.SearchHits;
import org.springframework.data.elasticsearch.core.query.NativeSearchQuery;
import org.springframework.data.elasticsearch.core.query.NativeSearchQueryBuilder;
import org.springframework.stereotype.Service;

import java.util.*;
import java.util.stream.Collectors;

@Service
public class ElasticsearchServiceImpl implements ElasticsearchService {
    @Autowired
    private ElasticsearchRestTemplate elasticsearchRestTemplate;

    @Override
    public List<String> getAssociationalWordOther(WarningInfoDto dto) {
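        // Require the "should" clause to match; without minimumShouldMatch,
        // the "must" clause alone would be enough to return a document.
        BoolQueryBuilder qb = QueryBuilders.boolQuery()
                .should(QueryBuilders.matchBoolPrefixQuery("fileName", dto.getKeyword()))
                .minimumShouldMatch(1)
                .must(QueryBuilders.termsQuery("contentType", dto.getContentType()));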
        NativeSearchQuery query = new NativeSearchQueryBuilder()
                .withQuery(qb)
                .withHighlightFields(new HighlightBuilder.Field("fileName"))
                .withHighlightBuilder(new HighlightBuilder().preTags("<span style='color:red'>").postTags("</span>"))
                .build();
        SearchHits<FileInfo> hits = elasticsearchRestTemplate.search(query, FileInfo.class);
        List<String> suggestions = new ArrayList<>();
        for (var hit : hits) {
            Map<String, List<String>> hl = hit.getHighlightFields();
            if (hl.get("fileName") != null) {
                suggestions.add(hl.get("fileName").get(0));
            }
        }
        return suggestions.stream().distinct().limit(9).collect(Collectors.toList());
    }

    @Override
    public IPage<FileInfo> queryHighLightWordOther(WarningInfoDto dto) {
        Pageable pageable = PageRequest.of(dto.getPageIndex() - 1, dto.getPageSize());
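        // Match the keyword against the file name or the extracted text.
        // minimumShouldMatch(1) keeps documents that only satisfy the
        // contentType clause out of the results.
        BoolQueryBuilder qb = QueryBuilders.boolQuery()
                .should(QueryBuilders.matchBoolPrefixQuery("fileName", dto.getKeyword()))
                .should(QueryBuilders.matchBoolPrefixQuery("attachment.content", dto.getKeyword()))
                .minimumShouldMatch(1)
                .must(QueryBuilders.termsQuery("contentType", dto.getContentType()));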
        NativeSearchQuery query = new NativeSearchQueryBuilder()
                .withQuery(qb)
                .withHighlightFields(new HighlightBuilder.Field("fileName"), new HighlightBuilder.Field("attachment.content"))
                .withHighlightBuilder(new HighlightBuilder().preTags("<span style='color:red'>").postTags("</span>"))
                .withPageable(pageable)
                .build();
        SearchHits<FileInfo> hits = elasticsearchRestTemplate.search(query, FileInfo.class);
        List<FileInfo> results = new ArrayList<>();
        for (var hit : hits) {
            Map<String, List<String>> hl = hit.getHighlightFields();
            FileInfo fi = hit.getContent();
            if (hl.get("fileName") != null) {
                fi.setFileName(hl.get("fileName").get(0));
            }
            if (hl.get("attachment.content") != null) {
                fi.setContent(hl.get("attachment.content").get(0));
            }
            results.add(fi);
        }
        IPage<FileInfo> page = new com.baomidou.mybatisplus.extension.plugins.pagination.Page<>();
        page.setCurrent(dto.getPageIndex());
        page.setSize(dto.getPageSize());
        page.setTotal(hits.getTotalHits());
        page.setRecords(results);
        return page;
    }
}

The implementation demonstrates uploading files, converting them to Base64, indexing with Elasticsearch's attachment pipeline, and performing fuzzy, highlighted searches on both file names and extracted content.
