
Implementing File Upload and Text Extraction with Elasticsearch Ingest Attachment Plugin in Spring Boot

This tutorial explains how to let users upload PDF, Word, or TXT files, install the Elasticsearch Ingest Attachment Processor Plugin, create an ingest pipeline and index mapping, convert files to Base64, and perform fuzzy searches with highlighted results using Spring Boot and Java code examples.

Code Ape Tech Column

Hello everyone, I'm Chen, the author.

The product requires a feature that lets users upload PDF, Word, or TXT files, run a fuzzy search against the file name or the file content, and view the content online.

Environment

Project development environment:

Backend management system: Spring Boot + MyBatis-Plus + MySQL + Elasticsearch

Search engine: Elasticsearch 7.9.3 with Kibana UI

Implementation Steps

1. Set up environment

Elasticsearch and Kibana installation is omitted; ensure the Java Elasticsearch client version matches the ES version.

2. File content recognition

Install the Ingest Attachment Processor Plugin to extract text from attachments.

elasticsearch-plugin install ingest-attachment

When using Docker, install the plugin inside the container:

[root@... ]# docker exec -it es bash
... 
elasticsearch-plugin install ingest-attachment
...

After installation, restart Elasticsearch.

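3. Create an ingest pipeline

Register the following definition as a pipeline named attachment, for example with PUT _ingest/pipeline/attachment in the Kibana console. The attachment processor extracts text from the Base64-encoded content field into attachment.content, and the remove processor then drops the raw Base64 so it is not stored in the index. Setting ignore_missing on both processors lets documents without a content field pass through instead of failing.

{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "content",
        "ignore_missing": true
      }
    },
    {
      "remove": {
        "field": "content",
        "ignore_missing": true
      }
    }
  ]
}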
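
The pipeline can also be registered from code. The following is a minimal sketch, assuming the pipeline name attachment, a node at http://127.0.0.1:9200, and Java 11 or later; it uses the JDK HttpClient instead of the RestHighLevelClient only to stay dependency-free. The class PipelineInstaller and its methods are hypothetical names, not part of the original project.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper; names and defaults are assumptions for illustration.
final class PipelineInstaller {
    // Pipeline body from this step; ignore_missing is set on remove as well
    // so that documents without a content field do not fail.
    static final String PIPELINE_JSON =
        "{\"description\":\"Extract attachment information\",\"processors\":["
      + "{\"attachment\":{\"field\":\"content\",\"ignore_missing\":true}},"
      + "{\"remove\":{\"field\":\"content\",\"ignore_missing\":true}}]}";

    // Builds PUT {baseUrl}/_ingest/pipeline/{name} with HTTP basic auth.
    static HttpRequest buildRequest(String baseUrl, String name, String user, String password) {
        String token = Base64.getEncoder()
            .encodeToString((user + ":" + password).getBytes(StandardCharsets.UTF_8));
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_ingest/pipeline/" + name))
            .header("Content-Type", "application/json")
            .header("Authorization", "Basic " + token)
            .PUT(HttpRequest.BodyPublishers.ofString(PIPELINE_JSON))
            .build();
    }

    // Sends the request and returns the HTTP status; requires a reachable node.
    static int install(String baseUrl, String name, String user, String password) throws Exception {
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(buildRequest(baseUrl, name, user, password), HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

Calling PipelineInstaller.install("http://127.0.0.1:9200", "attachment", "elastic", "123456") against a running node should register the pipeline; the credentials here simply mirror the sample application.yml.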

4. Define the index mapping

{
  "mappings": {
    "properties": {
      "id": {"type": "keyword"},
      "fileName": {"type": "text", "analyzer": "my_ana"},
      "contentType": {"type": "text", "analyzer": "my_ana"},
      "fileUrl": {"type": "text"},
      "attachment": {
        "properties": {
          "content": {"type": "text", "analyzer": "my_ana"}
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "filter": {
        "jieba_stop": {"type": "stop", "stopwords_path": "stopword/stopwords.txt"},
        "jieba_synonym": {"type": "synonym", "synonyms_path": "synonym/synonyms.txt"}
      },
      "analyzer": {
        "my_ana": {
          "tokenizer": "jieba_index",
          "filter": ["lowercase", "jieba_stop", "jieba_synonym"]
        }
      }
    }
  }
}

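Note: the extracted text lands in attachment.content, so that is the field to search, and it must be mapped as an analyzed text field. The jieba_index tokenizer is not built into Elasticsearch; it comes from a third-party Jieba analysis plugin that must be installed separately, and the stopword and synonym files referenced above must exist under the Elasticsearch config directory.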

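5. Test indexing

Index a test document through the pipeline, for example with PUT fileinfo/_doc/1?pipeline=attachment in the Kibana console. The field names should match the mapping above, and content must be Base64, because the attachment processor decodes it before extracting text. In the sample below, the content value is the string "Article content" encoded as Base64.

{
  "id": "1",
  "fileName": "Imported Red Wine",
  "fileType": "pdf",
  "contentType": "article",
  "content": "QXJ0aWNsZSBjb250ZW50"
}

For a quick manual test, a file can be converted to Base64 with an online tool (e.g., https://www.zhangxinxu.com/sp/base64.html).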
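
For repeated tests it is usually easier to do the conversion in code. A minimal sketch using only java.util.Base64 from the JDK follows; the class name Base64Files is hypothetical, and reading the whole file into memory is an assumption that only suits small files.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

// Hypothetical helper for producing the value of the "content" field.
final class Base64Files {
    // Encodes raw bytes with the standard Base64 alphabet (not the URL-safe one).
    static String encode(byte[] bytes) {
        return Base64.getEncoder().encodeToString(bytes);
    }

    // Reads the whole file into memory and encodes it.
    static String encodeFile(Path file) {
        try {
            return encode(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The string returned by Base64Files.encodeFile(Path.of("some.pdf")) can be pasted directly into the content field of the test document; the file name here is only a placeholder.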

Code

Key configuration and implementation files are shown below.

application.yml

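# Elasticsearch connection settings
spring:
  devtools:
    restart:
      enabled: true
  elasticsearch:
    rest:
      # Custom property (host only), read by ElasticsearchConfig via @Value
      url: 127.0.0.1
      # Spring Boot property used by the auto-configured REST client
      uris: 127.0.0.1:9200
      connection-timeout: 1000
      read-timeout: 3000
      username: elastic
      password: 123456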

ElasticsearchConfig.java

package com.yj.rselasticsearch.domain.config;

import org.apache.http.HttpHost;
import org.apache.http.auth.AuthScope;
import org.apache.http.auth.UsernamePasswordCredentials;
import org.apache.http.impl.client.BasicCredentialsProvider;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.time.Duration;

@Configuration
public class ElasticsearchConfig {
    // Host only; the port and scheme are fixed in restHighLevelClient() below
    @Value("${spring.elasticsearch.rest.url}")
    private String esUrl;
    @Value("${spring.elasticsearch.rest.username}")
    private String userName;
    @Value("${spring.elasticsearch.rest.password}")
    private String password;

    @Bean
    public RestHighLevelClient restHighLevelClient() {
        // HTTP basic auth credentials applied to every request
        final BasicCredentialsProvider credentialsProvider = new BasicCredentialsProvider();
        credentialsProvider.setCredentials(AuthScope.ANY, new UsernamePasswordCredentials(userName, password));
        return new RestHighLevelClient(RestClient.builder(
                new HttpHost(esUrl, 9200, "http"))
                .setHttpClientConfigCallback(httpClientBuilder -> {
                    httpClientBuilder.disableAuthCaching();
                    // Keep idle connections alive for five minutes
                    httpClientBuilder.setKeepAliveStrategy((response, context) -> Duration.ofMinutes(5).toMillis());
                    return httpClientBuilder.setDefaultCredentialsProvider(credentialsProvider);
                }));
    }
}

FileInfo entity

package com.yj.common.core.domain.entity;

import com.baomidou.mybatisplus.annotation.TableField;
import lombok.Getter;
import lombok.Setter;
import org.springframework.data.elasticsearch.annotations.Document;
import org.springframework.data.elasticsearch.annotations.Field;
import org.springframework.data.elasticsearch.annotations.FieldType;

import java.util.Date;

@Setter
@Getter
@Document(indexName = "fileinfo", createIndex = false)
public class FileInfo {
    @Field(name = "id", type = FieldType.Integer)
    private Integer id;

    @Field(name = "fileName", type = FieldType.Text, analyzer = "jieba_index", searchAnalyzer = "jieba_index")
    private String fileName;

    @Field(name = "fileType", type = FieldType.Keyword)
    private String fileType;

    @Field(name = "contentType", type = FieldType.Text)
    private String contentType;

    @Field(name = "attachment.content", type = FieldType.Text, analyzer = "jieba_index", searchAnalyzer = "jieba_index")
    @TableField(exist = false)
    private String content;

    @Field(name = "fileUrl", type = FieldType.Text)
    private String fileUrl;

    private Date createTime;
    private Date updateTime;
}

FileInfoController.java

package com.yj.rselasticsearch.controller;

import com.yj.common.core.controller.BaseController;
import com.yj.common.core.domain.AjaxResult;
import com.yj.rselasticsearch.service.FileInfoService;
import org.springframework.web.bind.annotation.*;
import org.springframework.web.multipart.MultipartFile;

import javax.annotation.Resource;

@RestController
@RequestMapping("/fileInfo")
public class FileInfoController extends BaseController {
    @Resource
    private FileInfoService fileInfoService;

    @PutMapping("uploadFile")
    public AjaxResult uploadFile(String contentType, MultipartFile file) {
        return fileInfoService.uploadFileInfo(contentType, file);
    }
}

FileInfoServiceImpl.java (excerpt)

package com.yj.rselasticsearch.service.impl;

import com.alibaba.fastjson.JSON;
import com.yj.common.core.domain.AjaxResult;
import com.yj.rselasticsearch.mapper.FileInfoMapper;
import com.yj.rselasticsearch.service.FileInfoService;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.action.index.IndexResponse;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

import javax.annotation.Resource;
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Base64;

@Service
public class FileInfoServiceImpl implements FileInfoService {
    @Resource
    private FileInfoMapper fileInfoMapper;
    @Resource
    private RestHighLevelClient client;

    @Override
    public AjaxResult uploadFileInfo(String contentType, MultipartFile file) {
        // Upload file, convert to Base64, index into ES with pipeline "attachment"
        // (implementation omitted for brevity)
        return AjaxResult.success();
    }

    // Reads the whole file into memory; suitable for small uploads only
    private byte[] getContent(File file) throws IOException {
        return Files.readAllBytes(file.toPath());
    }
}
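
Since the body of uploadFileInfo is omitted, here is a rough sketch of how the document source for the indexing step might be assembled. It assumes the field names from the mapping in step 4 and a content field that the attachment pipeline consumes; the class AttachmentDocuments and the method buildSource are hypothetical and not taken from the original project.

```java
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; field names follow the index mapping in step 4.
final class AttachmentDocuments {
    // Builds the document source. "content" carries the Base64 payload that
    // the ingest pipeline turns into attachment.content and then removes.
    static Map<String, Object> buildSource(String id, String fileName, String contentType,
                                           String fileUrl, byte[] fileBytes) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("id", id);
        source.put("fileName", fileName);
        source.put("contentType", contentType);
        source.put("fileUrl", fileUrl);
        source.put("content", Base64.getEncoder().encodeToString(fileBytes));
        return source;
    }
}
```

With the RestHighLevelClient bean from ElasticsearchConfig, such a map could presumably be sent as new IndexRequest("fileinfo").id(id).source(source).setPipeline("attachment") via client.index(request, RequestOptions.DEFAULT), with fileBytes taken from MultipartFile.getBytes() in the upload path. Whether the author's omitted code does exactly this is not stated in the article.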

ElasticsearchServiceImpl.java (highlight search excerpt)

// Methods getAssociationalWordOther and queryHighLightWordOther implement fuzzy search
// with highlighted results using NativeSearchQueryBuilder and ElasticsearchRestTemplate.

In testing, the request and response JSON show a fuzzy search returning matching documents with the matched keywords highlighted.
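
The search methods themselves are not shown, so the following is only a sketch of the kind of request body a highlighted search against fileinfo/_search might send. It assumes a multi_match query over fileName and attachment.content with em tags for highlighting; the article does not say which query type the author used, and the class HighlightQueries is a hypothetical name.

```java
// Hypothetical helper that builds a search body as a JSON string.
final class HighlightQueries {
    // Escapes the characters that would break a JSON string literal.
    static String escapeJson(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"':  out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
                    else out.append(c);
            }
        }
        return out.toString();
    }

    // Matches the keyword against both fields and asks Elasticsearch to wrap
    // matched fragments in <em> tags.
    static String buildSearchBody(String keyword) {
        return "{\"query\":{\"multi_match\":{\"query\":\"" + escapeJson(keyword) + "\","
             + "\"fields\":[\"fileName\",\"attachment.content\"]}},"
             + "\"highlight\":{\"pre_tags\":[\"<em>\"],\"post_tags\":[\"</em>\"],"
             + "\"fields\":{\"fileName\":{},\"attachment.content\":{}}}}";
    }
}
```

The same query could be expressed with NativeSearchQueryBuilder and a HighlightBuilder, as the excerpt comment suggests; building the raw body makes it easy to paste into the Kibana console while debugging.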

Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
