Essential ETL Techniques for Spring AI RAG – A Must‑Read Guide

This article explains how Spring AI implements the ETL pipeline for Retrieval‑Augmented Generation, detailing the three core components—DocumentReader, DocumentTransformer, and DocumentWriter—along with concrete code examples, configuration parameters, and processing steps for text, PDF, and Tika document sources.

Spring Full-Stack Practical Cases

Jun 6, 2026

ETL Pipeline Overview

In Retrieval‑Augmented Generation (RAG), an ETL (Extract, Transform, Load) pipeline moves data from raw sources into a vector store: it extracts text from the source files, transforms it into chunks suited to embedding and retrieval, and loads the result into the store.

Core Components

DocumentReader – extends Supplier<List<Document>>; reads raw files (PDF, text, HTML, etc.) and produces Document objects.

DocumentTransformer – extends Function<List<Document>, List<Document>>; converts or enriches documents (e.g., splitting, keyword extraction).

DocumentWriter – extends Consumer<List<Document>>; persists processed documents to a file or a vector store.
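
Because the three components are plain functional interfaces, a pipeline is just their composition. The following is a minimal JDK-only sketch of that contract, not Spring AI code: the Doc record, the splitByLine transformer, and the in-memory list standing in for a vector store are all hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

/** JDK-only sketch of the reader -> transformer -> writer contract. */
final class EtlPipelineSketch {

    /** Hypothetical stand-in for Spring AI's Document: text plus metadata. */
    record Doc(String text, Map<String, Object> metadata) {}

    /** Runs the three stages in order: extract, transform, load. */
    static void run(Supplier<List<Doc>> reader,
                    Function<List<Doc>, List<Doc>> transformer,
                    Consumer<List<Doc>> writer) {
        writer.accept(transformer.apply(reader.get()));
    }

    /** Example transformer: emits one Doc per non-blank line, keeping the metadata. */
    static List<Doc> splitByLine(List<Doc> docs) {
        List<Doc> out = new ArrayList<>();
        for (Doc doc : docs) {
            for (String line : doc.text().split("\n")) {
                if (!line.isBlank()) {
                    out.add(new Doc(line.strip(), doc.metadata()));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Doc> store = new ArrayList<>(); // in-memory stand-in for a vector store
        run(() -> List.of(new Doc("first line\nsecond line", Map.of("source", "demo.txt"))),
            EtlPipelineSketch::splitByLine,
            store::addAll);
        System.out.println(store.size() + " documents stored"); // prints "2 documents stored"
    }
}
```

With the real interfaces the shape is the same: the writer accepts whatever the transformer returns for the reader's output.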

DocumentReader Implementations

TextReaderComponent

@Component
public class TextReaderComponent implements SmartInitializingSingleton {
  private final Resource resource;
  public TextReaderComponent(@Value("classpath:spring-source.txt") Resource resource) {
    this.resource = resource;
  }
  public List<Document> loadText() {
    TextReader textReader = new TextReader(this.resource);
    // Custom metadata is copied onto the Document this reader produces.
    textReader.getCustomMetadata().put("filename", "spring-source.txt");
    return textReader.read();
  }
  @Override
  public void afterSingletonsInstantiated() {
    System.err.println(loadText());
  }
}

The reader creates a single Document containing the full file content. It attaches the charset and source metadata automatically, along with any custom entries added via getCustomMetadata().

PagePdfDocumentReader

Maven dependency:

org.springframework.ai:spring-ai-pdf-document-reader

@Component
public class PdfReaderComponent implements SmartInitializingSingleton {
  public List<Document> getDocsFromPdf() {
    PagePdfDocumentReader pdfReader = new PagePdfDocumentReader(
      "classpath:spring-source.pdf",
      PdfDocumentReaderConfig.builder()
        .withPageTopMargin(0)
        .withPageExtractedTextFormatter(
          ExtractedTextFormatter.builder()
            .withNumberOfTopTextLinesToDelete(0)
            .build())
        .withPagesPerDocument(1)
        .build());
    return pdfReader.read();
  }
  @Override
  public void afterSingletonsInstantiated() {
    System.err.println(getDocsFromPdf());
  }
}

Because withPagesPerDocument(1) is set, the component reads each PDF page as a separate Document.
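
To make the effect of the pages-per-document setting concrete, here is a small JDK-only sketch of the grouping idea. It is an illustration under stated assumptions, not the reader's implementation: the real reader extracts text from the PDF itself, whereas groupPages (a hypothetical helper) starts from page texts that are already extracted.

```java
import java.util.ArrayList;
import java.util.List;

/** JDK-only sketch: grouping extracted page texts into documents. */
final class PageGroupingSketch {

    /** Joins every run of pagesPerDocument consecutive pages into one document text. */
    static List<String> groupPages(List<String> pageTexts, int pagesPerDocument) {
        if (pagesPerDocument < 1) {
            throw new IllegalArgumentException("pagesPerDocument must be at least 1");
        }
        List<String> documents = new ArrayList<>();
        for (int start = 0; start < pageTexts.size(); start += pagesPerDocument) {
            int end = Math.min(start + pagesPerDocument, pageTexts.size());
            documents.add(String.join("\n", pageTexts.subList(start, end)));
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String> pages = List.of("page 1", "page 2", "page 3");
        System.out.println(groupPages(pages, 1).size()); // prints 3: one document per page
        System.out.println(groupPages(pages, 2).size()); // prints 2: the last document holds one page
    }
}
```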

TikaDocumentReader

Maven dependency:

org.springframework.ai:spring-ai-tika-document-reader

@Component
public class TikaReaderComponent implements SmartInitializingSingleton {
  private final Resource resource;
  public TikaReaderComponent(@Value("classpath:spring-source.docx") Resource resource) {
    this.resource = resource;
  }
  public List<Document> loadText() {
    TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(this.resource);
    return tikaDocumentReader.read();
  }
  @Override
  public void afterSingletonsInstantiated() {
    System.err.println(loadText());
  }
}

The reader extracts text from PDF, DOC/DOCX, PPT/PPTX, HTML and other formats supported by Apache Tika.

DocumentTransformer

TokenTextSplitter

List<Document> docs = new TikaDocumentReader(new ClassPathResource("spring-source.pdf"))
  .get().stream()
  .peek(document -> document.getMetadata().put("source", "spring-source.pdf"))
  .toList();
TextSplitter ts = new TokenTextSplitter(300, 200, 10, 5000, true);
docs = ts.apply(docs);
System.err.println(docs);

The five constructor arguments map, in order, to the following parameters (the example above passes 300, 200, 10, 5000, true):

defaultChunkSize: target number of tokens per chunk (default 800).

minChunkSizeChars: minimum number of characters per chunk (default 350).

minChunkLengthToEmbed: minimum length a chunk must exceed to be kept (default 5).

maxNumChunks: maximum number of chunks generated from one text (default 10000).

keepSeparator: whether to retain separators such as line breaks (default true).

The splitter processes each text as follows:

1. Encode the text into tokens using the CL100K_BASE encoding.

2. Take the next defaultChunkSize tokens as a candidate chunk.

3. Decode the candidate back to text and find the last breakpoint (., ?, !, or newline). If that breakpoint lies beyond minChunkSizeChars, truncate the chunk there. Trim the chunk, replacing line breaks with spaces when keepSeparator is false, and keep it only if its length exceeds minChunkLengthToEmbed.

4. Repeat steps 2 and 3 until all tokens are processed or maxNumChunks is reached.

5. If any leftover text is longer than minChunkLengthToEmbed, add it as the final chunk.
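
The loop described above can be sketched with the JDK alone. This is a simplified illustration, not the library's code: it counts characters instead of CL100K_BASE tokens (so chunkSize here is a character count), and the class and method names are hypothetical. The parameter names mirror the ones listed above.

```java
import java.util.ArrayList;
import java.util.List;

/** JDK-only sketch of breakpoint-aware chunking; characters stand in for tokens. */
final class ChunkingSketch {

    static List<String> split(String text, int chunkSize, int minChunkSizeChars,
                              int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator) {
        if (chunkSize < 1) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<String> chunks = new ArrayList<>();
        String remaining = text;
        int processed = 0;
        while (!remaining.isEmpty() && processed < maxNumChunks) {
            // Take the next window of at most chunkSize characters.
            String window = remaining.substring(0, Math.min(chunkSize, remaining.length()));
            // Prefer to end the chunk at the last sentence boundary inside the window.
            int lastBreak = Math.max(
                    Math.max(window.lastIndexOf('.'), window.lastIndexOf('?')),
                    Math.max(window.lastIndexOf('!'), window.lastIndexOf('\n')));
            if (lastBreak != -1 && lastBreak > minChunkSizeChars) {
                window = window.substring(0, lastBreak + 1);
            }
            String chunk = keepSeparator ? window.strip() : window.replace('\n', ' ').strip();
            if (chunk.length() > minChunkLengthToEmbed) {
                chunks.add(chunk);
            }
            // Advance past exactly the text that was consumed.
            remaining = remaining.substring(window.length());
            processed++;
        }
        // Leftover text exists only when maxNumChunks stopped the loop early.
        String leftover = remaining.replace('\n', ' ').strip();
        if (leftover.length() > minChunkLengthToEmbed) {
            chunks.add(leftover);
        }
        return chunks;
    }

    public static void main(String[] args) {
        // Window of 10 characters, breakpoints honoured once past index 2.
        System.out.println(split("One. Two. Three.", 10, 2, 1, 100, true));
        // prints [One. Two., Three.]
    }
}
```

Note how the first window ("One. Two. ") is cut back to the last full stop, so the next chunk starts at a sentence boundary rather than mid-word.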

KeywordMetadataEnricher

@RestController
@RequestMapping("/keyword")
public class KeywordMetadataController {
  @Value("classpath:spring-source.pdf")
  private Resource resource;
  private final ChatModel chatModel;
  public KeywordMetadataController(ChatModel chatModel) { this.chatModel = chatModel; }
  @GetMapping("")
  public ResponseEntity<?> gen() {
    // Ask the chat model for 5 keywords per document.
    KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(chatModel, 5);
    List<Document> docs = new TikaDocumentReader(resource)
      .get().stream()
      // Store the file name as a String: a java.io.File value is awkward to
      // serialize, and resource.getFile() fails when the app runs from a jar.
      .peek(d -> d.getMetadata().put("source", resource.getFilename()))
      .toList();
    List<Document> result = enricher.apply(docs);
    return ResponseEntity.ok(result);
  }
}

The enricher works in four steps:

1. Create a prompt from each document's content.

2. Send the prompt to the provided ChatModel to generate keywords.

3. Insert the generated keywords into the document's metadata under the key excerpt_keywords.

4. Return the enriched documents.
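
The same flow can be sketched without a model by passing the keyword generator in as a function. This is a hedged, JDK-only illustration: the Doc record, the prompt wording, and the Function<String, String> standing in for ChatModel are assumptions, and unlike the real enricher this sketch copies the metadata instead of mutating it. Only the excerpt_keywords key comes from the text above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** JDK-only sketch of keyword enrichment with a pluggable "model". */
final class KeywordEnricherSketch {

    /** Hypothetical stand-in for Spring AI's Document. */
    record Doc(String text, Map<String, Object> metadata) {}

    static final String KEYWORDS_KEY = "excerpt_keywords";

    /** Builds a prompt per document, asks the model, and stores the answer in metadata. */
    static List<Doc> enrich(List<Doc> docs, int keywordCount, Function<String, String> model) {
        List<Doc> enriched = new ArrayList<>();
        for (Doc doc : docs) {
            // Hypothetical prompt wording; the library ships its own template.
            String prompt = "Give " + keywordCount
                    + " unique keywords for this document, comma-separated:\n" + doc.text();
            Map<String, Object> metadata = new HashMap<>(doc.metadata()); // copy, do not mutate
            metadata.put(KEYWORDS_KEY, model.apply(prompt));
            enriched.add(new Doc(doc.text(), metadata));
        }
        return enriched;
    }

    public static void main(String[] args) {
        // A canned answer replaces the chat model so the sketch runs offline.
        Function<String, String> fakeModel = prompt -> "spring, etl, rag";
        List<Doc> result = enrich(
                List.of(new Doc("Spring AI ETL", Map.of("source", "demo.txt"))), 3, fakeModel);
        System.out.println(result.get(0).metadata().get(KEYWORDS_KEY)); // prints spring, etl, rag
    }
}
```

Injecting the model as a function also makes the step easy to unit-test, since no network call is needed.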

DocumentWriter

FileDocumentWriter

ClassPathResource resource = new ClassPathResource("spring-source.docx");
TokenTextSplitter textSplitter = new TokenTextSplitter(); // default splitter settings
List<Document> docs = new TikaDocumentReader(resource)
  .get().stream()
  // Record the file name as a String rather than a java.io.File object.
  .peek(d -> d.getMetadata().put("source", resource.getFilename()))
  .toList();
List<Document> result = textSplitter.apply(docs);
// Arguments: fileName, withDocumentMarkers, metadataMode, append.
FileDocumentWriter writer = new FileDocumentWriter("d:\\output.txt", true, MetadataMode.ALL, false);
writer.accept(result);

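VectorStore

Spring AI's VectorStore interface extends DocumentWriter, so any configured vector store can serve as the final pipeline stage: passing the processed documents to its accept method embeds and persists them. Store-specific setup is covered in the Spring AI documentation.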
