Essential ETL Techniques for Spring AI RAG – A Must‑Read Guide

This article explains how Spring AI implements the ETL pipeline for Retrieval‑Augmented Generation. It covers the three core components (DocumentReader, DocumentTransformer, and DocumentWriter) with code examples, configuration parameters, and processing steps for text, PDF, and Tika document sources.

Spring Full-Stack Practical Cases

ETL Pipeline Overview

In Retrieval‑Augmented Generation (RAG), an ETL (Extract‑Transform‑Load) pipeline moves data from raw sources into a vector store: it extracts text from files, transforms it into chunks (optionally enriched with metadata), and loads the result so the AI model can retrieve it at query time.

Core Components

DocumentReader – extends Supplier<List<Document>>; reads raw files (PDF, text, HTML, etc.) and produces Document objects.

DocumentTransformer – extends Function<List<Document>, List<Document>>; converts or enriches documents (e.g., splitting, keyword extraction).

DocumentWriter – extends Consumer<List<Document>>; persists processed documents to a file or a vector store.
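
Because the three components are plain functional interfaces, a pipeline is just function composition. The following is a minimal JDK-only sketch of that idea, not Spring AI code: the Doc record, the lineSplitter transformer, and the in-memory sink are hypothetical stand-ins for Document, a real DocumentTransformer, and a real DocumentWriter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

final class EtlPipelineSketch {

    // Hypothetical stand-in for Spring AI's Document: text plus metadata.
    record Doc(String text, Map<String, Object> metadata) {}

    // Extract -> transform -> load, expressed only through the functional contracts.
    static void run(Supplier<List<Doc>> reader,
                    Function<List<Doc>, List<Doc>> transformer,
                    Consumer<List<Doc>> writer) {
        writer.accept(transformer.apply(reader.get()));
    }

    // Toy transformer: one output document per input line, metadata carried over.
    static Function<List<Doc>, List<Doc>> lineSplitter() {
        return docs -> docs.stream()
                .flatMap(d -> d.text().lines().map(line -> new Doc(line, d.metadata())))
                .toList();
    }

    public static void main(String[] args) {
        List<Doc> sink = new ArrayList<>(); // in-memory "store" for the demo
        Supplier<List<Doc>> reader =
                () -> List.of(new Doc("first line\nsecond line", Map.of("source", "demo.txt")));
        run(reader, lineSplitter(), sink::addAll);
        sink.forEach(System.out::println);
    }
}
```

With the real interfaces the same shape would apply: a reader's get(), a transformer's apply(), and a writer's accept() chained in that order.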

DocumentReader Implementations

TextReaderComponent

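import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.TextReader;
import org.springframework.beans.factory.SmartInitializingSingleton;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Component;

@Component
public class TextReaderComponent implements SmartInitializingSingleton {
  private final Resource resource;
  public TextReaderComponent(@Value("classpath:spring-source.txt") Resource resource) {
    this.resource = resource;
  }
  public List<Document> loadText() {
    TextReader textReader = new TextReader(this.resource);
    // Custom metadata is copied onto the resulting Document.
    textReader.getCustomMetadata().put("filename", "spring-source.txt");
    return textReader.read();
  }
  // Runs once all singletons are created; prints the loaded document for inspection.
  @Override
  public void afterSingletonsInstantiated() {
    System.err.println(loadText());
  }
}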

The reader creates a single Document containing the full file content. Its metadata holds the entries the reader adds automatically (charset, source) plus any custom entries added via getCustomMetadata().
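
To make that behaviour concrete, here is a rough JDK-only approximation of what such a reader does. It is a sketch under assumptions, not the TextReader source: the Doc record and the read method are hypothetical, and only the metadata keys charset and source come from the description above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class TextReaderSketch {

    // Hypothetical stand-in for Spring AI's Document.
    record Doc(String text, Map<String, Object> metadata) {}

    // Reads the whole stream into ONE document and merges custom and automatic metadata.
    static Doc read(InputStream in, String sourceName, Charset charset,
                    Map<String, Object> customMetadata) {
        try {
            String content = new String(in.readAllBytes(), charset);
            Map<String, Object> metadata = new LinkedHashMap<>(customMetadata);
            metadata.put("charset", charset.name()); // added automatically
            metadata.put("source", sourceName);      // added automatically
            return new Doc(content, metadata);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        InputStream in = new ByteArrayInputStream(
                "line one\nline two".getBytes(StandardCharsets.UTF_8));
        Doc doc = read(in, "spring-source.txt", StandardCharsets.UTF_8,
                Map.of("filename", "spring-source.txt"));
        System.out.println(doc.metadata());
        System.out.println(doc.text());
    }
}
```

Since the whole file ends up in one document, large text files usually need a splitting step before they are embedded.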

TextReader output

PagePdfDocumentReader

Maven dependency:

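org.springframework.ai:spring-ai-pdf-document-reader

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.ExtractedTextFormatter;
import org.springframework.ai.reader.pdf.PagePdfDocumentReader;
import org.springframework.ai.reader.pdf.config.PdfDocumentReaderConfig;
import org.springframework.beans.factory.SmartInitializingSingleton;
import org.springframework.stereotype.Component;

@Component
public class PdfReaderComponent implements SmartInitializingSingleton {
  public List<Document> getDocsFromPdf() {
    PagePdfDocumentReader pdfReader = new PagePdfDocumentReader(
      "classpath:spring-source.pdf",
      PdfDocumentReaderConfig.builder()
        .withPageTopMargin(0)
        .withPageExtractedTextFormatter(
          ExtractedTextFormatter.builder()
            // Keep every line; raise this to drop repeated page headers.
            .withNumberOfTopTextLinesToDelete(0)
            .build())
        // One Document per PDF page.
        .withPagesPerDocument(1)
        .build());
    return pdfReader.read();
  }
  // Runs once all singletons are created; prints the loaded documents for inspection.
  @Override
  public void afterSingletonsInstantiated() {
    System.err.println(getDocsFromPdf());
  }
}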

Because pagesPerDocument is set to 1, the component reads each PDF page as a separate Document.
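
The effect of the pages-per-document setting can be pictured as simple grouping. The sketch below is a hypothetical illustration in plain Java, assuming page texts are already extracted as strings; it is not how PagePdfDocumentReader is implemented and it handles only positive values.

```java
import java.util.ArrayList;
import java.util.List;

final class PagesPerDocumentSketch {

    // Groups consecutive page texts into documents of at most pagesPerDocument pages.
    static List<String> group(List<String> pages, int pagesPerDocument) {
        if (pagesPerDocument < 1) {
            throw new IllegalArgumentException("pagesPerDocument must be >= 1 in this sketch");
        }
        List<String> documents = new ArrayList<>();
        for (int start = 0; start < pages.size(); start += pagesPerDocument) {
            int end = Math.min(start + pagesPerDocument, pages.size());
            documents.add(String.join("\n", pages.subList(start, end)));
        }
        return documents;
    }

    public static void main(String[] args) {
        List<String> pages = List.of("page 1", "page 2", "page 3", "page 4", "page 5");
        System.out.println(group(pages, 1).size()); // one document per page
        System.out.println(group(pages, 2));        // last document holds the odd page
    }
}
```

Smaller groups give finer-grained retrieval hits, while larger groups keep more surrounding context in each document.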

PDFReader output

TikaDocumentReader

Maven dependency:

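org.springframework.ai:spring-ai-tika-document-reader

import java.util.List;

import org.springframework.ai.document.Document;
import org.springframework.ai.reader.tika.TikaDocumentReader;
import org.springframework.beans.factory.SmartInitializingSingleton;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.core.io.Resource;
import org.springframework.stereotype.Component;

@Component
public class TikaReaderComponent implements SmartInitializingSingleton {
  private final Resource resource;
  public TikaReaderComponent(@Value("classpath:spring-source.docx") Resource resource) {
    this.resource = resource;
  }
  public List<Document> loadText() {
    // Tika detects the file type and extracts its text.
    TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(this.resource);
    return tikaDocumentReader.read();
  }
  // Runs once all singletons are created; prints the loaded documents for inspection.
  @Override
  public void afterSingletonsInstantiated() {
    System.err.println(loadText());
  }
}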

The reader extracts text from PDF, DOC/DOCX, PPT/PPTX, HTML and other formats supported by Apache Tika.

TikaReader output

DocumentTransformer

TokenTextSplitter

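// Read the PDF through Tika and tag each document with its source file name.
List<Document> docs = new TikaDocumentReader(new ClassPathResource("spring-source.pdf"))
  .get().stream()
  .peek(document -> document.getMetadata().put("source", "spring-source.pdf"))
  .toList();
// Arguments: chunk size (tokens), min chunk size (chars), min length to embed, max chunks, keep separators.
TextSplitter ts = new TokenTextSplitter(300, 200, 10, 5000, true);
docs = ts.apply(docs);
System.err.println(docs);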
TokenTextSplitter output

TokenTextSplitter takes five constructor arguments, in this order:

defaultChunkSize : target token count per chunk (default 800).

minChunkSizeChars : minimum characters per chunk (default 350).

minChunkLengthToEmbed : minimum length for a chunk to be kept (default 5).

maxNumChunks : maximum number of chunks generated (default 10000).

keepSeparator : whether to retain separators such as line breaks (default true).

The splitter processes each document's text as follows:

1. Encode the text into tokens using the CL100K_BASE encoding.

2. Take up to defaultChunkSize tokens as the next chunk.

3. Decode the chunk back to text and look for a suitable breakpoint (., ?, !, or newline) after minChunkSizeChars; if one exists, truncate the chunk there. Trim the chunk, remove line breaks when keepSeparator is false, and keep the chunk only if it is longer than minChunkLengthToEmbed.

4. Repeat until all tokens are processed or maxNumChunks is reached.

5. If any text is left over and it is longer than minChunkLengthToEmbed, add it as the final chunk.
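
The steps above can be sketched in plain Java. This is a simplified illustration, not the TokenTextSplitter implementation: it assumes a "token" is a word plus its trailing whitespace instead of a CL100K_BASE token, and it only breaks at token boundaries. The class and method names are hypothetical; the parameter names follow the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ChunkingSketch {

    static List<String> split(String text, int chunkSize, int minChunkSizeChars,
                              int minChunkLengthToEmbed, int maxNumChunks, boolean keepSeparator) {
        if (chunkSize < 1) throw new IllegalArgumentException("chunkSize must be >= 1");
        List<String> chunks = new ArrayList<>();
        if (text == null || text.isBlank()) return chunks;
        // Assumption: word + trailing whitespace stands in for one encoded token.
        List<String> tokens = Arrays.asList(text.split("(?<=\\s)(?=\\S)"));
        int pos = 0;
        int produced = 0;
        while (pos < tokens.size() && produced < maxNumChunks) {
            int end = Math.min(pos + chunkSize, tokens.size());
            // Find the last sentence or line break that lies past minChunkSizeChars.
            int chars = 0;
            int breakEnd = -1;
            for (int i = pos; i < end; i++) {
                chars += tokens.get(i).length();
                if (chars > minChunkSizeChars && endsSentence(tokens.get(i))) breakEnd = i + 1;
            }
            if (breakEnd != -1) end = breakEnd; // truncate at the breakpoint
            String chunk = String.join("", tokens.subList(pos, end));
            chunk = keepSeparator ? chunk.trim() : chunk.replace("\n", " ").trim();
            if (chunk.length() > minChunkLengthToEmbed) chunks.add(chunk);
            pos = end;
            produced++;
        }
        // Leftover text (only possible when maxNumChunks stopped the loop).
        if (pos < tokens.size()) {
            String rest = String.join("", tokens.subList(pos, tokens.size()))
                    .replace("\n", " ").trim();
            if (rest.length() > minChunkLengthToEmbed) chunks.add(rest);
        }
        return chunks;
    }

    private static boolean endsSentence(String token) {
        String word = token.stripTrailing();
        return token.contains("\n") || word.endsWith(".") || word.endsWith("?") || word.endsWith("!");
    }

    public static void main(String[] args) {
        String text = "One two three. Four five six. Seven eight nine.";
        split(text, 4, 5, 3, 100, true).forEach(System.out::println);
    }
}
```

In the demo, a four-token window always contains a sentence end past the five-character minimum, so each chunk is cut back to a whole sentence rather than ending mid-sentence.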

KeywordMetadataEnricher

@RestController
@RequestMapping("/keyword")
public class KeywordMetadataController {
  @Value("classpath:spring-source.pdf")
  private Resource resource;
  private final ChatModel chatModel;
  public KeywordMetadataController(ChatModel chatModel) { this.chatModel = chatModel; }
  @GetMapping("")
  public ResponseEntity<List<Document>> gen() {
    // Ask the model for 5 keywords per document.
    KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(chatModel, 5);
    // Store the file name (a String) rather than a File object so the metadata
    // serializes cleanly and the resource also resolves from inside a jar.
    String source = resource.getFilename();
    List<Document> docs = new TikaDocumentReader(resource)
      .get().stream()
      .peek(d -> d.getMetadata().put("source", source))
      .toList();
    List<Document> result = enricher.apply(docs);
    return ResponseEntity.ok(result);
  }
}

For each document, the enricher works through these steps:

1. Create a prompt from the document's content.

2. Send the prompt to the provided ChatModel to generate keywords.

3. Insert the generated keywords into the document's metadata under the key excerpt_keywords.

4. Return the enriched documents.
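
The flow can be sketched without a real model. In the hypothetical example below, a Function<String, String> stands in for the ChatModel call, Doc stands in for Document, and the prompt wording is invented for illustration; only the excerpt_keywords key is taken from the text above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class KeywordEnricherSketch {

    // Hypothetical stand-in for Spring AI's Document.
    record Doc(String text, Map<String, Object> metadata) {}

    // For each document: build a prompt, ask the "model", store the answer as metadata.
    static List<Doc> enrich(List<Doc> docs, int keywordCount, Function<String, String> model) {
        List<Doc> enriched = new ArrayList<>();
        for (Doc doc : docs) {
            // Assumed prompt wording; the real template belongs to the library.
            String prompt = "Give %d unique keywords for this document, comma separated.\n\n%s"
                    .formatted(keywordCount, doc.text());
            Map<String, Object> metadata = new LinkedHashMap<>(doc.metadata());
            metadata.put("excerpt_keywords", model.apply(prompt));
            enriched.add(new Doc(doc.text(), metadata));
        }
        return enriched;
    }

    public static void main(String[] args) {
        // Fake model: returns a fixed answer instead of calling an LLM.
        Function<String, String> fakeModel = prompt -> "spring, etl, rag";
        List<Doc> docs = List.of(new Doc("Spring AI ETL pipeline", Map.of("source", "demo.pdf")));
        enrich(docs, 3, fakeModel).forEach(d -> System.out.println(d.metadata()));
    }
}
```

Each document triggers one model call in this sketch, so enriching many chunks costs a matching number of requests.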

KeywordMetadataEnricher output

DocumentWriter

FileDocumentWriter

ClassPathResource resource = new ClassPathResource("spring-source.docx");
TokenTextSplitter textSplitter = new TokenTextSplitter();
// Store the file name (a String) rather than a File object in the metadata.
String source = resource.getFilename();
List<Document> docs = new TikaDocumentReader(resource)
  .get().stream()
  .peek(d -> d.getMetadata().put("source", source))
  .toList();
List<Document> result = textSplitter.apply(docs);
// Arguments: output file, write document markers, metadata mode, append to an existing file.
FileDocumentWriter writer = new FileDocumentWriter("d:\\output.txt", true, MetadataMode.ALL, false);
writer.accept(result);
File writer output

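VectorStore as a DocumentWriter

To load documents into a vector database, pass them to a VectorStore: in Spring AI the VectorStore interface extends DocumentWriter, so a configured store can serve as the final step of the pipeline. Store-specific setup is covered in the Spring AI documentation.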

Written by

Spring Full-Stack Practical Cases