Essential ETL Techniques for Spring AI RAG – A Must‑Read Guide
This article explains how Spring AI implements the ETL pipeline for Retrieval-Augmented Generation. It covers the three core components (DocumentReader, DocumentTransformer, and DocumentWriter) with concrete code examples, configuration parameters, and processing steps for plain-text, PDF, and Tika-based document readers.
ETL Pipeline Overview
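In Retrieval-Augmented Generation (RAG) scenarios, an ETL (Extract, Transform, Load) pipeline moves data from raw sources into a vector store. It extracts text from files, transforms it into chunks (optionally enriched with metadata), and loads the result so that relevant passages can be retrieved for the AI model at query time.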
Core Components
DocumentReader – implements Supplier<List<Document>> to read raw files (PDF, text, HTML, etc.) and produce Document objects.
DocumentTransformer – implements Function<List<Document>, List<Document>> to convert or enrich documents (e.g., splitting, keyword extraction).
DocumentWriter – implements Consumer<List<Document>> to persist processed documents to a file or a vector store.
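Because the three interfaces are ordinary java.util.function types, a pipeline amounts to function composition. The following is a minimal plain-Java sketch of that idea, not Spring AI code: Doc is a hypothetical stand-in for Spring AI's Document, and the reader, sentenceSplitter, writer, and run helpers are invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Supplier;

// Plain-Java sketch of the reader -> transformer -> writer contract.
// Doc is a hypothetical stand-in for Spring AI's Document class.
class EtlPipelineSketch {

    record Doc(String text, Map<String, Object> metadata) {}

    // Reader: supplies raw documents (here, a single in-memory document).
    static Supplier<List<Doc>> reader(String text, String source) {
        return () -> List.of(new Doc(text, Map.of("source", source)));
    }

    // Transformer: turns each document into one document per sentence.
    static Function<List<Doc>, List<Doc>> sentenceSplitter() {
        return docs -> {
            List<Doc> out = new ArrayList<>();
            for (Doc doc : docs) {
                for (String part : doc.text().split("(?<=[.?!])\\s+")) {
                    if (!part.isBlank()) {
                        out.add(new Doc(part.strip(), doc.metadata()));
                    }
                }
            }
            return out;
        };
    }

    // Writer: consumes the processed documents (here, into an in-memory list).
    static Consumer<List<Doc>> writer(List<Doc> store) {
        return store::addAll;
    }

    // Wires the three stages together: extract, transform, load.
    static List<Doc> run(String text, String source) {
        List<Doc> store = new ArrayList<>();
        writer(store).accept(sentenceSplitter().apply(reader(text, source).get()));
        return store;
    }

    public static void main(String[] args) {
        run("Spring AI reads files. It splits them. It stores them.", "demo.txt")
                .forEach(System.out::println);
    }
}
```

In a real application the same wiring would use a concrete reader, transformer, and writer, for example writer.accept(transformer.apply(reader.get())).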
DocumentReader Implementations
TextReaderComponent
    @Component
    public class TextReaderComponent implements SmartInitializingSingleton {
        private final Resource resource;

        // The text file is resolved from the classpath and injected by Spring.
        public TextReaderComponent(@Value("classpath:spring-source.txt") Resource resource) {
            this.resource = resource;
        }

        public List<Document> loadText() {
            TextReader textReader = new TextReader(this.resource);
            // Custom metadata is copied onto the Document that the reader produces.
            textReader.getCustomMetadata().put("filename", "text-source.txt");
            return textReader.read();
        }

        // Runs once all singletons are created; prints the loaded documents for inspection.
        @Override
        public void afterSingletonsInstantiated() {
            System.err.println(loadText());
        }
    }
The reader creates a single Document containing the full file content. It automatically adds metadata (charset, source) along with any custom metadata added via getCustomMetadata().
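To make the resulting Document shape concrete, here is a rough plain-Java sketch of what such a reader produces. It is an illustration under assumptions, not the TextReader implementation: Doc is a hypothetical stand-in for Spring AI's Document, and the read helper takes raw bytes instead of a Spring Resource so that it has no dependencies.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a text reader: one document for the whole input, with
// charset and source metadata merged with caller-supplied metadata.
// Doc is a hypothetical stand-in for Spring AI's Document class.
class TextReaderSketch {

    record Doc(String text, Map<String, Object> metadata) {}

    static List<Doc> read(byte[] rawBytes, String sourceName, Charset charset,
                          Map<String, Object> customMetadata) {
        // Start from the custom metadata, then add the automatic entries.
        Map<String, Object> metadata = new HashMap<>(customMetadata);
        metadata.put("charset", charset.name());
        metadata.put("source", sourceName);
        // The full content becomes a single document.
        return List.of(new Doc(new String(rawBytes, charset), metadata));
    }

    public static void main(String[] args) {
        byte[] raw = "Spring AI\nETL pipeline".getBytes(StandardCharsets.UTF_8);
        Doc doc = read(raw, "spring-source.txt", StandardCharsets.UTF_8,
                Map.of("filename", "text-source.txt")).get(0);
        System.out.println(doc.metadata());
    }
}
```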
PagePdfDocumentReader
Maven dependency:
org.springframework.ai:spring-ai-pdf-document-reader
    @Component
    public class PdfReaderComponent implements SmartInitializingSingleton {

        public List<Document> getDocsFromPdf() {
            PagePdfDocumentReader pdfReader = new PagePdfDocumentReader(
                    "classpath:spring-source.pdf",
                    PdfDocumentReaderConfig.builder()
                            .withPageTopMargin(0)
                            .withPageExtractedTextFormatter(
                                    ExtractedTextFormatter.builder()
                                            .withNumberOfTopTextLinesToDelete(0)
                                            .build())
                            // One page per Document.
                            .withPagesPerDocument(1)
                            .build());
            return pdfReader.read();
        }

        // Runs once all singletons are created; prints the loaded documents for inspection.
        @Override
        public void afterSingletonsInstantiated() {
            System.err.println(getDocsFromPdf());
        }
    }
With withPagesPerDocument(1), the component reads each PDF page as a separate Document.
TikaDocumentReader
Maven dependency:
org.springframework.ai:spring-ai-tika-document-reader
    @Component
    public class TikaReaderComponent implements SmartInitializingSingleton {
        private final Resource resource;

        // The .docx file is resolved from the classpath and injected by Spring.
        public TikaReaderComponent(@Value("classpath:spring-source.docx") Resource resource) {
            this.resource = resource;
        }

        public List<Document> loadText() {
            TikaDocumentReader tikaDocumentReader = new TikaDocumentReader(this.resource);
            return tikaDocumentReader.read();
        }

        // Runs once all singletons are created; prints the loaded documents for inspection.
        @Override
        public void afterSingletonsInstantiated() {
            System.err.println(loadText());
        }
    }
The reader extracts text from PDF, DOC/DOCX, PPT/PPTX, HTML, and other formats supported by Apache Tika.
DocumentTransformer
TokenTextSplitter
    // Read the PDF through Tika and tag every document with its source file.
    List<Document> docs = new TikaDocumentReader(new ClassPathResource("spring-source.pdf"))
            .get().stream()
            .peek(document -> document.getMetadata().put("source", "spring-source.pdf"))
            .toList();
    // Arguments: defaultChunkSize, minChunkSizeChars, minChunkLengthToEmbed, maxNumChunks, keepSeparator.
    TextSplitter ts = new TokenTextSplitter(300, 200, 10, 5000, true);
    docs = ts.apply(docs);
    System.err.println(docs);
Constructor parameters, in order:
defaultChunkSize: target token count per chunk (default 800).
minChunkSizeChars: minimum characters per chunk (default 350).
minChunkLengthToEmbed: minimum length for a chunk to be kept (default 5).
maxNumChunks: maximum number of chunks generated (default 10000).
keepSeparator: whether to retain separators such as line breaks (default true).
The splitter processes text in these steps:
1. Encode the text into tokens using the CL100K_BASE encoding.
2. Take the next defaultChunkSize tokens as a candidate chunk.
3. Decode the chunk back to text and find its last sentence break (., ?, !, or newline). If that break lies beyond minChunkSizeChars, truncate the chunk there. When keepSeparator is false, replace line breaks with spaces. Keep the chunk only if its length exceeds minChunkLengthToEmbed.
4. Repeat until all tokens are processed or maxNumChunks is reached.
5. If leftover text exceeds minChunkLengthToEmbed, add it as the final chunk.
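The loop described above can be sketched in plain Java. This is a simplified illustration, not the TokenTextSplitter source. It makes three assumptions: whitespace-delimited words stand in for CL100K_BASE tokens, breakpoints are only considered at word boundaries, and keepSeparator handling is omitted. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified illustration of the chunking loop.
// Assumption: whitespace-delimited words stand in for CL100K_BASE tokens.
class ChunkingSketch {

    static List<String> split(String text, int chunkSize, int minChunkSizeChars,
                              int minChunkLengthToEmbed, int maxNumChunks) {
        List<String> chunks = new ArrayList<>();
        if (text == null || text.isBlank()) {
            return chunks;
        }
        // "Encode": each token keeps its trailing whitespace, so joining tokens restores the text.
        List<String> tokens = Arrays.asList(text.split("(?<=\\s)"));
        int pos = 0;
        int produced = 0;
        while (pos < tokens.size() && produced < maxNumChunks) {
            int end = Math.min(pos + chunkSize, tokens.size());
            // Walk the window and remember the last token that ends a sentence or line
            // once more than minChunkSizeChars characters have been collected.
            int cut = end;
            int chars = 0;
            for (int i = pos; i < end; i++) {
                String token = tokens.get(i);
                chars += token.length();
                if (chars > minChunkSizeChars && endsSentenceOrLine(token)) {
                    cut = i + 1;
                }
            }
            // "Decode" the tokens up to the cut and keep the chunk if it is long enough.
            String chunk = String.join("", tokens.subList(pos, cut)).strip();
            if (chunk.length() > minChunkLengthToEmbed) {
                chunks.add(chunk);
            }
            pos = cut;
            produced++;
        }
        // Leftover tokens remain only when maxNumChunks stopped the loop early.
        if (pos < tokens.size()) {
            String rest = String.join("", tokens.subList(pos, tokens.size())).strip();
            if (rest.length() > minChunkLengthToEmbed) {
                chunks.add(rest);
            }
        }
        return chunks;
    }

    private static boolean endsSentenceOrLine(String token) {
        String trimmed = token.stripTrailing();
        return token.contains("\n") || trimmed.endsWith(".")
                || trimmed.endsWith("?") || trimmed.endsWith("!");
    }
}
```

For example, split("One two three. Four five six. Seven eight nine.", 4, 5, 3, 100) yields the three chunks "One two three.", "Four five six.", and "Seven eight nine.". Each four-word window is cut back to its last sentence end, and the word that was cut off starts the next window.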
KeywordMetadataEnricher
    @RestController
    @RequestMapping("/keyword")
    public class KeywordMetadataController {
        @Value("classpath:spring-source.pdf")
        private Resource resource;

        private final ChatModel chatModel;

        public KeywordMetadataController(ChatModel chatModel) {
            this.chatModel = chatModel;
        }

        @GetMapping("")
        public ResponseEntity<?> gen() throws Throwable {
            // Ask the model for 5 keywords per document.
            KeywordMetadataEnricher enricher = new KeywordMetadataEnricher(chatModel, 5);
            File file = resource.getFile();
            // Read the PDF through Tika and tag every document with its source file.
            List<Document> docs = new TikaDocumentReader(resource)
                    .get().stream()
                    .peek(d -> d.getMetadata().put("source", file))
                    .toList();
            List<Document> result = enricher.apply(docs);
            return ResponseEntity.ok(result);
        }
    }
The enricher works in these steps:
1. Create a prompt from each document's content.
2. Send the prompt to the provided ChatModel to generate keywords.
3. Insert the generated keywords into the document's metadata under the key excerpt_keywords.
4. Return the enriched documents.
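The same flow can be sketched without any model dependency. The following is a hypothetical plain-Java illustration, not the KeywordMetadataEnricher source: Doc stands in for Spring AI's Document, a Function<String, String> stands in for the ChatModel call, and the prompt wording is invented.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Sketch of the keyword-enrichment flow with the chat model abstracted away.
// Assumptions: Doc stands in for Spring AI's Document, the Function stands in
// for the ChatModel call, and the prompt wording is illustrative only.
class KeywordEnricherSketch {

    record Doc(String text, Map<String, Object> metadata) {}

    static List<Doc> enrich(List<Doc> docs, Function<String, String> model, int keywordCount) {
        for (Doc doc : docs) {
            // 1. Build a prompt from the document content.
            String prompt = "Give " + keywordCount + " unique keywords for this text, "
                    + "comma separated:\n" + doc.text();
            // 2. Ask the model for keywords.
            String keywords = model.apply(prompt);
            // 3. Store the answer in the metadata under excerpt_keywords.
            doc.metadata().put("excerpt_keywords", keywords);
        }
        // 4. Return the same documents, now enriched.
        return docs;
    }

    public static void main(String[] args) {
        // A fake model that always returns the same keywords, for demonstration.
        Function<String, String> fakeModel = prompt -> "spring, ai, etl";
        List<Doc> docs = List.of(new Doc("Spring AI ETL pipeline", new HashMap<>()));
        System.out.println(enrich(docs, fakeModel, 3).get(0).metadata());
        // prints {excerpt_keywords=spring, ai, etl}
    }
}
```

Passing the model as a function keeps the sketch testable with a stub. In the controller above, the injected ChatModel plays that role.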
DocumentWriter
FileDocumentWriter
    ClassPathResource resource = new ClassPathResource("spring-source.docx");
    TokenTextSplitter textSplitter = new TokenTextSplitter();
    File file = resource.getFile();
    // Read the .docx through Tika and tag every document with its source file.
    List<Document> docs = new TikaDocumentReader(resource)
            .get().stream()
            .peek(d -> d.getMetadata().put("source", file))
            .toList();
    List<Document> result = textSplitter.apply(docs);
    // Arguments: fileName, withDocumentMarkers, metadataMode, append.
    FileDocumentWriter writer = new FileDocumentWriter("d:\\output.txt", true, MetadataMode.ALL, false);
    writer.accept(result);
VectorStoreDocumentWriter
Spring AI also provides VectorStoreDocumentWriter for persisting documents to a vector database; usage details are available in the Spring AI documentation.
Spring Full-Stack Practical Cases
Full-stack Java development with Vue 2/3 front-end suite; hands-on examples and source code analysis for Spring, Spring Boot 2/3, and Spring Cloud.