Deep Dive into an Asynchronous Spring Boot + Tesseract OCR Pipeline for Invoice Recognition
This article presents a complete design and implementation of a high‑throughput, asynchronous OCR pipeline built with Spring Boot and Tesseract, covering distributed architecture, thread‑pool tuning, image‑preprocessing, multi‑engine recognition, data extraction strategies, Kubernetes deployment, security compliance, chaos testing, and future AI‑driven enhancements.
1. System Architecture Design
The solution adopts a distributed pipeline where invoice files are ingested, pre‑processed, OCR‑processed, and finally extracted into structured data. Core components include a Spring Boot service layer, asynchronous executors, Tesseract OCR, image‑preprocessing, hybrid recognition, and a data‑extraction engine.
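The stage ordering described above can be sketched in plain Java, without Spring. This is a minimal illustration, not the article's implementation: the class name `PipelineSketch`, the string-based placeholder stages, and the pool sizes are all assumptions chosen to keep the example self-contained.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class PipelineSketch {

    // Placeholder stages: each stands in for a real pipeline component.
    static String preprocess(String file) { return file + " -> preprocessed"; }
    static String ocr(String image)       { return image + " -> ocr-text"; }
    static String extract(String text)    { return text + " -> structured"; }

    /** Chains the stages, running I/O-bound steps and OCR on separate pools. */
    static CompletableFuture<String> process(String file,
                                             ExecutorService ioPool,
                                             ExecutorService ocrPool) {
        return CompletableFuture
                .supplyAsync(() -> preprocess(file), ioPool)
                .thenApplyAsync(PipelineSketch::ocr, ocrPool)
                .thenApplyAsync(PipelineSketch::extract, ioPool);
    }

    /** Convenience wrapper that owns the pools and blocks for the result. */
    static String runOnce(String file) {
        ExecutorService ioPool = Executors.newFixedThreadPool(4);  // assumed size
        ExecutorService ocrPool = Executors.newFixedThreadPool(2); // assumed size
        try {
            return process(file, ioPool, ocrPool).join();
        } finally {
            ioPool.shutdown();
            ocrPool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(runOnce("invoice.pdf"));
    }
}
```

Keeping OCR on its own pool means a burst of uploads cannot starve recognition of threads, and the reverse also holds.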
2. Spring Boot Asynchronous Framework
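import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;

import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {

    // Pool for OCR work, which is CPU-bound.
    @Bean("ocrExecutor")
    public Executor ocrTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(20);
        executor.setMaxPoolSize(50);
        executor.setQueueCapacity(1000);
        executor.setThreadNamePrefix("OCR-Thread-");
        // When pool and queue are full, run the task on the submitting thread (back-pressure).
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        executor.initialize();
        return executor;
    }

    // Larger pool for I/O-bound work (uploads, storage, conversion).
    // No rejection handler is set, so the default policy rejects tasks once pool and queue are full.
    @Bean("ioExecutor")
    public Executor ioTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(50);
        executor.setMaxPoolSize(200);
        executor.setQueueCapacity(5000);
        executor.setThreadNamePrefix("IO-Thread-");
        executor.initialize();
        return executor;
    }
}
Service methods are annotated with @Async to run on the dedicated executors: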
@Service
public class InvoiceProcessingService {

    @Async("ioExecutor")
    public CompletableFuture<BufferedImage> preprocessInvoice(MultipartFile file) { /* type detection, storage, conversion, enhancement */ }

    // Accepts the preprocessed image so the stages can be chained directly.
    @Async("ocrExecutor")
    public CompletableFuture<OcrResult> performOcr(BufferedImage image) { /* Tesseract init and OCR */ }

    @Async("ioExecutor")
    public CompletableFuture<InvoiceData> extractData(OcrResult ocrResult) { /* regex extraction, ML validation */ }
}
The controller accepts the upload, starts the pipeline, and returns immediately with a task ID:
@RestController
@RequestMapping("/invoice")
public class InvoiceController {

    // processingService, storageService, notificationService, and errorService
    // are injected fields (declarations omitted).

    @PostMapping("/process")
    public ResponseEntity<ProcessResponse> processInvoice(@RequestParam("file") MultipartFile file) {
        String taskId = UUID.randomUUID().toString();
        // The @Async methods already return CompletableFuture, so chain them directly;
        // wrapping the first call in supplyAsync would nest one future inside another.
        // Caveat: multipart temp storage may be cleaned up when the request completes,
        // so copy the upload before handing it to async code if that applies to your setup.
        processingService.preprocessInvoice(file)
            .thenCompose(processingService::performOcr)
            .thenCompose(processingService::extractData)
            .thenAccept(data -> {
                storageService.saveResult(taskId, data);
                notificationService.notifyClient(taskId, data);
            })
            .exceptionally(ex -> { errorService.logError(taskId, ex); return null; });
        return ResponseEntity.accepted().body(new ProcessResponse(taskId, "Processing started"));
    }
}
3. Tesseract Deep Optimization
A custom, invoice-specific model is trained with the standard Tesseract training workflow (box-file generation, feature training, clustering, and combining the outputs into a traineddata file). Image preprocessing applies grayscale conversion, adaptive thresholding, non-local means denoising, line enhancement, and deskewing:
public class ImagePreprocessor {

    // Runs the preprocessing stages in order; toGrayscale, enhanceLines, and deskew are not shown.
    public BufferedImage preprocess(BufferedImage original) {
        BufferedImage gray = toGrayscale(original);
        BufferedImage binary = adaptiveThreshold(gray);
        BufferedImage denoised = denoise(binary);
        BufferedImage enhanced = enhanceLines(denoised);
        return deskew(enhanced);
    }

    private BufferedImage adaptiveThreshold(BufferedImage gray) { /* OpenCV ADAPTIVE_THRESH_GAUSSIAN_C */ }

    private BufferedImage denoise(BufferedImage img) { /* fastNlMeansDenoising */ }
}
A hybrid OCR service selects the best engine per region (table, handwriting, or generic) and merges the results:
public class HybridOcrService {

    public String recognize(File image) {
        List<Region> regions = segmentRegions(image);
        // Route each region to the most suitable engine, then join the text line by line.
        return regions.stream().map(region -> {
            if (isTableRegion(region)) return tableOcrEngine.recognize(region);
            if (isHandwritingRegion(region)) return handwritingEngine.recognize(region);
            return tesseract.recognize(region);
        }).collect(Collectors.joining("\n"));
    }
}
4. Structured Data Extraction
A pluggable extraction engine runs a sequence of strategies (regex, positional, ML‑based) and stops early when the invoice data is complete:
public class DataExtractionEngine {

    private final List<ExtractionStrategy> strategies = Arrays.asList(
        new RegexStrategy(), new PositionalStrategy(), new MLBasedStrategy());

    public InvoiceData extract(String ocrText) {
        InvoiceData result = new InvoiceData();
        for (ExtractionStrategy s : strategies) {
            s.extract(ocrText, result);
            if (result.isComplete()) break;
        }
        return result;
    }
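}
The regex strategy captures the invoice number, date, and total amount. For the invoice number (labelled 发票号码 on the invoice) it uses a pattern such as "发票号码[::]\s*(\w{8,12})", where the character class accepts either an ASCII or a full-width colon.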
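A regex strategy along these lines might look like the sketch below. Only the invoice-number pattern comes from the article. The date and total patterns, their labels (开票日期 and 价税合计), the output keys, and the class name `RegexExtractionSketch` are assumptions for illustration, and real invoices will likely need more tolerant patterns.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexExtractionSketch {

    // From the article: invoice number after an ASCII or full-width colon.
    static final Pattern NUMBER = Pattern.compile("发票号码[::]\\s*(\\w{8,12})");
    // Assumed label and format: issue date written as yyyy年MM月dd日.
    static final Pattern DATE =
            Pattern.compile("开票日期[::]\\s*(\\d{4})年(\\d{1,2})月(\\d{1,2})日");
    // Assumed label: total amount, skipping any non-digit text before the number.
    static final Pattern TOTAL = Pattern.compile("价税合计[^\\d]*(\\d+(?:\\.\\d{1,2})?)");

    /** Returns the fields that could be found; missing fields are simply absent. */
    static Map<String, String> extract(String ocrText) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = NUMBER.matcher(ocrText);
        if (m.find()) {
            fields.put("invoiceNumber", m.group(1));
        }
        m = DATE.matcher(ocrText);
        if (m.find()) {
            // Normalise to ISO-style yyyy-MM-dd.
            fields.put("date", String.format("%s-%02d-%02d", m.group(1),
                    Integer.parseInt(m.group(2)), Integer.parseInt(m.group(3))));
        }
        m = TOTAL.matcher(ocrText);
        if (m.find()) {
            fields.put("total", m.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "发票号码:12345678\n开票日期:2024年03月15日\n价税合计(小写)¥1130.00";
        System.out.println(extract(text));
    }
}
```

Because missing fields are left out instead of raising an error, a later strategy in the chain can fill the gaps before the completeness check.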
5. Performance Optimisation
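Scaling is achieved with a Kubernetes deployment of ten OCR worker replicas, one GPU per worker, and caching layers. Stock Tesseract runs on the CPU, so the GPU allocation mainly benefits any GPU-capable engines in the hybrid setup. Example deployment manifest: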
# ocr-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ocr-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: ocr-worker
  template:
    metadata:
      labels:
        app: ocr-worker
    spec:
      containers:
        - name: ocr
          image: ocr-service:3.0
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 8Gi
            requests:
              memory: 4Gi
          env:
            - name: TESSDATA_PREFIX
              value: /tessdata
          volumeMounts:
            - name: tessdata
              mountPath: /tessdata
      volumes:
        - name: tessdata
          persistentVolumeClaim:
            claimName: tessdata-pvc
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000
globalDefault: false
description: "High‑priority GPU tasks"
6. Production‑Ready Concerns
Security and compliance are addressed with GDPR-style PII detection, support for Chinese fiscal regulations, audit logging, and optional blockchain notarisation of key operations.
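As a rough illustration of the PII-detection step, the sketch below masks e-mail addresses and 11-digit mobile numbers in OCR text before it is logged or stored. The two patterns, the masking format, and the class name `PiiMaskingSketch` are assumptions; a production detector would cover many more categories and would not rely on regular expressions alone.

```java
import java.util.regex.Pattern;

final class PiiMaskingSketch {

    // Simplified e-mail pattern (assumption; not a full RFC 5322 matcher).
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // 11-digit mobile number starting with 1, not embedded in a longer digit run.
    private static final Pattern MOBILE =
            Pattern.compile("(?<!\\d)(1\\d{2})\\d{4}(\\d{4})(?!\\d)");

    /** Replaces e-mails with a tag and hides the middle four digits of mobile numbers. */
    static String mask(String text) {
        String masked = EMAIL.matcher(text).replaceAll("[EMAIL]");
        return MOBILE.matcher(masked).replaceAll("$1****$2");
    }

    public static void main(String[] args) {
        System.out.println(mask("Contact: li@example.com, 13812345678"));
    }
}
```

Masking before the audit log is written keeps raw personal data out of long-lived log storage.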
7. Testing and Validation
Chaos engineering validates resilience under latency injection (500‑2000 ms) and a 10 % error rate while running a 1000‑concurrent load test. The test asserts that the observed error rate stays below 5 %:
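public class ChaosTest {

    // ChaosMonkey and loadTester are illustrative helpers, not a specific library API.
    @Test
    public void testOcrPipelineResilience() {
        ChaosMonkey.enable()
            .latency(500, 2000)      // inject 500-2000 ms of latency
            .exceptionRate(0.1)      // fail 10% of calls
            .enable();
        try {
            loadTester.run(1000);    // 1000 concurrent requests
            double errorRate = loadTester.getErrorRate();  // hypothetical accessor
            assertTrue("Error rate < 5%", errorRate < 0.05);
        } finally {
            ChaosMonkey.disable();   // always restore normal behaviour, even if the assertion fails
        }
    }
}
Accuracy is measured with a verification matrix (shown in the original figures) and a BERT‑based semantic validator that accepts a result only when its confidence exceeds 0.8.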
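The threshold rule can be expressed as a small gate around whatever model produces the confidence score. In this sketch the scorer is a plain function standing in for the BERT model, and the class name `SemanticValidatorSketch` and its stub scorer are assumptions made only for illustration.

```java
import java.util.function.ToDoubleFunction;

final class SemanticValidatorSketch {

    static final double THRESHOLD = 0.8; // from the article: accept above 0.8

    // Stand-in for the model: maps extracted text to a confidence in [0, 1].
    private final ToDoubleFunction<String> scorer;

    SemanticValidatorSketch(ToDoubleFunction<String> scorer) {
        this.scorer = scorer;
    }

    /** Returns true only when the confidence is strictly greater than the threshold. */
    boolean isValid(String extractedText) {
        return scorer.applyAsDouble(extractedText) > THRESHOLD;
    }

    public static void main(String[] args) {
        // Stub scorer for the demo: longer text is treated as more plausible.
        SemanticValidatorSketch validator =
                new SemanticValidatorSketch(t -> Math.min(1.0, t.length() / 10.0));
        System.out.println(validator.isValid("Invoice 12345678"));
        System.out.println(validator.isValid("???"));
    }
}
```

Injecting the scorer keeps the threshold logic testable without loading a model.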
8. Future Evolution
Planned enhancements include a self‑learning OCR model that continuously fine‑tunes on newly processed invoices, cross‑chain blockchain anchoring of invoice hashes (Hyperledger/Ethereum), and intelligent audit modules for anomaly detection and tax‑risk alerts.
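Anchoring an invoice hash starts with computing a stable digest of the invoice content. The sketch below assumes SHA-256 and a hex encoding, neither of which the article specifies, and it leaves the actual submission to Hyperledger or Ethereum out of scope. The class name `InvoiceHashSketch` is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

final class InvoiceHashSketch {

    /** Returns the lowercase hex SHA-256 digest of the given bytes. */
    static String sha256Hex(byte[] content) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(content);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required on every Java platform, so this should not happen.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] invoiceBytes = "example invoice content".getBytes(StandardCharsets.UTF_8);
        System.out.println(sha256Hex(invoiceBytes));
    }
}
```

Only the digest would be written to the ledger, so the invoice itself never leaves the system.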
SpringMeng
Focused on software development, sharing source code and tutorials for various systems.
