Build a Scalable High‑Performance OCR Invoice Pipeline with Spring Boot & Tesseract
This article presents a comprehensive, high‑throughput OCR invoice processing solution that combines distributed system design, Spring Boot asynchronous execution, Tesseract deep optimization, multi‑engine fusion, structured data extraction, performance tuning, Kubernetes deployment, and security compliance.
System Architecture Design
Distributed pipeline architecture for scalable processing
Core components: API Gateway, File Pre‑processing, OCR Engine, Data Extraction, Message Queue, Storage System
Data flow design ensuring end‑to‑end throughput and reliability
1.1 Distributed Pipeline Architecture
1.2 Core Component Responsibilities
API Gateway (Spring Cloud Gateway): request routing, rate limiting – supports 5000+ TPS
File Pre‑processing (OpenCV + ImageMagick): format conversion, denoising, enhancement – 100 ms per image
OCR Engine (Tesseract 5.3): text recognition – average 1.5 s per page
Data Extraction (Rule Engine + ML Model): structured data extraction – accuracy >96 %
Message Queue (RabbitMQ): task distribution, spike smoothing – >100 k messages/sec
Storage System (MinIO + MySQL): file and metadata storage – PB‑level capacity
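The gateway's rate limiting listed above is commonly built on a token bucket. The sketch below is a minimal in-memory version for illustration only; it is not Spring Cloud Gateway's implementation. The class name `TokenBucket`, the injected clock, and the burst size of 100 are assumptions introduced here; only the 5000 requests/s figure comes from the list.

```java
import java.util.function.LongSupplier;

/** Minimal in-memory token bucket (illustrative; not Spring Cloud Gateway's implementation). */
final class TokenBucket {
    private final long capacity;         // maximum burst size, in tokens
    private final double refillPerNano;  // tokens added per elapsed nanosecond
    private final LongSupplier clock;    // injected so behaviour can be tested deterministically
    private double tokens;
    private long lastRefillNanos;

    TokenBucket(long capacity, double refillPerSecond, LongSupplier nanoClock) {
        this.capacity = capacity;
        this.refillPerNano = refillPerSecond / 1_000_000_000.0;
        this.clock = nanoClock;
        this.tokens = capacity;          // start full so an initial burst is allowed
        this.lastRefillNanos = nanoClock.getAsLong();
    }

    /** Returns true if the request may proceed, false if it should be rejected (e.g. HTTP 429). */
    synchronized boolean tryAcquire() {
        long now = clock.getAsLong();
        // Credit tokens for the time elapsed since the last call, capped at the bucket size.
        tokens = Math.min(capacity, tokens + (now - lastRefillNanos) * refillPerNano);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        // 5000 requests/s sustained (figure from the component list), burst of 100 (assumed).
        TokenBucket bucket = new TokenBucket(100, 5000, System::nanoTime);
        int accepted = 0;
        for (int i = 0; i < 1000; i++) {
            if (bucket.tryAcquire()) {
                accepted++;
            }
        }
        System.out.println("accepted " + accepted + " of 1000 back-to-back requests");
    }
}
```

The burst capacity bounds how many requests a sudden spike can push through at once, while the refill rate bounds sustained throughput; requests beyond both are rejected at the gateway rather than queued behind the OCR workers.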
1.3 Data Flow Design
Spring Boot Asynchronous Framework Implementation
2.1 Thread Pool Optimization
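import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {

    // Pool for CPU-heavy OCR work: kept small, with back-pressure when saturated.
    @Bean("ocrExecutor")
    public Executor ocrTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(20);
        executor.setMaxPoolSize(50);
        executor.setQueueCapacity(1000);
        executor.setThreadNamePrefix("OCR-Thread-");
        // When the queue is full, run the task on the submitting thread instead of rejecting it.
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        executor.initialize();
        return executor;
    }

    // Pool for I/O-bound work (storage, format conversion): more threads, deeper queue.
    // No handler is set, so the default AbortPolicy rejects tasks once the queue is full.
    @Bean("ioExecutor")
    public Executor ioTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(50);
        executor.setMaxPoolSize(200);
        executor.setQueueCapacity(5000);
        executor.setThreadNamePrefix("IO-Thread-");
        executor.initialize();
        return executor;
    }
}
2.2 Asynchronous Service Layer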
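import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import javax.imageio.ImageIO;
import net.sourceforge.tess4j.ITessAPI.TessPageIteratorLevel;
import net.sourceforge.tess4j.ITessAPI.TessPageSegMode;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import net.sourceforge.tess4j.Word;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

@Service
public class InvoiceProcessingService {
    // SUPPORTED_TYPES and the collaborators used below (storageService, imageConverter,
    // imageEnhancer, regexExtractor, dataValidator, mlValidator) are not shown in the
    // article; they are assumed to be defined or injected in this class.

    @Async("ioExecutor")
    public CompletableFuture<File> preprocessInvoice(MultipartFile file) {
        // 1. Detect file type
        String contentType = file.getContentType();
        if (!SUPPORTED_TYPES.contains(contentType)) {
            throw new UnsupportedFileTypeException();
        }
        // 2. Store raw file
        Path rawPath = storageService.store(file);
        // 3. Convert format (e.g., PDF → JPG)
        Path processedPath = imageConverter.convert(rawPath);
        // 4. Image enhancement. The enhancer is assumed to write its output to disk and
        //    return that file, so the value matches the declared CompletableFuture<File>.
        File enhancedImage = imageEnhancer.enhance(processedPath);
        return CompletableFuture.completedFuture(enhancedImage);
    }

    @Async("ocrExecutor")
    public CompletableFuture<OcrResult> performOcr(File image) {
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("/tessdata");
        tesseract.setLanguage("chi_sim+eng");
        tesseract.setPageSegMode(TessPageSegMode.PSM_AUTO);
        try {
            BufferedImage page = ImageIO.read(image);
            String text = tesseract.doOCR(page);
            // Tess4J exposes per-word confidence through getWords(image, level);
            // this call may repeat recognition, so it adds to the per-page cost.
            double confidence = tesseract.getWords(page, TessPageIteratorLevel.RIL_WORD).stream()
                    .mapToDouble(Word::getConfidence)
                    .average().orElse(0);
            return CompletableFuture.completedFuture(new OcrResult(text, confidence));
        } catch (TesseractException | IOException e) {
            // Surface the failure through the future so the pipeline's error handler sees it (Java 9+).
            return CompletableFuture.failedFuture(e);
        }
    }

    @Async("ioExecutor")
    public CompletableFuture<InvoiceData> extractData(OcrResult ocrResult) {
        InvoiceData data = regexExtractor.extract(ocrResult.getText());
        if (dataValidator.requiresMlCheck(data)) {
            data = mlValidator.validate(data);
        }
        data.setOcrConfidence(ocrResult.getConfidence());
        data.setProcessingTime(System.currentTimeMillis());
        return CompletableFuture.completedFuture(data);
    }
}
2.3 Asynchronous Pipeline Orchestration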
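@RestController
@RequestMapping("/invoice")
public class InvoiceController {
    // preprocessService, extractionService, storageService, notificationService and
    // errorService are assumed to be injected; the article omits the field declarations.

    @PostMapping("/process")
    public ResponseEntity<ProcessResponse> processInvoice(@RequestParam("file") MultipartFile file) {
        String taskId = UUID.randomUUID().toString();
        // Each @Async method already returns a CompletableFuture, so the stages are chained
        // directly; wrapping the first call in supplyAsync would produce a nested future.
        // Caution: the response is sent before preprocessing runs, and the container may
        // clean up the multipart temp file once the request completes, so copying the
        // upload to storage before returning is the safer design.
        preprocessService.preprocessInvoice(file)
                .thenCompose(preprocessService::performOcr)
                .thenCompose(extractionService::extractData)
                .thenAccept(data -> {
                    storageService.saveResult(taskId, data);
                    notificationService.notifyClient(taskId, data);
                })
                .exceptionally(ex -> {
                    errorService.logError(taskId, ex);
                    return null;
                });
        // 202 Accepted: the client polls or is notified using taskId.
        return ResponseEntity.accepted().body(new ProcessResponse(taskId, "Processing started"));
    }
}
Tesseract Deep Optimization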
3.1 Invoice‑Specific Training Model
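The training workflow generates box files, trains font features, extracts the character set, clusters the features, and combines the data into a final model. These commands follow Tesseract's legacy (pre-LSTM) training flow; the LSTM engine that Tesseract 4 and 5 use by default is trained with a separate toolchain (lstmtraining).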
# Generate BOX file
tesseract invoice_001.png invoice_001 -l chi_sim batch.nochop makebox
# Train font features
tesseract invoice_001.png invoice_001 nobatch box.train
# Extract character set
unicharset_extractor invoice_001.box
# Cluster features
shapeclustering -F font_properties -U unicharset invoice_001.tr
# Combine into final model
combine_tessdata invoice.
3.2 Image Pre‑processing Enhancement
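import java.awt.image.BufferedImage;
import org.opencv.core.Mat;
import org.opencv.imgproc.Imgproc;
import org.opencv.photo.Photo;

public class ImagePreprocessor {
    // toGrayscale, enhanceLines, deskew, bufferedImageToMat and matToBufferedImage are
    // helper methods the article does not show.

    public BufferedImage preprocess(BufferedImage original) {
        BufferedImage gray = toGrayscale(original);
        BufferedImage binary = adaptiveThreshold(gray);
        BufferedImage denoised = denoise(binary);
        BufferedImage enhanced = enhanceLines(denoised);
        return deskew(enhanced);
    }

    private BufferedImage adaptiveThreshold(BufferedImage gray) {
        Mat src = bufferedImageToMat(gray); // must be 8-bit, single-channel
        Mat dst = new Mat();
        // Gaussian-weighted local threshold over an 11x11 neighbourhood, offset C = 2
        Imgproc.adaptiveThreshold(src, dst, 255, Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C,
                Imgproc.THRESH_BINARY, 11, 2);
        return matToBufferedImage(dst);
    }

    private BufferedImage denoise(BufferedImage image) {
        Mat src = bufferedImageToMat(image);
        Mat dst = new Mat();
        // Non-local means: filter strength h = 30, 7x7 template window, 21x21 search window
        Photo.fastNlMeansDenoising(src, dst, 30, 7, 21);
        return matToBufferedImage(dst);
    }
}
3.3 Multi‑Engine Fusion Recognition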
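public class HybridOcrService {
    // segmentRegions, isHandwritingRegion, bufferedImageToMat and the three engine
    // fields (tableOcrEngine, handwritingEngine, tesseract) are not shown in the article.

    public String recognize(File image) {
        List<BufferedImage> regions = segmentRegions(image);
        // Route each region to the engine best suited to it, then join the results line by line.
        return regions.stream()
                .map(region -> {
                    if (isTableRegion(region)) {
                        return tableOcrEngine.recognize(region);
                    } else if (isHandwritingRegion(region)) {
                        return handwritingEngine.recognize(region);
                    } else {
                        return tesseract.recognize(region);
                    }
                })
                .collect(Collectors.joining("\n"));
    }

    private boolean isTableRegion(BufferedImage image) {
        Mat mat = bufferedImageToMat(image); // HoughLinesP expects an 8-bit, single-channel binary image
        Mat lines = new Mat();
        // rho = 1 px, theta = 1 degree, 50 votes, minimum line length 50 px, maximum gap 10 px
        Imgproc.HoughLinesP(mat, lines, 1, Math.PI / 180, 50, 50, 10);
        // Heuristic: more than five detected line segments suggests a ruled table.
        return lines.rows() > 5;
    }
}
Structured Data Extraction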
4.1 Multi‑Strategy Extraction Framework
public class DataExtractionEngine {
    private final List<ExtractionStrategy> strategies = Arrays.asList(
            new RegexStrategy(),
            new PositionalStrategy(),
            new MLBasedStrategy()
    );

    public InvoiceData extract(String ocrText) {
        InvoiceData result = new InvoiceData();
        for (ExtractionStrategy strategy : strategies) {
            strategy.extract(ocrText, result);
            if (result.isComplete()) {
                break;
            }
        }
        return result;
    }
}
4.2 Regex & Rule Engine
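import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexStrategy implements ExtractionStrategy {
    // Field labels as printed on Chinese invoices:
    // 发票号码 = invoice number, 开票日期 = issue date, 合计金额 = total amount.
    // Each label may be followed by either a full-width or an ASCII colon.
    private static final Map<String, Pattern> PATTERNS = Map.of(
            "invoiceNumber", Pattern.compile("发票号码[::]\\s*(\\w{8,12})"),
            "invoiceDate", Pattern.compile("开票日期[::]\\s*(\\d{4}年\\d{2}月\\d{2}日)"),
            "totalAmount", Pattern.compile("合计金额[::]\\s*(¥?\\d+\\.\\d{2})")
    );

    @Override
    public void extract(String text, InvoiceData data) {
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            if (matcher.find()) {
                // setDataField (not shown in the article) maps the key to the matching setter.
                setDataField(data, entry.getKey(), matcher.group(1));
            }
        }
    }
}
4.3 Machine‑Learning Validation Model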
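# BERT-based semantic validation
import torch
from transformers import BertTokenizer, BertForSequenceClassification


class InvoiceValidator:
    def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
        # 'invoice-validator' is the article's name for the classifier, presumably a
        # locally fine-tuned checkpoint.
        self.model = BertForSequenceClassification.from_pretrained('invoice-validator')

    def validate(self, field, value, context):
        # The prompt reads: "Invoice {field} is {value}, context: {context}"
        prompt = f"发票{field}是{value},上下文:{context}"
        inputs = self.tokenizer(prompt, return_tensors="pt")
        with torch.no_grad():  # inference only, no gradients needed
            outputs = self.model(**inputs)
        logits = outputs.logits
        # Accept the field when the probability of class 1 exceeds the confidence threshold.
        return torch.softmax(logits, dim=1)[0][1].item() > 0.8
Performance Optimization Strategies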
5.1 Distributed OCR Cluster
5.2 Cache Optimization
Image preprocessing results – Redis – hit rate 40‑60 % – reduces processing time by 30 %
OCR recognition results – Caffeine – hit rate 25‑35 % – cuts OCR calls by 50 %
Template matching rules – Hazelcast – hit rate 70‑80 % – speeds extraction 3×
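One way to picture the OCR-result cache in the list above is to key each result by a hash of the image bytes, so that a re-uploaded invoice skips recognition. The list names Caffeine for this layer; to stay self-contained, the sketch below substitutes a small LRU map from the JDK. `OcrResultCache`, `recognize`, and the fake engine are hypothetical names introduced for illustration.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

/** Content-addressed LRU cache in front of an OCR call (JDK-only stand-in for Caffeine). */
final class OcrResultCache {
    private final Map<String, String> lru;
    private int engineCalls = 0;

    OcrResultCache(final int maxEntries) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.lru = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Hex-encoded SHA-256 of the image bytes, used as the cache key. */
    static String sha256Hex(byte[] data) {
        try {
            StringBuilder hex = new StringBuilder();
            for (byte b : MessageDigest.getInstance("SHA-256").digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    /** Returns cached text for identical bytes; otherwise calls the engine and stores the result. */
    synchronized String recognize(byte[] imageBytes, Function<byte[], String> ocrEngine) {
        String key = sha256Hex(imageBytes);
        String cached = lru.get(key);
        if (cached != null) {
            return cached;
        }
        engineCalls++;
        String text = ocrEngine.apply(imageBytes);
        lru.put(key, text);
        return text;
    }

    synchronized int engineCalls() {
        return engineCalls;
    }

    public static void main(String[] args) {
        OcrResultCache cache = new OcrResultCache(1000);
        // Hypothetical engine: stands in for a real Tesseract call.
        Function<byte[], String> fakeEngine = bytes -> "text-for-" + bytes.length + "-bytes";
        byte[] invoice = {1, 2, 3};
        cache.recognize(invoice, fakeEngine);
        cache.recognize(invoice.clone(), fakeEngine); // same content, different array: cache hit
        System.out.println("engine calls: " + cache.engineCalls()); // prints "engine calls: 1"
    }
}
```

Hashing the content rather than the file name means two uploads of the same scan share one entry, at the cost of one SHA-256 pass per request, which is small next to a multi-second OCR call.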
5.3 Hardware Acceleration
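public class GpuOcrEngine {
    // Conceptual sketch only: CUDA, CUdeviceptr, convertToGpuBuffer, preprocessOnGpu and
    // tesseractGpu are placeholders. Tess4J does not provide a CUDA-backed recognizer, so
    // the GPU stages would need a custom or third-party engine.
    public String recognize(BufferedImage image) {
        CUDA.setDevice(0);                                // select the first GPU
        CUdeviceptr imagePtr = convertToGpuBuffer(image); // upload pixels to device memory
        preprocessOnGpu(imagePtr);                        // thresholding / denoising on the GPU
        return tesseractGpu.recognize(imagePtr);
    }
}
Production Environment Deployment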
6.1 Kubernetes Deployment
# ocr-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ocr-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: ocr-worker
  template:
    metadata:
      labels:
        app: ocr-worker
    spec:
      containers:
        - name: ocr
          image: ocr-service:3.0
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: 8Gi
            requests:
              memory: 4Gi
          env:
            - name: TESSDATA_PREFIX
              value: /tessdata
          volumeMounts:
            - name: tessdata
              mountPath: /tessdata
      volumes:
        - name: tessdata
          persistentVolumeClaim:
            claimName: tessdata-pvc
---
# GPU node priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000
globalDefault: false
description: "High‑priority GPU tasks"
6.2 Monitoring & Alerting
# Prometheus metrics
- name: ocr_processing_time
  type: histogram
  help: OCR processing time distribution
  buckets: [0.5, 1, 2, 5, 10]
- name: extraction_accuracy
  type: gauge
  help: Field extraction accuracy

# Grafana panel example
panel:
  title: System Throughput
  type: graph
  datasource: prometheus
  targets:
    - expr: sum(rate(ocr_processed_total[5m]))
      legend: Processing Speed
Security and Compliance
7.1 Data Security Architecture
7.2 Compliance Design
GDPR compliance: automatic PII detection, data erasure API
Financial compliance: conforms to Chinese electronic invoice regulations, supports tax authority verification
Audit tracing: full‑process operation logs, blockchain notarization of critical actions
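The audit-tracing item above depends on tamper evidence. As a rough illustration of that idea only (this is not a blockchain, and not necessarily the notarization mechanism the article has in mind), the sketch below links each log entry to the hash of the previous one so that any later modification is detectable. All names (`AuditChain`, `LogEntry`, the `GENESIS` seed) are assumptions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;

/** Hash-chained audit log: every entry commits to the hash of the entry before it. */
final class AuditChain {
    static final class LogEntry {
        final String action;
        final String prevHash;
        final String hash;

        LogEntry(String action, String prevHash, String hash) {
            this.action = action;
            this.prevHash = prevHash;
            this.hash = hash;
        }
    }

    private static final String GENESIS = "GENESIS"; // seed value for the first entry (assumed)
    private final List<LogEntry> entries = new ArrayList<>();

    static String sha256Hex(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }

    private static String link(String prevHash, String action) {
        return sha256Hex(prevHash + "|" + action);
    }

    void append(String action) {
        String prev = entries.isEmpty() ? GENESIS : entries.get(entries.size() - 1).hash;
        entries.add(new LogEntry(action, prev, link(prev, action)));
    }

    List<LogEntry> entries() {
        return entries;
    }

    /** Recomputes every link; returns false if any entry was altered, removed or reordered. */
    static boolean verify(List<LogEntry> chain) {
        String prev = GENESIS;
        for (LogEntry e : chain) {
            if (!e.prevHash.equals(prev) || !e.hash.equals(link(prev, e.action))) {
                return false;
            }
            prev = e.hash;
        }
        return true;
    }

    public static void main(String[] args) {
        AuditChain log = new AuditChain();
        log.append("UPLOAD invoice 42");
        log.append("OCR invoice 42");
        log.append("EXPORT invoice 42");
        System.out.println("intact: " + verify(log.entries())); // prints "intact: true"

        // Rewrite the middle action but keep its stored hashes: verification now fails.
        List<LogEntry> tampered = new ArrayList<>(log.entries());
        LogEntry old = tampered.get(1);
        tampered.set(1, new LogEntry("DELETE invoice 42", old.prevHash, old.hash));
        System.out.println("after tampering: " + verify(tampered)); // prints "after tampering: false"
    }
}
```

Publishing only the latest hash to an external ledger would then be enough to vouch for the whole history, since every earlier entry is folded into it.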
Testing and Validation
8.1 Chaos Engineering Test
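public class ChaosTest {
    // ChaosMonkey, loadTester and errorRate are the article's placeholders, not a
    // specific library's API; errorRate is assumed to be populated by the load tester.

    @Test
    public void testOcrPipelineResilience() {
        // Simulate latency 500‑2000 ms and 10 % error rate
        ChaosMonkey.enable()
                .latency(500, 2000)
                .exceptionRate(0.1)
                .enable();
        try {
            // Run 1000‑concurrent load test
            loadTester.run(1000);
            // Verify error rate < 5 %
            assertTrue("Error rate should stay below 5%", errorRate < 0.05);
        } finally {
            // Always switch fault injection off, even when the assertion fails.
            ChaosMonkey.disable();
        }
    }
}
8.2 Accuracy Verification Matrix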
VAT ordinary invoice – 10,000 samples – OCR accuracy 98.7 % – field accuracy 96.2 %
VAT special invoice – 8,500 samples – OCR accuracy 97.5 % – field accuracy 95.8 %
E‑invoice – 12,000 samples – OCR accuracy 99.1 % – field accuracy 97.3 %
Handwritten invoice – 3,000 samples – OCR accuracy 85.2 % – field accuracy 79.6 %
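To read the matrix as a single headline figure, the per-type accuracies can be combined into a sample-weighted mean. The following sketch does that arithmetic with the numbers above; the class and method names are illustrative.

```java
import java.util.Locale;

/** Sample-weighted accuracy across invoice types. */
final class AccuracyMatrix {
    /** Returns sum(samples[i] * accuracyPct[i]) / sum(samples[i]), in percent. */
    static double weightedMean(int[] samples, double[] accuracyPct) {
        if (samples.length != accuracyPct.length) {
            throw new IllegalArgumentException("arrays must have the same length");
        }
        long total = 0;
        double weighted = 0.0;
        for (int i = 0; i < samples.length; i++) {
            total += samples[i];
            weighted += samples[i] * accuracyPct[i];
        }
        if (total == 0) {
            throw new IllegalArgumentException("no samples");
        }
        return weighted / total;
    }

    public static void main(String[] args) {
        // Rows: VAT ordinary, VAT special, e-invoice, handwritten
        int[] samples = {10_000, 8_500, 12_000, 3_000};
        double[] ocrPct = {98.7, 97.5, 99.1, 85.2};
        double[] fieldPct = {96.2, 95.8, 97.3, 79.6};
        System.out.printf(Locale.ROOT, "OCR accuracy:   %.2f %%%n", weightedMean(samples, ocrPct));   // 97.33 %
        System.out.printf(Locale.ROOT, "Field accuracy: %.2f %%%n", weightedMean(samples, fieldPct)); // 95.01 %
    }
}
```

Across the 33,500 samples the weighted means come to about 97.33 % for OCR and 95.01 % for field extraction, both below the three printed-invoice rows because the 3,000 handwritten samples pull the average down.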
Evolution and Future Directions
9.1 Intelligent Evolution Paths
Self‑learning OCR: continuous model refinement on new invoices
Cross‑chain notarization: hash invoices onto Hyperledger/Ethereum for legal evidence
Intelligent audit: anomaly detection and tax‑risk warnings via AI
9.2 Performance Evolution Goals
Processing speed: current 2.5 s/page → target 0.8 s/page (FPGA acceleration)
Accuracy: current 96 % → target 99.5 % (integrate PaddleOCR)
Concurrency: current 100 pages/s → target 500 pages/s (distributed cluster)
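The speed and concurrency goals above can be related through Little's law (pages in flight = throughput × latency). The sketch below works through that arithmetic with the stated figures; the 70 % utilization used for worker sizing is an assumption, not a number from the article, and the model assumes each worker handles one page at a time.

```java
import java.util.Locale;

/** Back-of-the-envelope capacity arithmetic based on Little's law. */
final class CapacityPlan {
    /** Average pages in flight = throughput (pages/s) x latency (s/page). */
    static double pagesInFlight(double pagesPerSecond, double secondsPerPage) {
        return pagesPerSecond * secondsPerPage;
    }

    /** Workers needed when each worker processes one page at a time at the given utilization (0..1]. */
    static long workersNeeded(double pagesPerSecond, double secondsPerPage, double utilization) {
        if (utilization <= 0 || utilization > 1) {
            throw new IllegalArgumentException("utilization must be in (0, 1]");
        }
        return (long) Math.ceil(pagesInFlight(pagesPerSecond, secondsPerPage) / utilization);
    }

    /** Latency improvement factor between two per-page times. */
    static double speedup(double beforeSeconds, double afterSeconds) {
        return beforeSeconds / afterSeconds;
    }

    public static void main(String[] args) {
        // Current: 100 pages/s at 2.5 s/page. Target: 500 pages/s at 0.8 s/page.
        System.out.printf(Locale.ROOT, "current pages in flight: %.0f%n", pagesInFlight(100, 2.5)); // 250
        System.out.printf(Locale.ROOT, "target pages in flight:  %.0f%n", pagesInFlight(500, 0.8)); // 400
        System.out.printf(Locale.ROOT, "latency speedup needed:  %.3f%n", speedup(2.5, 0.8));       // 3.125
        // Assumed 70 % utilization to leave headroom for bursts.
        System.out.println("current workers at 70% utilization: " + workersNeeded(100, 2.5, 0.7)); // 358
    }
}
```

Read this way, the targets raise throughput fivefold while the number of pages in flight grows only from 250 to 400, which is why the latency reduction matters as much as adding nodes.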
Conclusion
This solution builds a high‑performance OCR invoice pipeline using Tesseract and Spring Boot asynchronous processing, leveraging distributed architecture, GPU acceleration, intelligent extraction, and robust DevOps practices to achieve million‑scale daily throughput with high availability, accuracy, and compliance.
Architect
Professional architect sharing high‑quality architecture insights. Topics include high‑availability, high‑performance, high‑stability architectures, big data, machine learning, Java, system and distributed architecture, AI, and practical large‑scale architecture case studies. Open to ideas‑driven architects who enjoy sharing and learning.
