
Build a Scalable High‑Performance OCR Invoice Pipeline with Spring Boot & Tesseract

This article presents a comprehensive, high‑throughput OCR invoice processing solution that combines distributed system design, Spring Boot asynchronous execution, Tesseract deep optimization, multi‑engine fusion, structured data extraction, performance tuning, Kubernetes deployment, and security compliance.

Architect

Aug 16, 2025

System Architecture Design

Distributed pipeline architecture for scalable processing

Core components: API Gateway, File Pre‑processing, OCR Engine, Data Extraction, Message Queue, Storage System

Data flow design ensuring end‑to‑end throughput and reliability

1.1 Distributed Pipeline Architecture

1.2 Core Component Responsibilities

API Gateway (Spring Cloud Gateway): request routing, rate limiting – supports 5000+ TPS

File Pre‑processing (OpenCV + ImageMagick): format conversion, denoising, enhancement – 100 ms per image

OCR Engine (Tesseract 5.3): text recognition – average 1.5 s per page

Data Extraction (Rule Engine + ML Model): structured data extraction – accuracy >96 %

Message Queue (RabbitMQ): task distribution, spike smoothing – >100 k messages/sec

Storage System (MinIO + MySQL): file and metadata storage – PB‑level capacity

1.3 Data Flow Design

Spring Boot Asynchronous Framework Implementation

2.1 Thread Pool Optimization

@Configuration
@EnableAsync
public class AsyncConfig {
    @Bean("ocrExecutor")
    public Executor ocrTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(20);
        executor.setMaxPoolSize(50);
        executor.setQueueCapacity(1000);
        executor.setThreadNamePrefix("OCR-Thread-");
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        executor.initialize();
        return executor;
    }

    @Bean("ioExecutor")
    public Executor ioTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(50);
        executor.setMaxPoolSize(200);
        executor.setQueueCapacity(5000);
        executor.setThreadNamePrefix("IO-Thread-");
        executor.initialize();
        return executor;
    }
}
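
The `CallerRunsPolicy` on `ocrExecutor` acts as simple back-pressure: once the workers and the queue are full, the submitting thread runs the task itself and so cannot submit more work until it finishes. The sketch below reproduces that behavior with a plain JDK `ThreadPoolExecutor` and a deliberately tiny pool. The class and method names are illustrative assumptions and not part of the article's code.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

class CallerRunsDemo {

    /** Builds a bounded pool with the same rejection policy as ocrExecutor. */
    static ThreadPoolExecutor boundedPool(int core, int max, int queueCapacity) {
        return new ThreadPoolExecutor(
                core, max, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    /** Saturates a 1-thread, 1-slot pool and reports which thread ran the overflow task. */
    static String threadThatRunsOverflowTask() {
        ThreadPoolExecutor pool = boundedPool(1, 1, 1);
        CountDownLatch workerBusy = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        String[] ranOn = new String[1];
        try {
            // Task 1 occupies the only worker thread until released.
            pool.execute(() -> {
                workerBusy.countDown();
                awaitQuietly(release);
            });
            awaitQuietly(workerBusy);
            // Task 2 fills the single queue slot.
            pool.execute(() -> { });
            // Task 3 is rejected, so CallerRunsPolicy runs it on the submitting thread.
            pool.execute(() -> ranOn[0] = Thread.currentThread().getName());
            return ranOn[0];
        } finally {
            release.countDown();
            pool.shutdown();
        }
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

Calling `CallerRunsDemo.threadThatRunsOverflowTask()` returns the name of the calling thread rather than a pool thread. Note that `ioExecutor` above sets no rejection handler, so it keeps the default policy, which rejects overflow tasks with an exception instead.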

2.2 Asynchronous Service Layer

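import java.awt.image.BufferedImage;
import java.nio.file.Path;
import java.util.concurrent.CompletableFuture;
import net.sourceforge.tess4j.ITessAPI.TessPageIteratorLevel;
import net.sourceforge.tess4j.ITessAPI.TessPageSegMode;
import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;
import net.sourceforge.tess4j.Word;
import org.springframework.scheduling.annotation.Async;
import org.springframework.stereotype.Service;
import org.springframework.web.multipart.MultipartFile;

// Collaborators (storageService, imageConverter, imageEnhancer, regexExtractor,
// dataValidator, mlValidator) and SUPPORTED_TYPES are assumed to be defined elsewhere.
// @Async only takes effect when these methods are called through the Spring proxy,
// i.e. from another bean.
@Service
public class InvoiceProcessingService {
    @Async("ioExecutor")
    public CompletableFuture<BufferedImage> preprocessInvoice(MultipartFile file) {
        // 1. Detect file type
        String contentType = file.getContentType();
        if (!SUPPORTED_TYPES.contains(contentType)) {
            throw new UnsupportedFileTypeException();
        }
        // 2. Store raw file
        Path rawPath = storageService.store(file);
        // 3. Convert format (e.g., PDF → JPG)
        Path processedPath = imageConverter.convert(rawPath);
        // 4. Image enhancement; the enhanced image is what the OCR stage consumes
        BufferedImage enhancedImage = imageEnhancer.enhance(processedPath);
        return CompletableFuture.completedFuture(enhancedImage);
    }

    @Async("ocrExecutor")
    public CompletableFuture<OcrResult> performOcr(BufferedImage image) {
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("/tessdata");
        tesseract.setLanguage("chi_sim+eng");
        tesseract.setPageSegMode(TessPageSegMode.PSM_AUTO);
        try {
            String text = tesseract.doOCR(image);
            // Average the per-word confidence reported by Tesseract
            double confidence = tesseract.getWords(image, TessPageIteratorLevel.RIL_WORD).stream()
                    .mapToDouble(Word::getConfidence)
                    .average().orElse(0);
            return CompletableFuture.completedFuture(new OcrResult(text, confidence));
        } catch (TesseractException e) {
            // Rethrow unchecked so the method can be used in thenCompose chains
            throw new IllegalStateException("OCR failed", e);
        }
    }

    @Async("ioExecutor")
    public CompletableFuture<InvoiceData> extractData(OcrResult ocrResult) {
        InvoiceData data = regexExtractor.extract(ocrResult.getText());
        if (dataValidator.requiresMlCheck(data)) {
            data = mlValidator.validate(data);
        }
        data.setOcrConfidence(ocrResult.getConfidence());
        // Records the completion timestamp (epoch milliseconds), not a duration
        data.setProcessingTime(System.currentTimeMillis());
        return CompletableFuture.completedFuture(data);
    }
}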

2.3 Asynchronous Pipeline Orchestration

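@RestController
@RequestMapping("/invoice")
public class InvoiceController {
    @PostMapping("/process")
    public ResponseEntity<ProcessResponse> processInvoice(@RequestParam("file") MultipartFile file) {
        String taskId = UUID.randomUUID().toString();
        // preprocessInvoice is already @Async and returns a CompletableFuture, so chain on it
        // directly; wrapping it in supplyAsync would produce a nested future that thenCompose
        // cannot pass to performOcr.
        // Caveat: the upload's temporary storage may be cleaned up once this request returns,
        // so production code should copy the bytes before handing off to another thread.
        preprocessService.preprocessInvoice(file)
                .thenCompose(preprocessService::performOcr)
                .thenCompose(extractionService::extractData)
                .thenAccept(data -> {
                    storageService.saveResult(taskId, data);
                    notificationService.notifyClient(taskId, data);
                })
                .exceptionally(ex -> {
                    // Any failure in an earlier stage ends up here
                    errorService.logError(taskId, ex);
                    return null;
                });
        // Respond immediately with 202 Accepted; the client tracks progress by taskId
        return ResponseEntity.accepted().body(new ProcessResponse(taskId, "Processing started"));
    }
}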
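
The orchestration pattern can be tried in isolation with plain JDK futures. The following is a minimal sketch under stated assumptions: strings stand in for the upload, the image, and the extracted data, and two small daemon pools stand in for `ioExecutor` and `ocrExecutor`. `PipelineSketch` and its method bodies are hypothetical. It shows how `thenCompose` flattens each asynchronous stage and how a single handler at the end sees a failure from any stage.

```java
import java.util.Locale;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PipelineSketch {
    // Stand-ins for the article's ioExecutor and ocrExecutor beans.
    private static final ExecutorService IO = daemonPool(2);
    private static final ExecutorService OCR = daemonPool(2);

    private static ExecutorService daemonPool(int size) {
        return Executors.newFixedThreadPool(size, runnable -> {
            Thread thread = new Thread(runnable);
            thread.setDaemon(true); // do not keep the JVM alive for the demo
            return thread;
        });
    }

    static CompletableFuture<String> preprocess(String rawFile) {
        return CompletableFuture.supplyAsync(() -> {
            if (rawFile == null || rawFile.isBlank()) {
                throw new IllegalArgumentException("empty upload");
            }
            return rawFile.trim();
        }, IO);
    }

    static CompletableFuture<String> performOcr(String image) {
        return CompletableFuture.supplyAsync(() -> "text-of-" + image, OCR);
    }

    static CompletableFuture<String> extractData(String text) {
        return CompletableFuture.supplyAsync(() -> text.toUpperCase(Locale.ROOT), IO);
    }

    /** Chains the three stages; the final handler maps success or failure to a status string. */
    static CompletableFuture<String> process(String rawFile) {
        return preprocess(rawFile)
                .thenCompose(PipelineSketch::performOcr)
                .thenCompose(PipelineSketch::extractData)
                .handle((data, ex) -> {
                    if (ex == null) {
                        return "DONE:" + data;
                    }
                    // Failures from earlier stages arrive wrapped in CompletionException.
                    Throwable cause = (ex instanceof CompletionException && ex.getCause() != null)
                            ? ex.getCause() : ex;
                    return "FAILED:" + cause.getMessage();
                });
    }
}
```

When `preprocess` fails, the two `thenCompose` stages are skipped and the exception travels straight to the final handler, which is the same role `exceptionally` plays in the controller.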

Tesseract Deep Optimization

3.1 Invoice‑Specific Training Model

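The training workflow below follows the box-file procedure of the legacy Tesseract engine: generate box files, train font features, extract the character set, cluster features, and combine the data into a final model. The LSTM engine that is the default since Tesseract 4 is trained with different tooling (lstmtraining), so these steps apply as written only to legacy-engine models.

# Generate BOX file
tesseract invoice_001.png invoice_001 -l chi_sim batch.nochop makebox
# Train font features
tesseract invoice_001.png invoice_001 nobatch box.train
# Extract character set
unicharset_extractor invoice_001.box
# Cluster features (the full legacy workflow also runs mftraining and cntraining at this point)
shapeclustering -F font_properties -U unicharset invoice_001.tr
# Combine into final model
combine_tessdata invoice.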

3.2 Image Pre‑processing Enhancement

public class ImagePreprocessor {
    public BufferedImage preprocess(BufferedImage original) {
        BufferedImage gray = toGrayscale(original);
        BufferedImage binary = adaptiveThreshold(gray);
        BufferedImage denoised = denoise(binary);
        BufferedImage enhanced = enhanceLines(denoised);
        return deskew(enhanced);
    }
    private BufferedImage adaptiveThreshold(BufferedImage gray) {
        Mat src = bufferedImageToMat(gray);
        Mat dst = new Mat();
        Imgproc.adaptiveThreshold(src, dst, 255, Imgproc.ADAPTIVE_THRESH_GAUSSIAN_C,
                Imgproc.THRESH_BINARY, 11, 2);
        return matToBufferedImage(dst);
    }
    private BufferedImage denoise(BufferedImage image) {
        Mat src = bufferedImageToMat(image);
        Mat dst = new Mat();
        Photo.fastNlMeansDenoising(src, dst, 30, 7, 21);
        return matToBufferedImage(dst);
    }
}
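
To make the thresholding step concrete, here is a small pure-Java sketch of local-mean adaptive thresholding: a pixel becomes white when it is brighter than the mean of its neighborhood minus a constant. It is a simplified stand-in and not OpenCV's implementation: it uses an unweighted mean instead of the Gaussian weighting selected above, and it ignores out-of-bounds neighbors at the border. `AdaptiveThresholdSketch` and `meanThreshold` are hypothetical names.

```java
class AdaptiveThresholdSketch {

    /**
     * Binarizes a grayscale image given as gray[y][x] with values 0..255.
     * blockSize is the (odd) side length of the neighborhood; c is subtracted from the local mean.
     */
    static int[][] meanThreshold(int[][] gray, int blockSize, int c) {
        int height = gray.length;
        int width = gray[0].length;
        int radius = blockSize / 2;
        int[][] binary = new int[height][width];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                long sum = 0;
                int count = 0;
                // Average the neighbors that fall inside the image.
                for (int ny = Math.max(0, y - radius); ny <= Math.min(height - 1, y + radius); ny++) {
                    for (int nx = Math.max(0, x - radius); nx <= Math.min(width - 1, x + radius); nx++) {
                        sum += gray[ny][nx];
                        count++;
                    }
                }
                double localThreshold = (double) sum / count - c;
                binary[y][x] = gray[y][x] > localThreshold ? 255 : 0;
            }
        }
        return binary;
    }
}
```

Because the threshold follows the local brightness, a dark character stroke on a light background is set to black while background pixels right next to it stay white, even if the page is unevenly lit.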

3.3 Multi‑Engine Fusion Recognition

public class HybridOcrService {
    public String recognize(File image) {
        List<BufferedImage> regions = segmentRegions(image);
        return regions.stream()
                .map(region -> {
                    if (isTableRegion(region)) {
                        return tableOcrEngine.recognize(region);
                    } else if (isHandwritingRegion(region)) {
                        return handwritingEngine.recognize(region);
                    } else {
                        return tesseract.recognize(region);
                    }
                })
                .collect(Collectors.joining("\n"));
    }
    private boolean isTableRegion(BufferedImage image) {
        Mat mat = bufferedImageToMat(image);
        Mat lines = new Mat();
        Imgproc.HoughLinesP(mat, lines, 1, Math.PI/180, 50, 50, 10);
        return lines.rows() > 5;
    }
}

Structured Data Extraction

4.1 Multi‑Strategy Extraction Framework

public class DataExtractionEngine {
    private final List<ExtractionStrategy> strategies = Arrays.asList(
            new RegexStrategy(),
            new PositionalStrategy(),
            new MLBasedStrategy()
    );
    public InvoiceData extract(String ocrText) {
        InvoiceData result = new InvoiceData();
        for (ExtractionStrategy strategy : strategies) {
            strategy.extract(ocrText, result);
            if (result.isComplete()) {
                break;
            }
        }
        return result;
    }
}

4.2 Regex & Rule Engine

public class RegexStrategy implements ExtractionStrategy {
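    // Field labels as printed on Chinese invoices; \\s must be double-escaped in a Java string
    private static final Map<String, Pattern> PATTERNS = Map.of(
            // 发票号码 = invoice number
            "invoiceNumber", Pattern.compile("发票号码[:：]\\s*(\\w{8,12})"),
            // 开票日期 = issue date, formatted as yyyy年MM月dd日
            "invoiceDate", Pattern.compile("开票日期[:：]\\s*(\\d{4}年\\d{2}月\\d{2}日)"),
            // 合计金额 = total amount, with an optional ¥ prefix
            "totalAmount", Pattern.compile("合计金额[:：]\\s*(¥?\\d+\\.\\d{2})")
    );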
    @Override
    public void extract(String text, InvoiceData data) {
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher matcher = entry.getValue().matcher(text);
            if (matcher.find()) {
                setDataField(data, entry.getKey(), matcher.group(1));
            }
        }
    }
}
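
The patterns can be exercised without the rest of the framework. The sketch below is a self-contained variant that returns the matches as a map instead of writing into `InvoiceData`. The class name `InvoiceRegexSketch` and the sample text used with it are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvoiceRegexSketch {
    private static final Map<String, Pattern> PATTERNS = new LinkedHashMap<>();

    static {
        // Same expressions as RegexStrategy, with \\s correctly escaped.
        PATTERNS.put("invoiceNumber", Pattern.compile("发票号码[:：]\\s*(\\w{8,12})"));
        PATTERNS.put("invoiceDate", Pattern.compile("开票日期[:：]\\s*(\\d{4}年\\d{2}月\\d{2}日)"));
        PATTERNS.put("totalAmount", Pattern.compile("合计金额[:：]\\s*(¥?\\d+\\.\\d{2})"));
    }

    /** Returns every field whose pattern matches; fields that do not match are simply absent. */
    static Map<String, String> extract(String ocrText) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> entry : PATTERNS.entrySet()) {
            Matcher matcher = entry.getValue().matcher(ocrText);
            if (matcher.find()) {
                fields.put(entry.getKey(), matcher.group(1));
            }
        }
        return fields;
    }
}
```

For the made-up text `"发票号码: 12345678\n开票日期：2025年08月16日\n合计金额: ¥1234.50"`, `extract` yields the three fields `12345678`, `2025年08月16日`, and `¥1234.50`. Both the ASCII colon and the full-width colon are accepted, which matters because OCR output tends to mix the two.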

4.3 Machine‑Learning Validation Model

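# BERT-based semantic validation
import torch
from transformers import BertTokenizer, BertForSequenceClassification

class InvoiceValidator:
    def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
        # 'invoice-validator' is the fine-tuned checkpoint referenced by the article
        self.model = BertForSequenceClassification.from_pretrained('invoice-validator')
        self.model.eval()  # inference mode: disables dropout

    def validate(self, field, value, context):
        # Prompt (Chinese): "Invoice {field} is {value}, context: {context}"
        prompt = f"发票{field}是{value}，上下文:{context}"
        inputs = self.tokenizer(prompt, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            logits = self.model(**inputs).logits
        # Probability of class 1 (assumed to be the "valid" label) must exceed the threshold
        return torch.softmax(logits, dim=1)[0][1].item() > 0.8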
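
The final line of the validator is plain arithmetic: softmax turns the two logits into probabilities, and the field is accepted when the probability of class 1 exceeds 0.8. A Java sketch of that calculation follows, assuming (as the Python code implies) that index 1 is the "valid" class. `SoftmaxThresholdSketch` is a hypothetical name.

```java
class SoftmaxThresholdSketch {

    /** Numerically stable softmax: subtracting the maximum avoids overflow in exp. */
    static double[] softmax(double[] logits) {
        double max = Double.NEGATIVE_INFINITY;
        for (double logit : logits) {
            max = Math.max(max, logit);
        }
        double[] probabilities = new double[logits.length];
        double sum = 0;
        for (int i = 0; i < logits.length; i++) {
            probabilities[i] = Math.exp(logits[i] - max);
            sum += probabilities[i];
        }
        for (int i = 0; i < probabilities.length; i++) {
            probabilities[i] /= sum;
        }
        return probabilities;
    }

    /** Accepts the field when the probability of class 1 exceeds the threshold. */
    static boolean isValid(double[] logits, double threshold) {
        return softmax(logits)[1] > threshold;
    }
}
```

With two classes the probability of class 1 is 1 / (1 + e^(l0 - l1)), so logits of (0, 2) give about 0.88 and pass the 0.8 threshold, while (0, 1) give about 0.73 and fail.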

Performance Optimization Strategies

5.1 Distributed OCR Cluster

5.2 Cache Optimization

Image preprocessing results – Redis – hit rate 40‑60 % – reduces processing time by 30 %

OCR recognition results – Caffeine – hit rate 25‑35 % – cuts OCR calls by 50 %

Template matching rules – Hazelcast – hit rate 70‑80 % – speeds extraction 3×
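
One way to cache OCR results is to key them by a hash of the image bytes, so that a re-uploaded invoice skips recognition entirely. The sketch below illustrates the idea with a small JDK-only LRU map standing in for Caffeine. The class `OcrResultCacheSketch`, the SHA-256 key choice, and the `Function` used as the OCR engine are assumptions for illustration.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

class OcrResultCacheSketch {
    private final Map<String, String> lru;
    private final Function<byte[], String> ocrEngine;
    private int engineCalls;

    OcrResultCacheSketch(int maxEntries, Function<byte[], String> ocrEngine) {
        this.ocrEngine = ocrEngine;
        // accessOrder=true makes iteration order least-recently-used first.
        this.lru = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Returns the cached text for identical image bytes, calling the engine only on a miss. */
    synchronized String recognize(byte[] imageBytes) {
        String key = sha256(imageBytes);
        String cached = lru.get(key);
        if (cached != null) {
            return cached;
        }
        engineCalls++;
        String text = ocrEngine.apply(imageBytes);
        lru.put(key, text);
        return text;
    }

    synchronized int engineCalls() {
        return engineCalls;
    }

    static String sha256(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            return Base64.getEncoder().encodeToString(digest);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required on every Java platform", e);
        }
    }
}
```

Hashing the content rather than the file name means the same scan uploaded under two names still hits the cache. The trade-off is that any re-encoding of the image changes the hash and causes a miss.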

5.3 Hardware Acceleration

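// Illustrative pseudocode: CUDA, convertToGpuBuffer, preprocessOnGpu and tesseractGpu
// are placeholders for a custom GPU integration; they are not part of Tess4J.
public class GpuOcrEngine {
    public String recognize(BufferedImage image) {
        CUDA.setDevice(0);                                 // select the first GPU
        CUdeviceptr imagePtr = convertToGpuBuffer(image);  // upload the image to device memory
        preprocessOnGpu(imagePtr);                         // run preprocessing on the GPU
        return tesseractGpu.recognize(imagePtr);           // hand off to the GPU-side recognizer
    }
}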

Production Environment Deployment

6.1 Kubernetes Deployment

# ocr-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ocr-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: ocr-worker
  template:
    metadata:
      labels:
        app: ocr-worker
    spec:
      containers:
      - name: ocr
        image: ocr-service:3.0
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 8Gi
          requests:
            memory: 4Gi
        env:
        - name: TESSDATA_PREFIX
          value: /tessdata
        volumeMounts:
        - name: tessdata
          mountPath: /tessdata
      volumes:
      - name: tessdata
        persistentVolumeClaim:
          claimName: tessdata-pvc
---
# GPU node priority class
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000
globalDefault: false
description: "High‑priority GPU tasks"

6.2 Monitoring & Alerting

# Prometheus metrics
- name: ocr_processing_time
  type: histogram
  help: OCR processing time distribution
  buckets: [0.5, 1, 2, 5, 10]
- name: extraction_accuracy
  type: gauge
  help: Field extraction accuracy
# Grafana panel example
panel:
  title: System Throughput
  type: graph
  datasource: prometheus
  targets:
    - expr: sum(rate(ocr_processed_total[5m]))
      legend: Processing Speed
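
Prometheus histograms are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit +Inf bucket counts all of them. The sketch below shows how the `ocr_processing_time` buckets listed above would fill up. It is a hand-rolled illustration rather than the Prometheus client API, and `HistogramSketch` is a hypothetical name.

```java
class HistogramSketch {
    private final double[] upperBounds;      // ascending bucket bounds, in seconds
    private final long[] cumulativeCounts;   // one slot per bound plus a final +Inf slot
    private double sum;

    HistogramSketch(double... upperBounds) {
        this.upperBounds = upperBounds.clone();
        this.cumulativeCounts = new long[upperBounds.length + 1];
    }

    /** Records one observation in every bucket whose bound is at or above the value. */
    synchronized void observe(double seconds) {
        for (int i = 0; i < upperBounds.length; i++) {
            if (seconds <= upperBounds[i]) {
                cumulativeCounts[i]++;
            }
        }
        cumulativeCounts[upperBounds.length]++; // +Inf bucket counts everything
        sum += seconds;
    }

    /** Number of observations at or below the bound with the given index. */
    synchronized long bucket(int index) {
        return cumulativeCounts[index];
    }

    synchronized long total() {
        return cumulativeCounts[upperBounds.length];
    }

    synchronized double sum() {
        return sum;
    }
}
```

After observing 0.3 s, 1.5 s, 1.5 s and 12 s with bounds 0.5, 1, 2, 5, 10, the counts are 1, 1, 3, 3, 3 and 4 for +Inf. The 12 s page appears only in +Inf, which signals that the largest bound may be too low for slow documents.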

Security and Compliance

7.1 Data Security Architecture

7.2 Compliance Design

GDPR compliance: automatic PII detection, data erasure API

Financial compliance: conforms to Chinese electronic invoice regulations, supports tax authority verification

Audit tracing: full‑process operation logs, blockchain notarization of critical actions

Testing and Validation

8.1 Chaos Engineering Test

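public class ChaosTest {
    // ChaosMonkey, loadTester and getErrorRate() are illustrative helpers, not a specific library API
    @Test
    public void testOcrPipelineResilience() {
        // Simulate latency 500‑2000 ms and 10 % error rate
        ChaosMonkey.enable()
                .latency(500, 2000)
                .exceptionRate(0.1);
        try {
            // Run 1000‑concurrent load test and capture the observed error rate
            double errorRate = loadTester.run(1000).getErrorRate();
            // Verify error rate < 5 %
            assertTrue("Error rate should stay below 5%", errorRate < 0.05);
        } finally {
            // Always switch fault injection off, even when the assertion fails
            ChaosMonkey.disable();
        }
    }
}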

8.2 Accuracy Verification Matrix

VAT ordinary invoice – 10,000 samples – OCR accuracy 98.7 % – field accuracy 96.2 %

VAT special invoice – 8,500 samples – OCR accuracy 97.5 % – field accuracy 95.8 %

E‑invoice – 12,000 samples – OCR accuracy 99.1 % – field accuracy 97.3 %

Handwritten invoice – 3,000 samples – OCR accuracy 85.2 % – field accuracy 79.6 %

Evolution and Future Directions

9.1 Intelligent Evolution Paths

Self‑learning OCR: continuous model refinement on new invoices

Cross‑chain notarization: hash invoices onto Hyperledger/Ethereum for legal evidence

Intelligent audit: anomaly detection and tax‑risk warnings via AI

9.2 Performance Evolution Goals

Processing speed: current 2.5 s/page → target 0.8 s/page (FPGA acceleration)

Accuracy: current 96 % → target 99.5 % (integrate PaddleOCR)

Concurrency: current 100 pages/s → target 500 pages/s (distributed cluster)

Conclusion

This solution builds a high‑performance OCR invoice pipeline using Tesseract and Spring Boot asynchronous processing, leveraging distributed architecture, GPU acceleration, intelligent extraction, and robust DevOps practices to achieve million‑scale daily throughput with high availability, accuracy, and compliance.
