Deep Dive into an Asynchronous Spring Boot + Tesseract OCR Pipeline for Invoice Recognition

This article presents a complete design and implementation of a high‑throughput, asynchronous OCR pipeline built with Spring Boot and Tesseract, covering distributed architecture, thread‑pool tuning, image‑preprocessing, multi‑engine recognition, data extraction strategies, Kubernetes deployment, security compliance, chaos testing, and future AI‑driven enhancements.

SpringMeng

1. System Architecture Design

The solution adopts a distributed pipeline where invoice files are ingested, pre‑processed, OCR‑processed, and finally extracted into structured data. Core components include a Spring Boot service layer, asynchronous executors, Tesseract OCR, image‑preprocessing, hybrid recognition, and a data‑extraction engine.

2. Spring Boot Asynchronous Framework

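import java.util.concurrent.Executor;
import java.util.concurrent.ThreadPoolExecutor;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.scheduling.annotation.EnableAsync;
import org.springframework.scheduling.concurrent.ThreadPoolTaskExecutor;

@Configuration
@EnableAsync
public class AsyncConfig {
    // Pool for CPU-bound Tesseract recognition.
    @Bean("ocrExecutor")
    public Executor ocrTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(20);
        executor.setMaxPoolSize(50);      // only reached once the queue below is full
        executor.setQueueCapacity(1000);
        executor.setThreadNamePrefix("OCR-Thread-");
        // When pool and queue are saturated, run the task on the submitting thread (back-pressure).
        executor.setRejectedExecutionHandler(new ThreadPoolExecutor.CallerRunsPolicy());
        executor.initialize();
        return executor;
    }

    // Larger pool for I/O-bound work (file storage, conversion, result persistence).
    @Bean("ioExecutor")
    public Executor ioTaskExecutor() {
        ThreadPoolTaskExecutor executor = new ThreadPoolTaskExecutor();
        executor.setCorePoolSize(50);
        executor.setMaxPoolSize(200);
        executor.setQueueCapacity(5000);
        executor.setThreadNamePrefix("IO-Thread-");
        // No handler is set, so the default AbortPolicy rejects tasks once pool and queue are full.
        executor.initialize();
        return executor;
    }
}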
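
How core size, queue capacity and maximum size interact is easy to misread, so here is a minimal sketch using only the JDK's ThreadPoolExecutor (which ThreadPoolTaskExecutor wraps). The class name PoolGrowthDemo and the tiny pool sizes are illustrative assumptions, not part of the original service. It shows that extra threads are created only after the queue is full, and that CallerRunsPolicy then pushes work back onto the submitting thread.

```java
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class PoolGrowthDemo {

    /**
     * Submits {@code tasks} blocking tasks and reports
     * {worker threads created, tasks waiting in the queue, tasks run by the caller}.
     */
    static int[] observe(int core, int max, int queueCapacity, int tasks) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                core, max, 60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.CallerRunsPolicy());
        CountDownLatch release = new CountDownLatch(1);
        AtomicInteger callerRuns = new AtomicInteger();
        Thread submitter = Thread.currentThread();

        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                if (Thread.currentThread() == submitter) {
                    callerRuns.incrementAndGet(); // rejected task, executed by the caller
                    return;
                }
                try {
                    release.await(); // keep the worker busy so the pool state stays stable
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        int[] snapshot = {pool.getPoolSize(), pool.getQueue().size(), callerRuns.get()};
        release.countDown(); // let the workers finish
        pool.shutdown();
        try {
            pool.awaitTermination(5, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return snapshot;
    }

    public static void main(String[] args) {
        // core=2, max=4, queue=3
        System.out.println(Arrays.toString(observe(2, 4, 3, 5)));  // [2, 3, 0]
        System.out.println(Arrays.toString(observe(2, 4, 3, 10))); // [4, 3, 3]
    }
}
```

Applied to the configuration above, this suggests that the OCR pool stays at 20 threads until 1000 tasks are waiting, so the maximum of 50 matters only under sustained overload.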

Service methods are annotated with @Async to run on the dedicated executors:

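@Service
public class InvoiceProcessingService {
    // Takes the raw bytes rather than the MultipartFile, because the upload's
    // temporary storage may be gone by the time this method runs.
    @Async("ioExecutor")
    public CompletableFuture<BufferedImage> preprocessInvoice(byte[] content) { /* type detection, storage, conversion, enhancement */ }

    // Accepts the pre-processed image so that the stages can be chained directly.
    @Async("ocrExecutor")
    public CompletableFuture<OcrResult> performOcr(BufferedImage image) { /* Tesseract init and OCR */ }

    @Async("ioExecutor")
    public CompletableFuture<InvoiceData> extractData(OcrResult ocrResult) { /* regex extraction, ML validation */ }
}

@RestController
@RequestMapping("/invoice")
public class InvoiceController {
    // preprocessService, extractionService, storageService, notificationService and
    // errorService are injected fields (declarations omitted, as in the original).

    @PostMapping("/process")
    public ResponseEntity<ProcessResponse> processInvoice(@RequestParam("file") MultipartFile file) throws IOException {
        String taskId = UUID.randomUUID().toString();
        // Read the upload eagerly, before the request completes.
        byte[] content = file.getBytes();
        // Each @Async method already returns a CompletableFuture, so the stages are
        // chained with thenCompose instead of being wrapped in supplyAsync.
        preprocessService.preprocessInvoice(content)
            .thenCompose(preprocessService::performOcr)
            .thenCompose(extractionService::extractData)
            .thenAccept(data -> {
                storageService.saveResult(taskId, data);
                notificationService.notifyClient(taskId, data);
            })
            .exceptionally(ex -> { errorService.logError(taskId, ex); return null; });
        // Return 202 immediately; the client later correlates the result by taskId.
        return ResponseEntity.accepted().body(new ProcessResponse(taskId, "Processing started"));
    }
}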
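
A detail worth isolating is how stages that already return a CompletableFuture are chained: thenCompose flattens each nested future, whereas wrapping such a call in supplyAsync would yield a future of a future. The sketch below uses plain JDK executors and string-based stand-ins for the three stages. StageChainingDemo and its method names are hypothetical and are not the article's services.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class StageChainingDemo {

    // Hypothetical stand-ins for preprocessInvoice, performOcr and extractData.
    static CompletableFuture<String> preprocess(String file, Executor io) {
        return CompletableFuture.supplyAsync(() -> {
            if (file.isEmpty()) {
                throw new IllegalArgumentException("empty file");
            }
            return "image(" + file + ")";
        }, io);
    }

    static CompletableFuture<String> ocr(String image, Executor cpu) {
        return CompletableFuture.supplyAsync(() -> "text(" + image + ")", cpu);
    }

    static CompletableFuture<String> extract(String text, Executor io) {
        return CompletableFuture.supplyAsync(() -> "data(" + text + ")", io);
    }

    /** Runs the three stages in order and blocks for the result (for demonstration only). */
    static String run(String file) {
        ExecutorService io = Executors.newFixedThreadPool(2);
        ExecutorService cpu = Executors.newFixedThreadPool(2);
        try {
            return preprocess(file, io)
                    .thenCompose(image -> ocr(image, cpu))
                    .thenCompose(text -> extract(text, io))
                    .exceptionally(ex -> {
                        // Failures from earlier stages arrive wrapped in CompletionException.
                        Throwable cause = (ex instanceof CompletionException && ex.getCause() != null)
                                ? ex.getCause() : ex;
                        return "failed: " + cause.getMessage();
                    })
                    .join();
        } finally {
            io.shutdown();
            cpu.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(run("a.pdf")); // data(text(image(a.pdf)))
        System.out.println(run(""));      // failed: empty file
    }
}
```

A failure in any stage skips the remaining stages and lands in the single exceptionally handler, which is the same shape the controller relies on for error logging.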

3. Tesseract Deep Optimization

A custom invoice‑specific training model is built using the standard Tesseract training workflow (box generation, feature training, clustering, combine). Image preprocessing includes grayscale conversion, adaptive thresholding, non‑local means denoising, line enhancement, and deskewing:

public class ImagePreprocessor {
    public BufferedImage preprocess(BufferedImage original) {
        BufferedImage gray = toGrayscale(original);
        BufferedImage binary = adaptiveThreshold(gray);
        BufferedImage denoised = denoise(binary);
        BufferedImage enhanced = enhanceLines(denoised);
        return deskew(enhanced);
    }
    private BufferedImage adaptiveThreshold(BufferedImage gray) { /* OpenCV ADAPTIVE_THRESH_GAUSSIAN_C */ }
    private BufferedImage denoise(BufferedImage img) { /* fastNlMeansDenoising */ }
}
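
To make the first two steps concrete without an OpenCV dependency, here is a minimal sketch of grayscale conversion and a local threshold in plain Java. It is an assumption-laden simplification: it uses a mean-based window rather than the Gaussian-weighted variant named above, and SimplePreprocessor, radius and c are illustrative names, not the article's implementation.

```java
import java.awt.image.BufferedImage;

class SimplePreprocessor {

    /** Converts each pixel to 8-bit luminance using the ITU-R BT.601 weights. */
    static int[][] toGray(BufferedImage img) {
        int w = img.getWidth(), h = img.getHeight();
        int[][] gray = new int[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y);
                int r = (rgb >> 16) & 0xFF, g = (rgb >> 8) & 0xFF, b = rgb & 0xFF;
                gray[y][x] = (int) Math.round(0.299 * r + 0.587 * g + 0.114 * b);
            }
        }
        return gray;
    }

    /**
     * Marks a pixel as ink when it is darker than the mean of its
     * (2 * radius + 1) square neighbourhood minus the constant c.
     */
    static boolean[][] adaptiveThreshold(int[][] gray, int radius, int c) {
        int h = gray.length, w = gray[0].length;
        // Integral image: window sums in constant time.
        long[][] integral = new long[h + 1][w + 1];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                integral[y + 1][x + 1] = gray[y][x]
                        + integral[y][x + 1] + integral[y + 1][x] - integral[y][x];
            }
        }
        boolean[][] ink = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int y0 = Math.max(0, y - radius), y1 = Math.min(h - 1, y + radius);
                int x0 = Math.max(0, x - radius), x1 = Math.min(w - 1, x + radius);
                long sum = integral[y1 + 1][x1 + 1] - integral[y0][x1 + 1]
                        - integral[y1 + 1][x0] + integral[y0][x0];
                long count = (long) (y1 - y0 + 1) * (x1 - x0 + 1);
                // gray < mean - c, rearranged to avoid a division
                ink[y][x] = gray[y][x] * count < sum - c * count;
            }
        }
        return ink;
    }
}
```

Because the threshold follows the local mean, uneven lighting or a shaded scan edge shifts the cut-off with it, which is the property that makes adaptive thresholding attractive for photographed invoices.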

The hybrid OCR service selects the most suitable engine for each region (table, handwriting, or generic text) and joins the per‑region results:

public class HybridOcrService {
    public String recognize(File image) {
        List<Region> regions = segmentRegions(image);
        return regions.stream().map(region -> {
            if (isTableRegion(region)) return tableOcrEngine.recognize(region);
            if (isHandwritingRegion(region)) return handwritingEngine.recognize(region);
            return tesseract.recognize(region);
        }).collect(Collectors.joining("\n"));
    }
}
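
One way to keep the per-region selection open for new engines is a lookup table keyed by region type with a generic fallback. The following is a sketch under that assumption. RegionDispatchDemo, RegionType and the function-based engines are hypothetical stand-ins for the article's segmentation and engine classes, which are not shown.

```java
import java.util.Arrays;
import java.util.EnumMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

class RegionDispatchDemo {

    enum RegionType { TABLE, HANDWRITING, GENERIC }

    /** A segmented part of the page; content stands in for the cropped image. */
    static final class Region {
        final RegionType type;
        final String content;

        Region(RegionType type, String content) {
            this.type = type;
            this.content = content;
        }
    }

    private final Map<RegionType, Function<Region, String>> engines = new EnumMap<>(RegionType.class);
    private final Function<Region, String> fallback;

    /** The fallback plays the role of the generic Tesseract engine. */
    RegionDispatchDemo(Function<Region, String> fallback) {
        this.fallback = fallback;
    }

    void register(RegionType type, Function<Region, String> engine) {
        engines.put(type, engine);
    }

    /** Recognises every region with its registered engine and joins the texts line by line. */
    String recognize(List<Region> regions) {
        return regions.stream()
                .map(region -> engines.getOrDefault(region.type, fallback).apply(region))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        RegionDispatchDemo ocr = new RegionDispatchDemo(r -> "generic:" + r.content);
        ocr.register(RegionType.TABLE, r -> "table:" + r.content);
        System.out.println(ocr.recognize(Arrays.asList(
                new Region(RegionType.TABLE, "line items"),
                new Region(RegionType.HANDWRITING, "signature"))));
    }
}
```

Region types without a registered engine fall through to the generic one, so a handwriting engine can be added later without touching the dispatch logic.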

4. Structured Data Extraction

A pluggable extraction engine runs a sequence of strategies (regex, positional, ML‑based) and stops early when the invoice data is complete:

public class DataExtractionEngine {
    private final List<ExtractionStrategy> strategies = Arrays.asList(
        new RegexStrategy(), new PositionalStrategy(), new MLBasedStrategy());
    public InvoiceData extract(String ocrText) {
        InvoiceData result = new InvoiceData();
        for (ExtractionStrategy s : strategies) {
            s.extract(ocrText, result);
            if (result.isComplete()) break;
        }
        return result;
    }
}

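The regex strategy uses patterns such as "发票号码[:：]\s*(\w{8,12})", which matches the label 发票号码 ("invoice number") followed by an ASCII or full‑width colon and captures the 8 to 12 character number; analogous patterns capture the invoice date and the total amount.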
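
A regex strategy along these lines might look as follows. This is a sketch with several assumptions: the colon character class is taken to accept both the ASCII and the full-width colon, the labels 开票日期 ("issue date") and 价税合计 ("total including tax") are assumed field labels that do not appear in the original, and InvoiceRegexDemo with its map keys is a hypothetical name.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class InvoiceRegexDemo {

    // Invoice number: label, ASCII or full-width colon, then 8 to 12 word characters.
    private static final Pattern NUMBER =
            Pattern.compile("发票号码[:：]\\s*(\\w{8,12})");
    // Issue date in the form 2023年5月7日 (assumed label and layout).
    private static final Pattern DATE =
            Pattern.compile("开票日期[:：]\\s*(\\d{4})年(\\d{1,2})月(\\d{1,2})日");
    // Total: the first decimal number after the label (assumed label).
    private static final Pattern TOTAL =
            Pattern.compile("价税合计\\D*?(\\d+(?:\\.\\d{1,2})?)");

    /** Returns the fields that could be found; missing fields are simply absent. */
    static Map<String, String> extract(String ocrText) {
        Map<String, String> fields = new LinkedHashMap<>();

        Matcher number = NUMBER.matcher(ocrText);
        if (number.find()) {
            fields.put("invoiceNumber", number.group(1));
        }

        Matcher date = DATE.matcher(ocrText);
        if (date.find()) {
            // Normalise to ISO-8601 (yyyy-MM-dd).
            fields.put("issueDate", String.format("%s-%02d-%02d",
                    date.group(1),
                    Integer.parseInt(date.group(2)),
                    Integer.parseInt(date.group(3))));
        }

        Matcher total = TOTAL.matcher(ocrText);
        if (total.find()) {
            fields.put("totalAmount", total.group(1));
        }
        return fields;
    }

    public static void main(String[] args) {
        String text = "发票号码：12345678\n开票日期：2023年5月7日\n价税合计 ¥100.00";
        System.out.println(extract(text));
        // {invoiceNumber=12345678, issueDate=2023-05-07, totalAmount=100.00}
    }
}
```

Fields that the regex pass leaves empty are exactly what the later positional and ML‑based strategies would be expected to fill in before the completeness check passes.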

5. Performance Optimisation

Scaling is achieved with a Kubernetes deployment of ten OCR worker replicas, GPU‑accelerated Tesseract, and caching layers. Example deployment manifest:

# ocr-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ocr-worker
spec:
  replicas: 10
  selector:
    matchLabels:
      app: ocr-worker
  template:
    metadata:
      labels:
        app: ocr-worker
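    spec:
      priorityClassName: gpu-high-priority  # schedule workers with the PriorityClass defined at the end of this manifest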
      containers:
      - name: ocr
        image: ocr-service:3.0
        resources:
          limits:
            nvidia.com/gpu: 1
            memory: 8Gi
          requests:
            memory: 4Gi
        env:
        - name: TESSDATA_PREFIX
          value: /tessdata
        volumeMounts:
        - name: tessdata
          mountPath: /tessdata
      volumes:
      - name: tessdata
        persistentVolumeClaim:
          claimName: tessdata-pvc
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: gpu-high-priority
value: 1000000
globalDefault: false
description: "High‑priority GPU tasks"
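
The caching layers mentioned at the start of this section are not detailed in the article. One plausible form, sketched here as an assumption, is a content-addressed cache that keys recognised text by the SHA‑256 hash of the uploaded bytes, so that a re-submitted invoice skips the OCR engine. OcrResultCache and the function-typed engine are hypothetical; a production setup would more likely use a shared store than an in-process map.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class OcrResultCache {

    private final ConcurrentHashMap<String, String> textByHash = new ConcurrentHashMap<>();
    private final AtomicInteger engineCalls = new AtomicInteger();
    private final Function<byte[], String> engine;

    /** The engine is whatever turns file bytes into text, for example a Tesseract call. */
    OcrResultCache(Function<byte[], String> engine) {
        this.engine = engine;
    }

    /** Returns the cached text for identical content, invoking the engine only on a miss. */
    String recognize(byte[] fileBytes) {
        return textByHash.computeIfAbsent(sha256Hex(fileBytes), hash -> {
            engineCalls.incrementAndGet();
            return engine.apply(fileBytes);
        });
    }

    /** Number of cache misses so far. */
    int engineCalls() {
        return engineCalls.get();
    }

    static String sha256Hex(byte[] data) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256").digest(data);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }
}
```

Note that computeIfAbsent runs the engine while holding a lock on the affected map bin, so for long OCR calls a cache of futures or an external cache may be the better fit.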

6. Production‑Ready Concerns

Security and compliance are addressed with GDPR‑style PII detection, Chinese fiscal regulations support, audit logging, and optional blockchain notarisation of key operations.

7. Testing and Validation

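Chaos engineering validates resilience: latency of 500 to 2000 ms and a 10 % exception rate are injected while a load test with 1000 concurrent requests runs. The test asserts that the observed error rate stays below 5 %, so it passes only if retries or fallbacks absorb at least half of the injected failures: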

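import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class ChaosTest {
    // ChaosMonkey and loadTester are the article's own test helpers; their API is not shown.
    @Test
    public void testOcrPipelineResilience() {
        ChaosMonkey.enable()
            .latency(500, 2000)    // inject 500 to 2000 ms of latency
            .exceptionRate(0.1);   // make 10 % of calls fail
        try {
            // Assumed here to return the observed error rate for 1000 concurrent requests.
            double errorRate = loadTester.run(1000);
            assertTrue("Error rate should stay below 5%", errorRate < 0.05);
        } finally {
            ChaosMonkey.disable(); // restore normal behaviour even if the assertion fails
        }
    }
}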
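
Why an observed error rate below 5 % is achievable under a 10 % injected failure rate can be illustrated with a small simulation. It assumes that failures are independent and that the pipeline retries each request a bounded number of times. Neither assumption is stated in the article, and FaultInjectionDemo is a hypothetical name.

```java
import java.util.Random;

class FaultInjectionDemo {

    /**
     * Simulates {@code requests} calls in which every attempt fails with
     * probability {@code failureRate}; a request counts as an error only
     * when all {@code maxAttempts} attempts fail.
     */
    static double observedErrorRate(int requests, double failureRate, int maxAttempts, long seed) {
        Random random = new Random(seed); // seeded for reproducible runs
        int failed = 0;
        for (int i = 0; i < requests; i++) {
            boolean succeeded = false;
            for (int attempt = 0; attempt < maxAttempts && !succeeded; attempt++) {
                succeeded = random.nextDouble() >= failureRate;
            }
            if (!succeeded) {
                failed++;
            }
        }
        return (double) failed / requests;
    }

    public static void main(String[] args) {
        System.out.println("no retry:   " + observedErrorRate(10_000, 0.1, 1, 42L));
        System.out.println("3 attempts: " + observedErrorRate(10_000, 0.1, 3, 42L));
    }
}
```

With independent failures the expected error rate after n attempts is 0.1 to the power of n, so a single retry (about 1 %) would already satisfy the 5 % assertion, while a pipeline without any retry would hover around 10 % and fail it.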

Accuracy is measured with a verification matrix (shown in the original figures) and a BERT‑based semantic validator that returns true when confidence exceeds 0.8.

8. Future Evolution

Planned enhancements include a self‑learning OCR model that continuously fine‑tunes on newly processed invoices, cross‑chain blockchain anchoring of invoice hashes (Hyperledger/Ethereum), and intelligent audit modules for anomaly detection and tax‑risk alerts.
