
Performance Optimization of Tencent Cloud OCR Service: Reducing Latency and Improving Throughput

Tencent Cloud's OCR team cut average response time from 1.8 seconds to under one second and boosted throughput by over 50% by redesigning the recognition model around self-attention, accelerating inference with the TI-ACC accelerator built on the TNN framework, shrinking RPC payloads, enabling asynchronous logging, and optimizing multi-region deployment and GPU memory utilization.

Tencent Cloud Developer

This article presents a comprehensive case study of how the Tencent Cloud OCR team systematically reduced service latency and increased throughput through a combination of model, inference‑engine, and engineering optimizations.

Background: Although the OCR service already achieved high availability and accuracy, customers reported noticeable latency, especially when requests originated from Beijing and were processed in Guangzhou. The team set a goal to cut the average end‑to‑end response time from 1815 ms to below 1 s.

Challenges: The optimization effort had to address multiple stages (client transmission, cloud API, business‑logic processing, and algorithm engine), a short time window, and cost constraints.

Analysis: Detailed measurements showed that the engine stage contributed the largest share of latency, followed by client transmission and business‑logic processing. The OCR pipeline involved micro‑services communicating via TRPC, heavy image transfer, and complex model inference.

Optimization Measures:

Model optimization: Introduced a TSA (Text-Sequence-Attention) model that replaces BiLSTM-based sequence modeling with multi-head self-attention, cutting the number of sequential operations per layer from O(N) to O(1) and achieving up to a 2× speed-up on long texts.
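The article does not include the TSA model's code, so as a rough illustration only, the NumPy sketch below shows why self-attention removes the sequential bottleneck: every position is computed in parallel from one matrix product, whereas a BiLSTM must unroll O(N) timesteps. (Identity Q/K/V projections are assumed for brevity; `self_attention` is a hypothetical name, not the team's API.)

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, num_heads=4):
    """Scaled dot-product self-attention over a feature sequence.

    x: (seq_len, d_model). All positions are processed in parallel,
    so the number of sequential steps is O(1) in seq_len, unlike a
    BiLSTM, which must step through O(N) timesteps one at a time.
    """
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    out = np.empty_like(x)
    for h in range(num_heads):  # each head attends to a feature subspace
        xh = x[:, h * d_head:(h + 1) * d_head]
        scores = xh @ xh.T / np.sqrt(d_head)   # (N, N) attention scores
        out[:, h * d_head:(h + 1) * d_head] = softmax(scores) @ xh
    return out
```

The (N, N) score matrix means total work is still quadratic in sequence length, but the depth of the computation no longer grows with N, which is what yields the speed-up for long texts on parallel hardware.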

Inference acceleration: Deployed TI-ACC, an inference-acceleration toolkit built on Tencent's TNN framework, to speed up both the detection and recognition models. Benchmarks showed up to a 3× speed-up in FP16 compared with libtorch.
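TI-ACC's internals are not described in the article; the NumPy sketch below only illustrates the FP16 trade-off that underlies such acceleration, namely halved memory traffic per parameter at a small rounding cost (the 1024×1024 matrix is an arbitrary stand-in for one model layer):

```python
import numpy as np

# Hypothetical weight matrix standing in for one layer of the recognition model.
w_fp32 = np.random.default_rng(1).normal(size=(1024, 1024)).astype(np.float32)
w_fp16 = w_fp32.astype(np.float16)

# Half the bytes per parameter means half the memory traffic; on GPUs
# with FP16 tensor cores this is where much of the speed-up comes from.
assert w_fp16.nbytes == w_fp32.nbytes // 2

# The rounding error introduced stays small for typical weight magnitudes.
max_err = np.abs(w_fp32 - w_fp16.astype(np.float32)).max()
```

For weights of unit scale the worst-case FP16 rounding error is on the order of 1e-3, which is why the team could halve precision without the accuracy loss that would force retraining.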

Engineering improvements: Optimized codec logic to reduce RPC payload size, implemented asynchronous log‑splitting to avoid blocking the main request path, and performed near‑site multi‑region deployment to cut client‑side transmission delays.
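The team's asynchronous log-splitting lives inside their TRPC services, but the same pattern can be sketched with Python's standard-library logging queue: the request path only enqueues a record and returns, while a background thread does the slow formatting and I/O (the logger name and `handle_request` are illustrative, not from the article):

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

# Request handlers enqueue records and return immediately; a background
# listener thread drains the queue and performs the (slow) formatting/IO.
log_queue: "queue.Queue" = queue.Queue(-1)
io_handler = logging.StreamHandler()  # stand-in for a rotating file handler
listener = QueueListener(log_queue, io_handler)
listener.start()

logger = logging.getLogger("ocr.service")
logger.addHandler(QueueHandler(log_queue))
logger.setLevel(logging.INFO)

def handle_request(req_id: str) -> None:
    # ... run the OCR pipeline ...
    logger.info("request %s done", req_id)  # non-blocking enqueue

handle_request("req-001")
listener.stop()  # flushes remaining records on shutdown
```

The key property is that a slow or stalled log sink can no longer add latency to the request path, which is exactly the blocking the article says the team eliminated.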

GPU memory management: Decoupled memory usage from concurrency, introduced shared‑memory mechanisms, and balanced model parallelism, raising GPU utilization from ~35 % to ~85 % and increasing TPS by 51.4 %.
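The shared-memory mechanism itself is not shown in the article; as a hedged sketch, decoupling memory from concurrency can be modeled as admitting many request threads while a semaphore bounds how many hold a GPU-memory slot at once, so the memory footprint is set by the slot count rather than by request concurrency (`GPU_SLOTS` and `run_inference` are hypothetical names):

```python
import threading

# Admit many concurrent requests, but bound how many may hold scarce
# GPU memory at once, so the memory footprint no longer scales with
# request concurrency.
GPU_SLOTS = 4  # tuned so the working set fits in GPU memory
gpu_memory_gate = threading.BoundedSemaphore(GPU_SLOTS)

def run_inference(image_id: str) -> str:
    with gpu_memory_gate:  # wait for a memory slot, not a worker thread
        # ... stage tensors into shared GPU buffers, run the model ...
        return f"text-for-{image_id}"

results = []
threads = [
    threading.Thread(target=lambda i=i: results.append(run_inference(f"img-{i}")))
    for i in range(16)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

With the gate in place, raising concurrency fills the fixed set of slots instead of allocating more memory, which is how utilization can climb toward saturation without out-of-memory risk.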

Results: After the full-stack optimization, the average OCR latency dropped from 1815 ms to 824 ms (a 54.6 % reduction), TPS rose by 51.4 %, recall improved by 1.1 %, and accuracy by 2.3 %. Cost savings scaled in proportion to the TPS gain.

Conclusion: The end‑to‑end optimization demonstrates that systematic analysis and targeted improvements across model, inference engine, and service architecture can dramatically enhance AI‑driven cloud services.

Cloud Services, performance optimization, OCR, Inference Acceleration, AI model, Latency Reduction
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
