How Tencent’s Zixiao AI Chip Supercharges Real‑Time Meeting Subtitles
Tencent’s home‑grown Zixiao AI inference chip, paired with the LightRuntime engine, sharply reduces latency and cost for real‑time subtitles in Tencent Meeting. Through hardware‑software co‑optimization and mixed‑precision model tuning, the system handles tens of thousands of concurrent audio streams while meeting sub‑second delay requirements.
Accelerating Tencent’s Self‑Developed Chip Portfolio
Tencent is rapidly advancing its custom silicon, including the video codec chip “Canghai” (mass‑produced for cloud gaming), the high‑performance network chip “Xuanling” (delivering zero CPU usage and 4× performance), and the AI inference chip “Zixiao”.
Zixiao in Real‑Time Subtitles for Tencent Meeting
Zixiao has been mass‑produced and deployed across Tencent’s flagship services. In Tencent Meeting it powers real‑time personalized subtitles, delivering single‑card performance equivalent to four NVIDIA T4 GPUs and cutting the subtitle timeout rate from 0.005% to zero.
Technical Challenges of Real‑Time Subtitles
During peak periods, the subtitle service must handle over 100,000 concurrent streams with end‑to‑end latency under 1 second. Per‑utterance processing must finish within 2 seconds; otherwise the segment is dropped. This level of concurrency stresses CPU, GPU, and network resources.
Optimization Strategies on Zixiao
Instant (Transient) Module Acceleration: Previously run on the CPU, the instant models were fine‑tuned to remove dynamic components, reducing memory usage and moving inference onto Zixiao without accuracy loss.
Steady (Stable) Module Acceleration: The acoustic and rescoring models were ported to Zixiao and scheduled through the custom LightRuntime runtime, maximizing chip utilization.
Thread Framework Optimization: Per‑model batch‑processing threads were replaced with LightRuntime’s group‑batch and scheduling capabilities, eliminating redundant batch threads.
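The group‑batch idea above can be sketched as a single shared scheduler that drains a request queue into batches, instead of one batch thread per model. This is an illustrative sketch only; the class name, batch size, and timeout are assumptions, not LightRuntime’s actual API.

```python
import queue

class GroupBatcher:
    """Hypothetical sketch: one scheduler aggregates single requests
    into batches, replacing per-model batch-processing threads."""

    def __init__(self, max_batch=8, timeout_s=0.005):
        self.max_batch = max_batch
        self.timeout_s = timeout_s
        self.requests = queue.Queue()

    def submit(self, item):
        self.requests.put(item)

    def next_batch(self):
        # Block for the first item, then drain up to max_batch items
        # or until the timeout elapses, whichever comes first.
        batch = [self.requests.get()]
        while len(batch) < self.max_batch:
            try:
                batch.append(self.requests.get(timeout=self.timeout_s))
            except queue.Empty:
                break
        return batch

b = GroupBatcher(max_batch=4)
for i in range(6):
    b.submit(i)
first = b.next_batch()   # drains four items
second = b.next_batch()  # drains the remaining two
```

A real scheduler would also route each batch to the right model session; this sketch only shows the aggregation step.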
Model Micro‑Tuning to Eliminate Padding Effects
Dynamic input shapes caused padding‑induced errors in position embeddings and attention. The fix introduces a real_length input and a binary Mask to isolate valid frames, enabling static‑shape inference while preserving accuracy.
Key steps:
1. Query padded subsample cache rows separately.
2. Query acoustic feature rows using real_length as the start index.
3. Concatenate the two results as the conformer‑block input.
4. Apply a mask after softmax to zero out padded positions.
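Step 4 can be sketched in NumPy as follows. The shapes and the renormalization after masking are assumptions for illustration; the article only specifies that padded positions are zeroed after softmax.

```python
import numpy as np

def attention_with_post_softmax_mask(scores, real_length):
    # scores: (num_queries, padded_len) raw attention logits,
    # computed at a fixed padded length for static-shape inference.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    # Binary mask from real_length: 1 for valid frames, 0 for padding.
    mask = (np.arange(scores.shape[-1]) < real_length).astype(scores.dtype)
    weights = weights * mask
    # Renormalize so the valid positions sum to 1 again (assumed step).
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((1, 6), dtype=np.float32)
w = attention_with_post_softmax_mask(scores, real_length=4)
# Padded positions carry zero weight; valid ones share it equally.
```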
Performance Gains
Moving the instant middle‑frame model to Zixiao reduced 128‑stream latency from ~1200 ms to 10 ms and cut CPU usage dramatically. The first‑frame module’s session pool lowered 400‑stream latency from 249 ms to 30 ms.
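The first‑frame module’s session pool can be pictured as pre‑created inference sessions leased out to request threads, so no request pays session‑creation cost. This is a generic sketch; the pool size and the session factory are illustrative, not the production implementation.

```python
import queue

class SessionPool:
    """Hypothetical sketch of a session pool: pre-create N inference
    sessions and lease them out, instead of creating one per request."""

    def __init__(self, make_session, size=4):
        self.pool = queue.Queue()
        for _ in range(size):
            self.pool.put(make_session())

    def run(self, inputs):
        sess = self.pool.get()      # block until a session is free
        try:
            return sess(inputs)
        finally:
            self.pool.put(sess)     # always return the session

# Toy "session" that doubles its input, standing in for a real model.
pool = SessionPool(lambda: (lambda x: x * 2), size=2)
result = pool.run(21)
```

Because sessions are reused rather than rebuilt, latency under load is bounded by inference time plus queueing, which matches the drop from 249 ms to 30 ms reported above.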
Mixed‑precision inference identified overflow‑prone layers (e.g., MatMul, Mul) and kept them in FP32, achieving a good trade‑off between speed and accuracy.
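A per‑layer precision plan of this kind can be sketched as a simple pass over the graph that pins overflow‑prone op types to FP32 and lowers everything else. The op names and layer names below are illustrative assumptions, not Zixiao’s actual allow‑list.

```python
# Assumed set of overflow-prone op types kept in FP32 (illustrative).
OVERFLOW_PRONE_OPS = {"MatMul", "Mul"}

def assign_precision(graph_ops):
    """graph_ops: list of (layer_name, op_type) pairs.
    Returns a layer-name -> precision plan for mixed-precision inference."""
    plan = {}
    for name, op_type in graph_ops:
        plan[name] = "fp32" if op_type in OVERFLOW_PRONE_OPS else "fp16"
    return plan

ops = [("attn/qk_matmul", "MatMul"),   # hypothetical layer names
       ("attn/scale", "Mul"),
       ("ffn/relu", "Relu"),
       ("ffn/add", "Add")]
plan = assign_precision(ops)
```

In practice such a plan would be derived by profiling activation ranges, then fed to the compiler so only the flagged layers stay in full precision.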
LightRuntime Engine Features
LightRuntime provides:
AutoBatch: Dynamically aggregates single‑request batches to improve throughput (≈20% gain for acoustic models).
AutoPadding: Automatically buckets and pads inputs to optimal tensor shapes, reducing DTU cost.
Multi‑Model Scheduling: Multiple sessions serve different ONNX models, with priority scheduling for high‑importance models.
These capabilities enable Zixiao to handle the full subtitle traffic, covering over 95% of meeting subtitle volume with zero timeout and up to 75% cost savings in extreme scenarios.
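AutoPadding‑style bucketing can be sketched as padding each variable‑length input up to the smallest of a few fixed lengths, so the chip compiles for a handful of static shapes. The bucket sizes here are made up for illustration; the engine’s real buckets are internal.

```python
import numpy as np

# Illustrative bucket lengths (assumed, not LightRuntime's actual set).
BUCKETS = [32, 64, 128, 256]

def pad_to_bucket(frames):
    """Pad a (T, feat_dim) feature matrix to the smallest bucket >= T,
    so inference sees a few fixed shapes instead of arbitrary ones."""
    t = frames.shape[0]
    target = next(b for b in BUCKETS if b >= t)
    padded = np.zeros((target, frames.shape[1]), dtype=frames.dtype)
    padded[:t] = frames
    return padded, t  # keep the real length for masking downstream

x = np.ones((50, 80), dtype=np.float32)
padded, real_length = pad_to_bucket(x)
# 50 frames land in the 64-frame bucket; the tail rows stay zero.
```

Returning the real length alongside the padded tensor is what lets the masking scheme described earlier ignore the padded rows.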
Overall Solution
Zixiao accelerates both the transient and steady decoding pipelines. Integrating LightRuntime required minimal code changes, and the combined system delivers sub‑second latency, high concurrency, and significant cost reductions.
Tencent Tech
Tencent's official tech account. Delivering quality technical content to serve developers.