How Tencent’s Zixiao AI Chip Supercharges Real‑Time Meeting Subtitles
Tencent’s home‑grown Zixiao AI inference chip, paired with the LightRuntime engine, sharply reduces latency and cost for real‑time subtitles in Tencent Meeting. Through hardware‑software co‑optimization and mixed‑precision model tuning, the system handles tens of thousands of concurrent audio streams while meeting sub‑second delay requirements.
Accelerating Tencent’s Self‑Developed Chip Portfolio
Tencent is rapidly advancing its custom silicon, including the video codec chip “Canghai” (mass‑produced for cloud gaming), the high‑performance network chip “Xuanling” (delivering zero CPU usage and 4× performance), and the AI inference chip “Zixiao”.
Zixiao in Real‑Time Subtitles for Tencent Meeting
Zixiao has been mass‑produced and deployed across Tencent’s flagship services. In Tencent Meeting it powers real‑time personalized subtitles, delivering single‑card performance equivalent to four NVIDIA T4 GPUs and cutting the subtitle timeout rate from 0.005% to zero.
Technical Challenges of Real‑Time Subtitles
During peak periods, the subtitle service must handle over 100,000 concurrent streams with end‑to‑end latency under 1 second. Per‑utterance processing must finish within 2 seconds; otherwise the segment is dropped. This level of concurrency stresses CPU, GPU, and network resources.
Optimization Strategies on Zixiao
Instant (Transient) Module Acceleration: Previously run on the CPU, the instant models were fine‑tuned to remove dynamic components, reducing memory usage and moving inference onto Zixiao without accuracy loss.
Steady (Stable) Module Acceleration: The acoustic and rescoring models were ported to Zixiao and scheduled through the custom LightRuntime runtime, maximizing chip utilization.
Thread Framework Optimization: Per‑model batch‑processing threads were replaced with LightRuntime’s group‑batch and scheduling capabilities, eliminating redundant batch threads.
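The group‑batch idea above can be sketched as a single shared scheduler that drains a request queue into batches, instead of one batch thread per model. This is an illustrative sketch only; the class name, batch size, and timeout are assumptions, not LightRuntime’s actual API.

```python
import queue

class GroupBatcher:
    """Hypothetical sketch: one scheduler aggregates single requests
    into batches, replacing per-model batch-processing threads."""

    def __init__(self, max_batch=8, timeout_s=0.005):
        self.max_batch = max_batch
        self.timeout_s = timeout_s
        self.requests = queue.Queue()

    def submit(self, item):
        self.requests.put(item)

    def next_batch(self):
        # Block for the first item, then drain up to max_batch items
        # or until the timeout elapses, whichever comes first.
        batch = [self.requests.get()]
        while len(batch) < self.max_batch:
            try:
                batch.append(self.requests.get(timeout=self.timeout_s))
            except queue.Empty:
                break
        return batch

b = GroupBatcher(max_batch=4)
for i in range(6):
    b.submit(i)
first = b.next_batch()   # drains four items
second = b.next_batch()  # drains the remaining two
```

A real scheduler would also route each batch to the right model session; this sketch only shows the aggregation step.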
Model Micro‑Tuning to Eliminate Padding Effects
Dynamic input shapes caused padding‑induced errors in position embeddings and attention. The fix introduces a real_length input and a binary Mask to isolate valid frames, enabling static‑shape inference while preserving accuracy.
Key steps:
1. Query padded subsample cache rows separately.
2. Query acoustic feature rows using real_length as the start index.
3. Concatenate the two results as the conformer‑block input.
4. Apply a mask after softmax to zero out padded positions.
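Step 4 can be sketched in NumPy as follows. The shapes and the renormalization after masking are assumptions for illustration; the article only specifies that padded positions are zeroed after softmax.

```python
import numpy as np

def attention_with_post_softmax_mask(scores, real_length):
    # scores: (num_queries, padded_len) raw attention logits,
    # computed at a fixed padded length for static-shape inference.
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = e / e.sum(axis=-1, keepdims=True)
    # Binary mask from real_length: 1 for valid frames, 0 for padding.
    mask = (np.arange(scores.shape[-1]) < real_length).astype(scores.dtype)
    weights = weights * mask
    # Renormalize so the valid positions sum to 1 again (assumed step).
    return weights / weights.sum(axis=-1, keepdims=True)

scores = np.zeros((1, 6), dtype=np.float32)
w = attention_with_post_softmax_mask(scores, real_length=4)
# Padded positions carry zero weight; valid ones share it equally.
```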
Performance Gains
Moving the instant middle‑frame model to Zixiao reduced 128‑stream latency from ~1200 ms to 10 ms and cut CPU usage dramatically. The first‑frame module’s session pool lowered 400‑stream latency from 249 ms to 30 ms.
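The first‑frame module’s session pool can be pictured as pre‑created inference sessions leased out to request threads, so no request pays session‑creation cost. This is a generic sketch; the pool size and the session factory are illustrative, not the production implementation.

```python
import queue

class SessionPool:
    """Hypothetical sketch of a session pool: pre-create N inference
    sessions and lease them out, instead of creating one per request."""

    def __init__(self, make_session, size=4):
        self.pool = queue.Queue()
        for _ in range(size):
            self.pool.put(make_session())

    def run(self, inputs):
        sess = self.pool.get()      # block until a session is free
        try:
            return sess(inputs)
        finally:
            self.pool.put(sess)     # always return the session

# Toy "session" that doubles its input, standing in for a real model.
pool = SessionPool(lambda: (lambda x: x * 2), size=2)
result = pool.run(21)
```

Because sessions are reused rather than rebuilt, latency under load is bounded by inference time plus queueing, which matches the drop from 249 ms to 30 ms reported above.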
Mixed‑precision inference identified overflow‑prone layers (e.g., MatMul, Mul) and kept them in FP32, achieving a good trade‑off between speed and accuracy.
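A per‑layer precision plan of this kind can be sketched as a simple pass over the graph that pins overflow‑prone op types to FP32 and lowers everything else. The op names and layer names below are illustrative assumptions, not Zixiao’s actual allow‑list.

```python
# Assumed set of overflow-prone op types kept in FP32 (illustrative).
OVERFLOW_PRONE_OPS = {"MatMul", "Mul"}

def assign_precision(graph_ops):
    """graph_ops: list of (layer_name, op_type) pairs.
    Returns a layer-name -> precision plan for mixed-precision inference."""
    plan = {}
    for name, op_type in graph_ops:
        plan[name] = "fp32" if op_type in OVERFLOW_PRONE_OPS else "fp16"
    return plan

ops = [("attn/qk_matmul", "MatMul"),   # hypothetical layer names
       ("attn/scale", "Mul"),
       ("ffn/relu", "Relu"),
       ("ffn/add", "Add")]
plan = assign_precision(ops)
```

In practice such a plan would be derived by profiling activation ranges, then fed to the compiler so only the flagged layers stay in full precision.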
LightRuntime Engine Features
LightRuntime provides:
AutoBatch: Dynamically aggregates single‑request batches to improve throughput (≈20% gain for acoustic models).
AutoPadding: Automatically buckets and pads inputs to optimal tensor shapes, reducing DTU cost.
Multi‑Model Scheduling: Multiple sessions serve different ONNX models, with priority scheduling for high‑importance models.
These capabilities enable Zixiao to handle the full subtitle traffic, covering over 95% of meeting subtitle volume with zero timeout and up to 75% cost savings in extreme scenarios.
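AutoPadding‑style bucketing can be sketched as padding each variable‑length input up to the smallest of a few fixed lengths, so the chip compiles for a handful of static shapes. The bucket sizes here are made up for illustration; the engine’s real buckets are internal.

```python
import numpy as np

# Illustrative bucket lengths (assumed, not LightRuntime's actual set).
BUCKETS = [32, 64, 128, 256]

def pad_to_bucket(frames):
    """Pad a (T, feat_dim) feature matrix to the smallest bucket >= T,
    so inference sees a few fixed shapes instead of arbitrary ones."""
    t = frames.shape[0]
    target = next(b for b in BUCKETS if b >= t)
    padded = np.zeros((target, frames.shape[1]), dtype=frames.dtype)
    padded[:t] = frames
    return padded, t  # keep the real length for masking downstream

x = np.ones((50, 80), dtype=np.float32)
padded, real_length = pad_to_bucket(x)
# 50 frames land in the 64-frame bucket; the tail rows stay zero.
```

Returning the real length alongside the padded tensor is what lets the masking scheme described earlier ignore the padded rows.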
Overall Solution
Zixiao accelerates both the transient and steady decoding pipelines. Integrating LightRuntime required minimal code changes, and the combined system delivers sub‑second latency, high concurrency, and significant cost reductions.
Tencent Tech
Tencent's official tech account. Delivering quality technical content to serve developers.