Comprehensive Evaluation of Long‑Audio Speech‑to‑Text Services from Major Cloud Providers
This article presents a systematic benchmark of six leading cloud speech-recognition platforms (Alibaba Cloud, Tencent Cloud, iFlytek, Baidu Cloud, Huawei Cloud, and Microsoft Azure) on a 22.6-hour, 81-file Mandarin dataset. Results are scored with the CORR metric using the SCTK tool, and the article discusses each provider's workflow, strengths, pitfalls, and cost.
The author began the evaluation after a friend asked for a cloud platform that could transcribe long Mandarin recordings (dialogue, telephone, or solo speech) in a single request and deliver the result through a callback. Six major providers were selected: Alibaba Cloud, Tencent Cloud, iFlytek, Baidu Cloud, Huawei Cloud, and Microsoft Azure.
Preparation involved two parts: gathering the providers' API documentation (links to each service’s technical guide) and assembling a test dataset. The dataset consists of five publicly available Mandarin corpora, all featuring long, conversational audio, totaling 22.6 hours across 81 files (16 kHz for most, 8 kHz for telephone recordings).
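Because the dataset mixes 16 kHz and 8 kHz recordings, it can help to read each file's sample rate before submission so that the matching provider parameter is chosen. The following is a minimal sketch that assumes the audio is stored as WAV files (the article does not state the container format); the helper names `sample_rate_hz` and `group_by_sample_rate` are hypothetical.

```python
import wave
from pathlib import Path


def sample_rate_hz(path):
    """Return the sample rate stored in a WAV file's header."""
    # Assumption: the file is an uncompressed PCM WAV readable by the stdlib.
    with wave.open(str(path), "rb") as wav:
        return wav.getframerate()


def group_by_sample_rate(paths):
    """Map each sample rate to the names of the files recorded at that rate."""
    groups = {}
    for path in paths:
        groups.setdefault(sample_rate_hz(path), []).append(Path(path).name)
    return groups
```

Grouping the files this way makes it easy to see which ones belong to the 8 kHz telephone subset before any request is sent.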
Results are scored by character-level correctness (CORR), the share of reference characters that are recognized correctly, rather than by word or character error rate (WER/CER), because long audio introduces silence, noise, and ambiguous boundaries that make the traditional error rates less reliable. Scoring is performed with the SCTK (sclite) tool; the source code was recompiled with the MAXSTRING limit raised from 10 000 to 1 000 000 characters to handle long transcripts.
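As a rough illustration of how a character-level CORR figure is derived, the sketch below aligns a reference and a hypothesis string with dynamic programming and divides the number of matched characters by the reference length. It is not the SCTK implementation: the function names are hypothetical, and the alignment weights (4 for a substitution, 3 for an insertion or deletion) are an assumption chosen to resemble sclite's usual defaults. The quadratic loop is only practical for short strings; full-length transcripts should still be scored with sclite.

```python
def align_counts(ref, hyp, sub_cost=4, ins_cost=3, del_cost=3):
    """Align two character sequences; return (hits, subs, dels, ins)."""
    ref, hyp = list(ref), list(hyp)
    # Each cell holds (cost, hits, subs, dels, ins) for the cheapest
    # alignment of a reference prefix with a hypothesis prefix.
    prev = [(j * ins_cost, 0, 0, 0, j) for j in range(len(hyp) + 1)]
    for i in range(1, len(ref) + 1):
        cur = [(i * del_cost, 0, 0, i, 0)]
        for j in range(1, len(hyp) + 1):
            c, h, s, d, n = prev[j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (c, h + 1, s, d, n)          # correct character
            else:
                diag = (c + sub_cost, h, s + 1, d, n)  # substitution
            c, h, s, d, n = prev[j]
            up = (c + del_cost, h, s, d + 1, n)      # reference char missing
            c, h, s, d, n = cur[j - 1]
            left = (c + ins_cost, h, s, d, n + 1)    # extra hypothesis char
            cur.append(min(diag, up, left, key=lambda cell: cell[0]))
        prev = cur
    _, hits, subs, dels, ins = prev[-1]
    return hits, subs, dels, ins


def corr(ref, hyp):
    """Fraction of reference characters that were recognized correctly."""
    if not ref:
        raise ValueError("reference text must not be empty")
    hits, _, _, _ = align_counts(ref, hyp)
    return hits / len(ref)
```

Note that insertions do not lower CORR, since only hits and the reference length enter the ratio. That property is one reason the metric is more forgiving than WER or CER when a provider emits extra text around silence or noise.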
Each provider’s API was called asynchronously (submit + query) via Python scripts. The author merged submission and polling into a single script, removed punctuation from results to match reference texts, and recorded costs. Notable observations include:
Alibaba Cloud offered the smoothest experience, with fast responses, minimal errors, and a total cost of about ¥58 for the full test.
Tencent Cloud provides an online API Explorer and 10 hours of free usage; the test cost roughly ¥50.60 at an average rate of ¥4 per hour.
iFlytek requires pre-purchased packages (20 hours for ¥168) and does not distinguish between 8 kHz and 16 kHz audio, which simplifies usage.
Baidu Cloud’s billing is transparent, but a missed sampling‑rate parameter caused a major scoring error on the 8 kHz subset.
Huawei Cloud also uses an API Explorer but ties billing to the chosen endpoint region, leading to unexpected charges.
Microsoft Azure offers a single “speech‑to‑text” endpoint, well‑documented but lacking demo code, and its performance was affected by network latency.
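The submit-then-poll workflow and the punctuation removal described above might look roughly like the following sketch. It is provider-agnostic and does not reflect any vendor's real SDK: `submit` and `query` are hypothetical callables that the caller would wrap around a specific provider's API, and the status strings `"done"` and `"failed"` are assumptions made for illustration.

```python
import time
import unicodedata


def strip_punctuation(text):
    """Remove every character whose Unicode category is punctuation."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("P")
    )


def transcribe(submit, query, audio_path, poll_interval=5.0, timeout=3600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Submit one audio file, poll until it finishes, return cleaned text.

    submit(audio_path) -> task_id          (hypothetical provider wrapper)
    query(task_id) -> (status, text)       (hypothetical provider wrapper)
    """
    task_id = submit(audio_path)
    deadline = clock() + timeout
    while True:
        status, text = query(task_id)
        if status == "done":
            return strip_punctuation(text)
        if status == "failed":
            raise RuntimeError(f"transcription task {task_id} failed")
        if clock() >= deadline:
            raise TimeoutError(f"transcription task {task_id} timed out")
        sleep(poll_interval)
```

Passing `sleep` and `clock` as parameters keeps the loop easy to exercise with fake providers, which is useful when six different back ends share one driver script.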
Overall scoring shows Alibaba Cloud achieving the highest CORR across all five datasets, while Microsoft, Tencent, and iFlytek form a second tier, and Baidu and Huawei fall into a third tier.
All scripts, tools, and dataset links are provided in a public GitHub repository for reproducibility.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.