How Huang Xuedong’s Team Achieved Human-Level Speech Recognition at Microsoft

This article chronicles the career of Chinese AI pioneer Huang Xuedong: his education, his rise at Microsoft, his leadership of Azure AI, his team's human-level speech recognition results, the engineering behind them (including a ten-network model and the CNTK framework), and his recent move to Zoom.

21CTO

Jun 10, 2023

Chinese AI scientist Huang Xuedong recently announced on Twitter that he is leaving Microsoft to join Zoom as CTO.

Early Life and Education

Huang was born in 1962 in Hunan and never attended high school. In 1978, at age sixteen, he entered Hunan University's teacher class to study electronic engineering, the youngest student in his cohort.

After graduating in 1982, he went on to a master's degree in computer science at Tsinghua University, mentored by AI pioneers Chang Jiong and Fang Ditang, and stayed at Tsinghua to begin a PhD. In 1987 he moved to the University of Edinburgh through a joint PhD program. He holds a doctorate, a master's, and a bachelor's degree from Edinburgh, Tsinghua, and Hunan University respectively.

He later became honorary dean and professor of the Software College at his alma mater, Hunan University.

Career at Microsoft

Huang worked at Carnegie Mellon’s School of Computer Science before leading global Microsoft AI teams across the United States, Germany, Egypt, and Israel, developing enterprise AI chatbot solutions and cognitive services such as cris.ai, luis.ai, and the open‑source deep‑learning toolkit CNTK.

In 2016 Wired named him one of the 25 visionaries shaping the future of business. In February 2017 he was elected a Microsoft Technical Fellow, the highest technical honor at Microsoft, making him the first Chinese to reach that level.

He launched Azure Speech in 2015, extending AI from deep‑learning infrastructure to product experiences and achieving several historic AI milestones.

His contributions include over 170 patents, numerous papers, the 1992 Allen Newell Research Excellence Award, the 1993 IEEE Speech Processing Best Paper Award, IEEE Fellow (2000), and ACM Fellow (2017). In 2021 his Azure AI team won the InfoWorld Technology of the Year award.

Human‑Level Speech Recognition Breakthrough

On 14 September 2016, Huang's Microsoft speech team reported a word error rate (WER) of 6.3 % on the Switchboard benchmark, a record at the time. About a month later, on 18 October 2016, they reduced WER to 5.9 %, matching professional stenographers and surpassing most humans. Huang described the milestone as “historical”: the first time a computer could recognize the words in a conversation as well as a human.
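For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words. The sketch below shows that standard definition; the function name and sample sentences are illustrative, and it is not the scoring tool used for the Switchboard evaluation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

By this measure, going from 6.3 % to 5.9 % is an absolute drop of 0.4 points, or roughly a 6 % relative reduction in errors.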

He credited the achievement to the collective effort of the speech team, the leadership of Microsoft executive vice president Harry Shum (Shen Xiangyang), and decades of AI research.

Although the breakthrough was demonstrated on English, Huang argued that Chinese speech recognition is easier because Chinese has only about four hundred syllables. In internal tests, recognition accuracy for Chinese was higher than for English, and accuracy for Italian, Spanish, and Chinese was higher than for French.

Following Microsoft's announcement, Andrew Ng, then Baidu's chief scientist, tweeted that Baidu had already surpassed human performance on Chinese speech in 2015, prompting Huang to clarify the difference between phrase-level and dialog-level error rates.

Engineering Behind the Breakthrough

The system relied on ten distinct neural networks. Six networks—including residual networks, LSTM, and variants of CNN—ran in parallel, each trained on up to 2 000 hours of data and containing more than 20 000 senones. The outputs of these six were then combined by four additional networks to produce the final result.
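The article does not describe how the four combining networks work, so the following is only a generic illustration of merging the outputs of parallel models. It uses a simple log-linear (weighted geometric) average of per-class probabilities for a single frame, a swapped-in textbook technique rather than the team's method. The function name, weights, and toy probabilities are all assumptions.

```python
import math


def combine_posteriors(model_posteriors, weights=None):
    """Log-linear combination of class probabilities from several models.

    model_posteriors: list of probability lists, one list per model,
    all covering the same classes (hypothetical stand-ins for senones).
    weights: optional per-model weights; defaults to a uniform average.
    """
    n_models = len(model_posteriors)
    if weights is None:
        weights = [1.0 / n_models] * n_models
    n_classes = len(model_posteriors[0])

    # Weighted sum of log-probabilities per class (floor avoids log(0)).
    log_scores = [
        sum(w * math.log(max(p[c], 1e-12))
            for w, p in zip(weights, model_posteriors))
        for c in range(n_classes)
    ]

    # Renormalize with the log-sum-exp trick for numerical stability.
    peak = max(log_scores)
    exps = [math.exp(s - peak) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Two toy "models" scoring the same three classes.
    combined = combine_posteriors([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]])
    print(combined)
```

A fixed weighted average is the simplest possible combiner; a learned combiner, as the article describes, can adapt how much each model is trusted.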

Among the CNN variants were a deeper VGG network adapted for speech, a residual network imported from Microsoft Research Asia’s ImageNet work, and LACE, a TDNN variant that aggregates weighted nonlinear transformations across layers.

The LSTM layers used 512 hidden units per direction, and the team found that making the LSTMs deeper did not improve WER. Training was supervised, using not only the Switchboard corpus but also extensive internal data, and the resulting models power products such as XiaoIce, Cortana, and the Custom Speech Service.
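To give a sense of scale for "512 hidden units per direction", the sketch below counts the trainable parameters of one bidirectional LSTM layer under the textbook formulation: four gates, one bias vector per gate, no peepholes or projection layer. The input dimension of 40 is a hypothetical value chosen for illustration; the article does not state the feature size or the exact LSTM variant.

```python
def lstm_layer_params(input_dim: int, hidden_units: int) -> int:
    """Parameters of one unidirectional LSTM layer (textbook form).

    Each of the 4 gates has an input matrix (hidden x input),
    a recurrent matrix (hidden x hidden), and a bias vector (hidden).
    """
    per_gate = hidden_units * input_dim + hidden_units * hidden_units + hidden_units
    return 4 * per_gate


def bilstm_layer_params(input_dim: int, hidden_units: int) -> int:
    """A bidirectional layer runs two independent LSTMs, one per direction."""
    return 2 * lstm_layer_params(input_dim, hidden_units)


if __name__ == "__main__":
    # input_dim=40 is an assumed feature size, not a figure from the article.
    print(bilstm_layer_params(input_dim=40, hidden_units=512))
```

The recurrent matrices dominate the count: the hidden-to-hidden term grows with the square of the hidden size, which is one reason width is tuned carefully.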

CNTK Framework

The team developed CNTK, an open-source deep-learning toolkit that reportedly trains LSTMs 5-10× faster than other mainstream frameworks. CNTK supports flexible model definition, scales across multiple GPUs and servers, and can be programmed with Microsoft's BrainScript or Python.

Although TensorFlow, Caffe, MXNet, and Torch are more widely known, CNTK was released earlier than TensorFlow and targets large-scale AI training tasks. It is reported to outperform those frameworks on small CNNs and LSTM-based RNNs, especially in multi-GPU environments.

Establishing Microsoft China Research Institute

Huang joined Microsoft in 1993 and, around 1997, helped plan and staff the establishment of the Microsoft China Research Institute, recruiting Kai-Fu Lee (Li Kaifu) as its first director and selecting Beijing as the location.

Conclusion

After three decades at Microsoft, Huang Xuedong, affectionately called “old boy” by colleagues, continues to pursue his passion for technology as he begins a new chapter at Zoom.
