Tagged articles

Speech LLM

3 articles

May 29, 2026 · Artificial Intelligence

From Direct Transcription to Reasoning ASR and Parallel Decoding: CoT‑ASR vs Whisfusion

ASR is shifting from direct verbatim transcription toward two new paradigms: Chain‑of‑Thought reasoning (CoT‑ASR), which cuts WER and entity error rates, and diffusion‑based parallel decoding (Whisfusion), which cuts latency by a factor of more than eight. The two offer complementary routes to smarter, faster speech recognition.

ASR · Chain-of-Thought · CoT-ASR

0 likes · 12 min read

Machine Heart

May 27, 2026 · Artificial Intelligence

The Next Breakthrough for Speech LLMs: Turning Your Voice Model into a Prosody‑Aware Text Model

This article analyzes the CUHK paper that proposes TextPro‑SLM, a prosody‑aware text LLM architecture that reduces the speech‑text modality gap to as low as 0.7% using only about 1,000 hours of audio data, outperforming larger commercial models on semantic and prosody tasks.

Multimodal · Speech LLM · modality-gap

0 likes · 10 min read

Weekly Large Model Application

Mar 13, 2026 · Artificial Intelligence

Speech Large Models: Why End-to-End Architecture Beats Traditional ASR‑LLM‑TTS Pipelines

The article defines true speech large models as native end‑to‑end systems that map audio directly to audio, and compares them with traditional cascaded ASR‑LLM‑TTS pipelines across architecture, error control, latency, paralinguistic perception, long‑context handling, and deployment. It also surveys the leading open‑source and commercial speech LLMs released in March 2026 and includes a quick selection guide.

AI · ASR · End-to-End

0 likes · 11 min read
